Gender Classification for Social Media Author Profiling


In this post, I am going to talk about gender classification models in the context of social media author profiling. I’ve been working in this field for more than a year now as part of my master’s thesis. I would like to thank my professors for their help during this time and for introducing me to this fascinating topic.


The first thing we need to know when talking about automatic profiling is that it’s a research area that has been gaining relevance over the last few years. This field focuses on inferring socio-demographic information about the users of an application or software service. It has applications in many sectors, such as security, marketing, e-commerce, and fake profile identification. Recent tasks have used written texts as a source of information for building the demographic profile. This suggests that the language used in social network posts is a relevant demographic indicator that can hint at a user’s gender, age, or origin through psycho-linguistic and semantic analysis.


Automatic profiling software is commonly used in marketing analysis, forensic analysis, and the early detection of risks such as cyberbullying and mental disorders. The main goal is to know the users and the potential market, but also to help users who may suffer from depression, anorexia, cyberbullying, or gambling addiction.


There are many different approaches based on the primary source of information. In our case, we have focused on inferring the gender from unstructured data, represented by the users’ social media posts.


When we were building our model, we needed to establish a workflow of experimentation. The one we followed is represented in Figure 1, and it’s quite a common workflow for this kind of task.

Figure 1. Workflow of experimentation. Published in Piot-Perez-Abadin, P., Martin-Rodilla, P., Parapar, J.: Experimental analysis of the relevance of features and effects on gender classification models for social media author profiling. In: Proceedings of the 16th International Conference on Evaluation of Novel Approaches to Software Engineering – ENASE, pp. 103–113. INSTICC, SciTePress (2021).


There are two main processes: the training and validation phases. Both consist of a preprocessing step, where we convert the raw data into a data frame, and a feature engineering stage, where we obtain the features from the corpus. In the training process, we split the dataset into train and test subsets and apply cross-validation; the output is the resulting classification model. The validation phase uses the classification models to predict unseen data and gives us their accuracy.
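
To make the cross-validation step more concrete, here is a minimal sketch in plain Python. It is not the code used in the study: the helper names (`k_fold_indices`, `cross_validate`), the majority-label stand-in model, and the toy data are all assumptions made for illustration.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Split range(n_samples) into k shuffled, near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def cross_validate(X, y, fit, predict, k=5):
    """Average held-out accuracy over k folds."""
    scores = []
    for held_out in k_fold_indices(len(X), k):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        # Train on k-1 folds, then score on the fold that was left out.
        model = fit([X[i] for i in train], [y[i] for i in train])
        preds = predict(model, [X[i] for i in held_out])
        scores.append(accuracy([y[i] for i in held_out], preds))
    return sum(scores) / k

# Hypothetical stand-in model: always predict the majority training label.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)

def predict_majority(model, X_test):
    return [model] * len(X_test)

# Toy data (assumed): 10 users, one dummy feature each.
X = [[i] for i in range(10)]
y = ["female"] * 7 + ["male"] * 3
mean_accuracy = cross_validate(X, y, fit_majority, predict_majority, k=5)
print(f"mean cross-validation accuracy: {mean_accuracy:.2f}")
```

Any real classifier can be plugged in through the `fit` and `predict` arguments; the majority-label model is only there to keep the sketch self-contained.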


One crucial thing when working on classification models is the datasets. In our case, we had labeled datasets taken from PAN initiatives. This allowed us to train our model knowing the prediction category, so we could apply a supervised learning approach. In total, we had 16541 users for training purposes and 1320 users for the validation phase. The validation dataset we used allowed us to compare our results with a clear baseline, as it was the dataset used in the PAN Author Profiling 2019 task.


In the next paragraphs, I am going to explain the different stages of this process.


The first thing we needed to do was preprocess our data. Data preprocessing is a technique that involves transforming raw data into an understandable format. It includes data cleaning, data integration, data transformation, and data reduction. We therefore transformed the different documents into a homogeneous set in CSV format. This file consists of an id column, a text column containing all of a user’s tweets, and a gender column, which represents what we want to predict. We decided not to carry out any additional preprocessing steps, as they might cause a loss of potentially relevant information. So we did not apply stemming, and we did not remove stop words or special characters.
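
As an illustration of this homogeneous format, the sketch below writes a few users to CSV with the three columns described above. The input structure (`raw_users`) and the function name `users_to_csv` are assumptions; the original PAN files have their own layout, which is not reproduced here.

```python
import csv
import io

# Hypothetical raw input: one record per user with a list of tweets.
raw_users = [
    {"id": "u1", "tweets": ["Hello world!", "Loving this :)"], "gender": "female"},
    {"id": "u2", "tweets": ["The match was great"], "gender": "male"},
]

def users_to_csv(users, handle):
    """Write one row per user with id, text, and gender columns."""
    writer = csv.DictWriter(handle, fieldnames=["id", "text", "gender"])
    writer.writeheader()
    for user in users:
        writer.writerow({
            "id": user["id"],
            # Tweets are joined as-is: no stemming, no stop-word or
            # special-character removal.
            "text": " ".join(user["tweets"]),
            "gender": user["gender"],
        })

# An in-memory buffer stands in for a file on disk.
buffer = io.StringIO()
users_to_csv(raw_users, buffer)

# Read the CSV back to check the round trip.
rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))
print(rows[0]["text"])
```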


After that, a feature engineering stage took place. The main idea behind feature engineering is to use domain knowledge to obtain features from the corpus. We then used these features to find characteristic patterns that distinguish authors. We grouped our features based on the intrinsic nature of the information involved, and ended up with sociolinguistic features, sentiment analysis features, and topic modeling features.

  • Sociolinguistics is the study of the effect of any aspect of society on the way language is used. How words are used can both reflect and reinforce social attitudes toward gender. This approach helps us find a common generalized lexicon shared by males and another one shared by females, or infer how the use of grammatical or discursive structures varies by gender. Some features we’ve included in our experiments refer to emojis, punctuation marks, repeated letters, cosine similarity, readability, self-referentiality, and part-of-speech tags, among many others.
  • Sentiment analysis is a field that analyzes people’s opinions, sentiments, emotions, and so on, from written language. It is usually used to determine whether a piece of writing is positive, neutral, or negative, and helps us understand the author’s experiences.
  • Latent Dirichlet Allocation (LDA) is a topic model used to assign the text in a document to particular topics. It builds a topics-per-document model and a words-per-topic model. It is commonly used for automatically extracting and finding hidden patterns in a corpus. We’ve extracted the 20 most significant topics, each defined by 20 words.
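
To give a feel for what a few sociolinguistic features might look like in code, here is a small sketch using only regular expressions. The exact feature definitions used in the study are not given in this post, so the word list `FIRST_PERSON`, the feature names, and the thresholds (two or more exclamation marks, three or more repeated letters) are assumptions.

```python
import re

# Assumed word list for self-referentiality (first-person singular forms).
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}

def sociolinguistic_features(text):
    """Compute a handful of illustrative per-text features."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    n_words = max(len(words), 1)  # avoid division by zero on empty text
    return {
        # Runs of two or more consecutive exclamation marks.
        "exclamation_runs": len(re.findall(r"!{2,}", text)),
        # Letters repeated three or more times in a row, as in "looove".
        "repeated_letters": len(re.findall(r"([A-Za-z])\1{2,}", text)),
        # Share of words that refer to the author.
        "self_referentiality": sum(w in FIRST_PERSON for w in words) / n_words,
        # Mean word length, a rough proxy for "longer words".
        "avg_word_length": sum(len(w) for w in words) / n_words,
    }

features = sociolinguistic_features("I looove my new phone!!! So happy")
print(features)
```

Each user’s text would produce one such dictionary, and the dictionaries together form the feature table passed to the classifier.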

Regarding the classification algorithms, since this is a binary classification task, we tried several approaches using the following algorithms, performing hyper-parameter tuning for each.

  • We’ve trained a Random Forest algorithm, as it learns a non-linear decision boundary and can therefore achieve higher accuracy scores than a linear algorithm.
  • We also thought that Adaptive Boosting would be a nice fit. Its main idea is to train predictors sequentially, each one trying to correct the errors of its predecessor. Numerous PAN competition participants used this algorithm in the bot detection task, so we thought it could be a good choice for the gender task too.
  • Finally, we trained a LightGBM model. It’s a gradient boosting framework that uses tree-based learning algorithms, focuses on the accuracy of results, and is a leading choice in classification competitions.
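
The idea behind Adaptive Boosting, where each predictor tries to correct the errors of its predecessor, can be sketched in a few lines of plain Python. This is a toy version with one-feature threshold rules (decision stumps), not the implementation used in the study; the function names, the single numeric feature, and the labels +1 and -1 standing in for the two classes are assumptions.

```python
import math

def train_stump(X, y, weights):
    """Find the single-feature threshold rule with the lowest weighted error."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for sign in (1, -1):
                preds = [sign if row[feature] > threshold else -sign for row in X]
                error = sum(w for w, p, t in zip(weights, preds, y) if p != t)
                if best is None or error < best[0]:
                    best = (error, feature, threshold, sign)
    return best

def stump_predict(stump, row):
    _, feature, threshold, sign = stump
    return sign if row[feature] > threshold else -sign

def adaboost_fit(X, y, n_rounds=3):
    """Train n_rounds stumps sequentially, reweighting samples each round."""
    n = len(X)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        stump = train_stump(X, y, weights)
        # Clip the error so the logarithm below stays finite.
        error = min(max(stump[0], 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, stump))
        # Raise the weight of misclassified samples so the next stump
        # focuses on them, then renormalize.
        weights = [w * math.exp(-alpha * t * stump_predict(stump, row))
                   for w, t, row in zip(weights, y, X)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def adaboost_predict(ensemble, row):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Toy data (assumed): no single threshold separates the two classes.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
ensemble = adaboost_fit(X, y, n_rounds=3)
predictions = [adaboost_predict(ensemble, row) for row in X]
print(predictions)
```

On this toy data a single stump always misclassifies at least two of the six points, while the weighted vote of three stumps classifies all of them correctly, which is the effect the sequential reweighting is meant to achieve.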

We ran several combinations of experiments to see which classifier is the most effective for the gender profiling task. We trained the models using the PAN Author train dataset and the PAN Celebrity dataset, and our classification results are based on the accuracy of each model on the PAN Author test dataset. We ran different tests with all the features, with topic information added, with the less important features removed, and so on.


Comparing the results, the best performance was achieved by the LightGBM learning algorithm with LDA topics, keeping all the features and using both the PAN Author task train dataset and the PAN Celebrity task dataset for training. We obtained an accuracy of 0.7735, and this result can be compared with the “PAN Author Profiling 2019” task, as our validation dataset was the same as the one used in that competition.


An interesting aspect of these experiments is that we can see the effect of each feature on the predicted category. Figure 2 and Figure 3 show the feature importance, with the variables ranked in descending order. They also show the impact: the horizontal location indicates whether a given value is associated with a higher or lower prediction. Together, the graphs show how each feature correlates with the prediction.


Figure 2. LightGBM model output for males. Published in Piot-Perez-Abadin P., Martin-Rodilla P., Parapar J. (2022) Gender Classification Models and Feature Impact for Social Media Author Profiling. In: Ali R., Kaindl H., Maciaszek L.A. (eds) Evaluation of Novel Approaches to Software Engineering. ENASE 2021. Communications in Computer and Information Science, vol 1556. Springer, Cham.


Figure 3. LightGBM model output for females. Published in Piot-Perez-Abadin P., Martin-Rodilla P., Parapar J. (2022) Gender Classification Models and Feature Impact for Social Media Author Profiling. In: Ali R., Kaindl H., Maciaszek L.A. (eds) Evaluation of Novel Approaches to Software Engineering. ENASE 2021. Communications in Computer and Information Science, vol 1556. Springer, Cham.


For example, if you look at Figure 2, you can see that a high level of “articles” has a high and positive impact on males. The “high” comes from the red color, and the “positive” impact is shown on the X-axis. In other words, for the “article count” feature, the higher the value, the more the model tends to classify the data as male. On the other hand, for the “love” feature, the lower the value, the more the model tends to classify the data as male.


Knowing this, we can see that the model tends to classify a profile as male when it has more articles, tags, emojis in general (and the face tongue and face smiling emojis in particular), adjectives, determiners, and longer words. This confirms previous works in which articles, long words, and face smiling emojis were male indicators. Our work extends this with several new features.


In the case of females, love emojis, face affection, face concern, and monkey emojis, between exclamation marks, particles (‘s, not), interjections (psst, ouch, bravo, hello), nouns, repeated letters, and self-referentiality are traits that indicate the profile may be female. Some previous works had identified sequences of exclamation marks, love emojis, repeated letters, and third-person pronouns as indicative of female authors. Our work reaches the same conclusion and also offers a more extensive list of socio-linguistic features by gender.


I would like to highlight that this study shows some improvements in accuracy compared with some of the results reported on PAN tasks. This suggests that linguistic features play an important role in gender classification for author profiling.


In conclusion, a purely sociolinguistic approach can achieve good results, but combining the features mentioned above with those from other studies can improve accuracy. I would also like to note that this study, or similar ones, could be applied to infer other socio-economic variables.


If you are interested in this research work and want to learn more about this topic, check out these papers and