They used lexical features, and present a very good breakdown of various word types. When using all user tweets, they reached an accuracy of An interesting observation is that there is a clear class of misclassified users who have a majority of opposite-gender users in their social network. When adding more information sources, such as profile fields, they reach an accuracy of These statistics are derived from the users' profile information by way of some heuristics.
For gender, the system checks the profile for about common male and common female first names, as well as for gender-related words, such as father, mother, wife and husband. If no cue is found in a user's profile, no gender is assigned. Another system that predicts the gender for Dutch Twitter users is TweetGenie http: The age component of the system is described in Nguyen et al.
The authors apply logistic and linear regression on counts of token unigrams occurring at least 10 times in their corpus. The conclusion is not so much, however, that humans are also not perfect at guessing age on the basis of language use, but rather that there is a distinction between the biological and the social identity of authors, and language use is more likely to represent the social one cf.
Although we agree with Nguyen et al.

Experimental Data and Evaluation

In this section, we first describe the corpus that we used in our experiments Section 3. Then we outline how we evaluated the various strategies Section 3. From this material, we considered all tweets with a date stamp in and In all, there were about 23 million users present.
This restriction brought the number of users down to about , We then progressed to the selection of individual users. We aimed for users.
We selected of these so that they get a gender assignment in TwiQS, for comparison, but we also wanted to include unmarked users in case these would be different in nature. All users, obviously, should be individuals, and for each the gender should be clear. From the about , users who are assigned a gender by TwiQS, we took a random selection in such a manner that the volume distribution i. We checked gender manually for all selected users, mostly on the basis 3.
As in our own experiment, this measurement is based on Twitter accounts where the user is known to be a human individual. However, as research shows a higher number of female users in all as well Heil and Piskorski , we do not view this as a problem.
From each user's tweets, we removed all retweets, as these did not contain original text by the author. Then, as several of our features were based on tokens, we tokenized all text samples, using our own specialized tokenizer for tweets.
Apart from normal tokens like words, numbers and dates, it is also able to recognize a wide variety of emoticons. The tokenizer is able to identify hashtags and Twitter user names to the extent that these conform to the conventions used in Twitter, i. URLs and addresses are not completely covered. The tokenizer counts on clear markers for these, e.
Assuming that any sequence including periods is likely to be a URL proves unwise, given that spacing between normal words is often irregular. And actually checking the existence of a proposed URL was computationally infeasible for the amount of text we intended to process. Finally, as the use of capitalization and diacritics is quite haphazard in the tweets, the tokenizer strips all words of diacritics and transforms them to lower case. For those techniques where hyperparameters need to be selected, we used a leave-one-out strategy on the test material.
For each test author, we determined the optimal hyperparameter settings with regard to the classification of all other authors in the same part of the corpus, in effect using these as development material. In this way, we derived a classification score for each author without the system having any direct or indirect access to the actual gender of the author.
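The leave-one-out selection described above can be illustrated with a minimal Python sketch. It assumes that the outcome of every hyperparameter setting for every author is already available as a table of booleans, and that settings are ranked by plain accuracy on the other authors; both the `correct` table and the ranking criterion are illustrative assumptions, not the actual control shell.

```python
from typing import Dict, Hashable, List


def leave_one_out_selection(correct: Dict[Hashable, List[bool]]) -> List[bool]:
    """For each author, pick the setting that does best on all OTHER authors.

    correct[setting][i] is True when that (hypothetical) setting classifies
    author i correctly. The result says, per author, whether the setting
    chosen for that author classifies them correctly.
    """
    settings = list(correct)
    n_authors = len(correct[settings[0]])
    totals = {s: sum(correct[s]) for s in settings}
    outcomes = []
    for i in range(n_authors):
        # Score each setting on the other authors only: the overall total
        # minus this author's own outcome. Ties go to the first setting.
        best = max(settings, key=lambda s: totals[s] - correct[s][i])
        outcomes.append(correct[best][i])
    return outcomes


def accuracy(outcomes: List[bool]) -> float:
    """Percentage of authors for whom the selected setting was correct."""
    return 100.0 * sum(outcomes) / len(outcomes)


# Toy table with two hypothetical settings and four authors.
toy = {
    "setting_a": [True, True, False, True],
    "setting_b": [True, False, True, True],
}
toy_outcomes = leave_one_out_selection(toy)
toy_accuracy = accuracy(toy_outcomes)
```

In the toy table each setting is right for three of the four authors, yet the leave-one-out choice is right for only two. This illustrates how the reported accuracy depends on the quality of the selection mechanism as well as on the classifier.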
We then measured for which percentage of the authors in the corpus this score was in agreement with the actual gender. These percentages are presented below in Section

Profiling Strategies

In this section, we describe the strategies that we investigated for the gender recognition task. As we approached the task from a machine learning viewpoint, we needed to select text features to be provided as input to the machine learning systems, as well as machine learning systems which are to use this input for classification.
We first describe the features we used Section 4. Then we explain how we used the three selected machine learning systems to classify the authors Section 4. The use of syntax or even higher level features is for now impossible as the language use on Twitter deviates too much from standard Dutch, and we have no tools to provide reliable analyses.
However, even with purely lexical features, 4. Several errors could be traced back to the fact that the account had moved on to another user since We could have used different dividing strategies, but chose balanced folds in order to give an equal chance to all machine learning techniques, also those that have trouble with unbalanced data.
If, in any application, unbalanced collections are expected, the effects of biases, and corrections for them, will have to be investigated. Most of them rely on the tokenization described above. We will illustrate the options we explored with the Hahaha

Top Function Words: The most frequent function words (see Kestemont for an overview). We used the most frequent, as measured on our tweet collection, of which the example tweet contains the words ik, dat, heeft, op, een, voor, and het.

Then, we used a set of feature types based on token n-grams, with which we already had previous experience (Van Bael and van Halteren). For all feature types, we used only those features which were observed with at least 5 authors in our whole collection (for skip bigrams, 10 authors).

Unigrams: Single tokens, similar to the top function words, but then using all tokens instead of a subset. In the example tweet, we find e.
Bigrams: Two adjacent tokens. In the example tweet, e.
Trigrams: Three adjacent tokens.
Skip bigrams: Two tokens in the tweet, but not adjacent, without any restrictions on the gap size.

Finally, we included feature types based on character n-grams (following Kjell et al.). We used the n-grams with n from 1 to 5, again only when the n-gram was observed with at least 5 authors. However, we used two types of character n-grams. The first set is derived from the tokenizer output, and can be viewed as a kind of normalized character n-grams.
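The feature types introduced so far can be made concrete with a short Python sketch. It is only an illustration under stated assumptions: the tokens are taken as given (the specialized tweet tokenizer is not reproduced), the Dutch example tokens are invented for this sketch, and all function names are hypothetical.

```python
import unicodedata
from collections import defaultdict
from itertools import combinations


def normalize(token):
    """Strip diacritics and lower-case a token, as the tokenizer is said to do."""
    decomposed = unicodedata.normalize("NFD", token)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower()


def token_ngrams(tokens, n):
    """All runs of n adjacent tokens (n=1 unigrams, n=2 bigrams, n=3 trigrams)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skip_bigrams(tokens):
    """Ordered pairs of tokens that are NOT adjacent, with any gap size."""
    positions = combinations(range(len(tokens)), 2)
    return [(tokens[i], tokens[j]) for i, j in positions if j - i > 1]


def char_ngrams(text, n):
    """All character n-grams of a string (raw tweet or rejoined tokens)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def frequent_features(features_per_author, min_authors=5):
    """Keep only features observed with at least min_authors distinct authors."""
    seen = defaultdict(set)
    for author, features in features_per_author.items():
        for feature in features:
            seen[feature].add(author)
    return {f for f, authors in seen.items() if len(authors) >= min_authors}


# Invented example tokens, not taken from the corpus.
tokens = [normalize(t) for t in ["Ik", "héb", "één", "Fiets"]]
bigrams = token_ngrams(tokens, 2)
skips = skip_bigrams(tokens)
```

Applying `char_ngrams` to the raw tweet would correspond to the original character n-grams, and applying it to the rejoined normalized tokens to the normalized ones. How the tokens are rejoined is an assumption that the text does not settle.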
Normalized 1-gram: About features.
Normalized 3-gram: About 36K features.
Normalized 4-gram: About K features.
Normalized 5-gram: About K features.

The second set of character n-grams is derived from the original tweets. This type of character n-gram has the clear advantage of not needing any preprocessing in the form of tokenization.

Original 1-gram: About features.
Original 3-gram: About 77K features.
Original 4-gram: About K features.
Original 5-gram: About K features.

Again, we decided to explore more than one option, but here we preferred more focus and restricted ourselves to three systems. Our primary choice for classification was the use of Support Vector Machines, viz.
We chose Support Vector Regression (ν-SVR, to be exact) with an RBF kernel, as it had shown the best results in several research projects e.
With these main choices, we performed a grid search for well-performing hyperparameters, with the following investigated values: The second classification system was Linguistic Profiling (LP; van Halteren), which was specifically designed for authorship recognition and profiling.
Roughly speaking, it classifies on the basis of noticeable over- and underuse of specific features. Before being used in comparisons, all feature counts were normalized to counts per words, and then transformed to Z-scores with regard to the average and standard deviation within each feature.
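The normalization step for Linguistic Profiling can be sketched in a few lines of Python. The sketch makes several assumptions: the size of the word window is left as a required parameter `per_words` because the value is not given here, the population standard deviation is used, and features with zero spread get a Z-score of 0. It illustrates the transformation only, not the LP implementation.

```python
from statistics import mean, pstdev


def relative_counts(counts, n_words, per_words):
    """Rescale raw feature counts to counts per `per_words` words of text."""
    return {f: c * per_words / n_words for f, c in counts.items()}


def z_scores(values_per_author):
    """Turn {author: {feature: value}} into Z-scores computed within each feature."""
    features = set().union(*(v.keys() for v in values_per_author.values()))
    result = {author: {} for author in values_per_author}
    for f in features:
        # A feature an author never used counts as 0 for that author.
        column = [v.get(f, 0.0) for v in values_per_author.values()]
        mu, sigma = mean(column), pstdev(column)
        for author, v in values_per_author.items():
            # Assumption: a constant feature carries no signal, so score it 0.
            result[author][f] = (v.get(f, 0.0) - mu) / sigma if sigma > 0 else 0.0
    return result


# Hypothetical values for two authors and one feature.
example = z_scores({"author_a": {"x": 1.0}, "author_b": {"x": 3.0}})
```

A strongly positive Z-score then marks noticeable overuse of a feature relative to the whole author population, and a strongly negative one marks underuse.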
Here the grid search investigated: As the input features are numerical, we used IB1 with k equal to 5 so that we can derive a confidence value. The only hyperparameters we varied in the grid search are the metric (Numerical and Cosine distance) and the weighting (no weighting, information gain, gain ratio, chi-square, shared variance, and standard deviation).
However, the high dimensionality of our vectors presented us with a problem. For such high numbers of features, it is known that k-nn learning is unlikely to yield useful results (Beyer et al.). This meant that, if we still wanted to use k-nn, we would have to reduce the dimensionality of our feature vectors.
For each system, we provided the first N principal components for various N. In effect, this N is a further hyperparameter, which we varied from 1 to the total number of components usually , as there are authors , using a stepsize of 1 from 1 to 10, and then slowly increasing the stepsize to a maximum of 20 when over Rather than using fixed hyperparameters, we let the control shell choose them automatically in a grid search procedure, based on development data.
When running the underlying systems 7. As scaling is not possible when there are columns with constant values, such columns were removed first. For each setting and author, the systems report both a selected class and a floating point score, which can be used as a confidence score. In order to improve the robustness of the hyperparameter selection, the best three settings were chosen and used for classifying the current author in question.
For LP, this is by design. A model, called profile, is constructed for each individual class, and the system determines for each author to which degree they are similar to the class profile. For SVR, one would expect symmetry, as both classes are modeled simultaneously, and differ merely in the sign of the numeric class identifier.
However, we do observe different behaviour when reversing the signs. For this reason, we did all classification with SVR and LP twice, once building a male model and once a female model. For both models the control shell calculated a final score, starting with the three outputs for the best hyperparameter settings. It normalized these by expressing them as the number of non-model class standard deviations over the threshold, which was set at the class separation value.
The control shell then weighted each score by multiplying it by the class separation value on the development data for the settings in question, and derived the final score by averaging. It then chose the class for which the final score is highest. In this way, we also get two confidence values, viz.

Results

In this section, we will present the overall results of the gender recognition.
We start with the accuracy of the various features and systems Section 5. Then we will focus on the effect of preprocessing the input vectors with PCA Section 5. After this, we examine the classification of individual authors Section 5. For the measurements with PCA, the number of principal components provided to the classification system is learned from the development data.
Below, in Section 5. Starting with the systems, we see that SVR using original vectors consistently outperforms the other two. For only one feature type, character trigrams, LP with PCA manages to reach a higher accuracy than SVR, but the difference is not statistically significant.
For SVR and LP, these are rather varied, but TiMBL's confidence value consists of the proportion of selected class cases among the nearest neighbours, which with k at 5 is practically always 0. The class separation value is a variant of Cohen's d (Cohen). Where Cohen assumes the two distributions have the same standard deviation, we use the sum of the two, practically always different, standard deviations.

Accuracy Percentages for various Feature Types and Techniques.
In fact, for all the token n-grams, it would seem that the further one goes away from the unigrams, the worse the accuracy gets. An explanation for this might be that recognition is mostly on the basis of the content of the tweet, and unigrams represent the content most clearly. Possibly, the other n-grams are just mirroring this quality of the unigrams, with the effectiveness of the mirror depending on how well unigrams are represented in the n-grams.
For the character n-grams, our first observation is that the normalized versions are always better than the original versions. This means that the content of the n-grams is more important than their form. This is in accordance with the hypothesis just suggested for the token n-grams, as normalization too brings the character n-grams closer to token unigrams. The best performing character n-grams (normalized 5-grams) will be most closely linked to the token unigrams, with some token bigrams thrown in, as well as a smidgen of the use of morphological processes.
However, we cannot conclude that what is wiped away by the normalization (use of diacritics, capitals and spacing) holds no information for the gender recognition. To test that, we would have to experiment with new feature types, modeling exactly the difference between the normalized and the original form.
This number was treated as just another hyperparameter to be selected. As a result, the systems' accuracy was partly dependent on the quality of the hyperparameter selection mechanism.
In this section, we want to investigate how strong this dependency may have been.

Recognition accuracy as a function of the number of principal components provided to the systems, using token unigrams.

Figures 1, 2, and 3 show accuracy measurements for the token unigrams, token bigrams, and normalized character 5-grams, for all three systems at various numbers of principal components.
For the unigrams, SVR reaches its peak Interestingly, it is SVR that degrades at higher numbers of principal components, while TiMBL, said to need fewer dimensions, manages to hold on to the recognition quality. LP peaks much earlier However, it does not manage to achieve good results with the principal components that were best for the other two systems.
Furthermore, LP appears to suffer some kind of mathematical breakdown for higher numbers of components. Although LP performs worse than it could on fixed numbers of principal components, its more detailed confidence score allows a better hyperparameter selection, on average selecting around 9 principal components, where TiMBL chooses a wide range of numbers, and generally far lower than is optimal. We expect that the performance with TiMBL can be improved greatly with the development of a better hyperparameter selection mechanism.
For the bigrams (Figure 2), we see much the same picture, although there are differences in the details. SVR now already reaches its peak TiMBL peaks a bit later at with And LP just mirrors its behaviour with unigrams. LP keeps its peak at 10, but now even lower than for the token n-grams However, all systems are in principle able to reach the same quality i. Even with an automatically selected number, LP already profits clearly

Recognition accuracy as a function of the number of principal components provided to the systems, using token bigrams.
And TiMBL is currently underperforming, but might be a challenger to SVR when provided with a better hyperparameter selection mechanism. We will focus on the token n-grams and the normalized character 5-grams.
As for systems, we will involve all five systems in the discussion. However, our starting point will always be SVR with token unigrams, this being the best performing combination. We will only look at the final scores for each combination, and forgo the extra detail of any underlying separate male and female model scores (which we have for SVR and LP; see above). When we look at his tweets, we see a kind of financial blog, which is an exception in the population we have in our corpus.
The exception also leads to more varied classification by the different systems, yielding a wide range of scores. SVR tends to place him clearly in the male area with all the feature types, with unigrams at the extreme with a score of SVR with PCA on the other hand, is less convinced, and even classifies him as female for unigrams 1.
Figure 4 shows that the male population contains some more extreme exponents than the female population. The most obvious male is author , with a resounding Looking at his texts, we indeed see a prototypical young male Twitter user: From this point on in the discussion, we will present female confidence as positive numbers and male as negative.

Recognition accuracy as a function of the number of principal components provided to the systems, using normalized character 5-grams.
All systems have no trouble recognizing him as a male, with the lowest scores around 1 for the top function words. If we look at the rest of the top males (Table 2), we may see more varied topics, but the wide recognizability stays. Unigrams are mostly closely mirrored by the character 5-grams, as could already be suspected from the content of these two feature types.
For the other feature types, we see some variation, but most scores are found near the top of the lists. Feature type Unigram 1: Top Function 4: On the female side, everything is less extreme.
The best recognizable female, author , is not as focused as her male counterpart. There is much more variation in the topics, but most of it is clearly girl talk of the type described in Section 5. In scores, too, we see far more variation. Even the character 5-grams have ranks up to 40 for this top Another interesting group of authors is formed by the misclassified ones. Taking again SVR on unigrams as our starting point, this group contains 11 males and 16 females.
We show the 5 most

Confidence scores for gender assignment with regard to the female and male profiles built by SVR on the basis of token unigrams. The dashed line represents the separation threshold, i. The dotted line represents exactly opposite scores for the two genders.

Top ranking females in SVR on token unigrams, with ranks and scores for SVR with various feature types. Top Function 9:

With one exception (author is recognized as male when using trigrams), all feature types agree on the misclassification.
This may support our hypothesis that all feature types are doing more or less the same. But it might also mean that the gender just influences all feature types to a similar degree. In addition, the recognition is of course also influenced by our particular selection of authors, as we will see shortly.
Apart from the general agreement on the final decision, the feature types vary widely in the scores assigned, but this also allows for both conclusions. The male who is attributed the most female score is author On re-examination, we see a clearly male first name and also profile photo. However, his Twitter network contains mostly female friends. This apparently colours not only the discussion topics, which might be expected, but also the general language use. The unigrams do not judge him to write in an extremely female way, but all other feature types do.
When looking at his tweets, we This has also been remarked by Bamman et al. There is an extreme number of misspellings (even for Twitter), which may possibly confuse the systems' models. The most extreme misclassification is reserved for a female, author This turns out to be Judith Sargentini, a member of the European Parliament, who tweets under the 14 Although clearly female, she is judged as rather strongly male In this case, it would seem that the systems are thrown off by the political texts.
If we search for the word parlement (parliament) in our corpus, which is used 40 times by Sargentini, we find two more female authors (each using it once), as compared to 21 male authors with up to 9 uses. Apparently, in our sample, politics is a male thing.
We did a quick spot check with author , a girl who plays soccer and is therefore also misclassified often; here, the PCA version agrees with and misclassified even stronger than the original unigrams versus. In later research, when we will try to identify the various user types on Twitter, we will certainly have another look at this phenomenon. Are they mostly targeting the content of the tweets, i. In this section, we will attempt to get closer to the answer to this question.
Again, we take the token unigrams as a starting point. However, looking at SVR is not an option here. Because of the way in which SVR does its classification, hyperplane separation in a transformed version of the vector space, it is impossible to determine which features do the most work.
Instead, we will just look at the distribution of the various features over the female and male texts. Figure 5 shows all token unigrams.
The ones used more by women are plotted in green, those used more by men in red. The position in the plot represents the relative number of men and women who used the token at least once somewhere in their tweets.
However, for classification, it is more important how often the token is used by each gender. We represent this quality by the class separation value that we described in Section 4.
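The class separation value referred to here was characterized earlier as a variant of Cohen's d in which the sum of the two class standard deviations replaces the single shared standard deviation. A minimal Python sketch of that reading follows. The use of population standard deviations and of a signed result are assumptions, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def class_separation(scores_a, scores_b):
    """Difference of the class means over the sum of the class standard deviations.

    A positive result means class A scores higher on average than class B.
    """
    spread = pstdev(scores_a) + pstdev(scores_b)
    if spread == 0:
        raise ValueError("both classes are constant; separation is undefined")
    return (mean(scores_a) - mean(scores_b)) / spread


# Hypothetical per-author usage rates of one token for two groups of authors.
separation = class_separation([2.0, 4.0], [0.0, 2.0])
```

Applied per token, with the usage rates of female and male authors as the two samples, a value far from zero indicates a token that separates the genders well. A value near zero indicates a token that both genders use at similar rates.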
As the separation value and the percentages are generally correlated, the bigger tokens are found further away from the diagonal, while the area close to the diagonal contains mostly unimportant and therefore unreadable tokens.
On the female side, we see a representation of the world of the prototypical young female Twitter user. And also some more negative emotions, such as haat (hate) and pijn (pain). Next we see personal care, with nagels (nails), nagellak (nail polish), makeup (makeup), mascara (mascara), and krullen (curls).
Clearly, shopping is also important, as is watching soaps on television (gtst). The age is reconfirmed by the endearingly high presence of mama and papa. As for style, the only real factor is echt (really). The word haar may be the pronoun her, but just as well the noun hair, and in both cases it is actually more related to the

Identity disclosed with permission.
And by TweetGenie as well. An alternative hypothesis was that Sargentini does not write her own tweets, but assigns this task to a male press spokesperson.
The men, on the other hand, seem to be more interested in computers, leading to important content words like software and game, and correspondingly more determiners and prepositions.