In the following sections, we first present some previous work on gender recognition (Section 2).

For each system, we provided the first N principal components for various N. Where Cohen assumes the two distributions have the same standard deviation, we use the sum of the two, practically always different, standard deviations.
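The variant of Cohen's measure mentioned here can be sketched as follows. This is only an illustration under stated assumptions: the function name `effect_size_sum_sd` is hypothetical, the numerator is taken to be the signed difference of the two means as in Cohen's d, and the sample standard deviation is used; the text only specifies that the denominator is the sum of the two standard deviations.

```python
from statistics import mean, stdev


def effect_size_sum_sd(scores_a, scores_b):
    """Difference of means divided by the SUM of the two standard deviations.

    Assumptions (not stated in the text): signed difference of means in the
    numerator, sample standard deviation (n - 1) in the denominator.
    """
    denominator = stdev(scores_a) + stdev(scores_b)
    if denominator == 0:
        raise ValueError("both distributions have zero standard deviation")
    return (mean(scores_a) - mean(scores_b)) / denominator


# Toy values, chosen only to make the arithmetic easy to follow.
group_a = [1, 2, 3]   # mean 2, standard deviation 1
group_b = [4, 6, 8]   # mean 6, standard deviation 2
separation = effect_size_sum_sd(group_a, group_b)   # (2 - 6) / (1 + 2)
```

Because the two standard deviations are added rather than pooled, the value is smaller in magnitude than a classical Cohen's d computed on the same data.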

For the character n-grams, our first observation is that the normalized versions are always better than the original versions.

Although we agree with Nguyen et al. And also some more negative emotions, such as haat ("hate") and pijn ("pain").

Experimental Data and Evaluation

In this section, we first describe the corpus that we used in our experiments (Section 3). Roughly speaking, it classifies on the basis of noticeable over- and underuse of specific features.

For each setting and author, the systems report both a selected class and a floating point score, which can be used as a confidence score. After this, we examine the classification of individual authors (Section 5).

When we look at his tweets, we see a kind of financial blog, which is an exception in the population we have in our corpus. Bigrams: Two adjacent tokens.

For the other feature types, we see some variation, but most scores are found near the top of the lists. For those techniques where hyperparameters need to be selected, we used a leave-one-out strategy on the test material.
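A leave-one-out selection of hyperparameters can be sketched as below. This is a minimal illustration, not the setup used in the experiments: the nearest-neighbour classifier, the one-dimensional toy data, and the names `leave_one_out_accuracy`, `knn_predict` and `select_k` are all assumptions introduced for the example.

```python
def leave_one_out_accuracy(samples, labels, predict):
    """Hold out each sample once, train on the rest, and count correct predictions."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


def knn_predict(train_x, train_y, query, k):
    """Toy stand-in classifier: majority label among the k nearest training points."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - query))
    votes = [label for _, label in ranked[:k]]
    # sorted() keeps tie-breaking deterministic; odd k with two classes avoids ties.
    return max(sorted(set(votes)), key=votes.count)


def select_k(samples, labels, candidates):
    """Return the best-scoring k (first one on ties) and all leave-one-out accuracies."""
    accuracies = {
        k: leave_one_out_accuracy(
            samples, labels, lambda tx, ty, q, k=k: knn_predict(tx, ty, q, k)
        )
        for k in candidates
    }
    best_k = max(candidates, key=lambda k: accuracies[k])
    return best_k, accuracies


# Hypothetical one-dimensional feature values for six authors.
values = [1.0, 1.2, 0.9, 3.0, 3.1, 2.9]
genders = ["F", "F", "F", "M", "M", "M"]
best_k, accuracies = select_k(values, genders, [1, 3, 5])
```

Selecting hyperparameters on the same material that is used for evaluation tends to give optimistic scores, which should be kept in mind when reading the results.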

Gender recognition has also already been applied to Tweets. With these main choices, we performed a grid search for well-performing hyperparameters, with the following investigated values: Most of them rely on the tokenization described above. In the example tweet, e. We first describe the features we used (Section 4).

Top ranking females in SVR on token unigrams, with ranks and scores for SVR with various feature types.

Results

In this section, we will present the overall results of the gender recognition.

We will illustrate the options we explored with the Hahaha We start with the accuracy of the various features and systems (Section 5). Currently the field is getting an impulse for further development now that vast data sets of user generated data are becoming available.

Then we will focus on the effect of preprocessing the input vectors with PCA (Section 5). Apparently, in our sample, politics is a male thing.

Skip bigrams: Two tokens in the tweet, but not adjacent, without any restrictions on the gap size. Unigrams: Single tokens, similar to the top function words, but then using all tokens instead of a subset.
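The token-based feature types (unigrams, bigrams, trigrams and skip bigrams) can be illustrated with a short sketch. The function names and the token list are hypothetical, and the tokens are assumed to be produced beforehand by some tokenizer; this does not reproduce the tokenization used for the tweets.

```python
def token_ngrams(tokens, n):
    """All sequences of n adjacent tokens (n=1 unigrams, n=2 bigrams, n=3 trigrams)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skip_bigrams(tokens):
    """All pairs of tokens in their original order that are NOT adjacent, with any gap size."""
    return [
        (tokens[i], tokens[j])
        for i in range(len(tokens))
        for j in range(i + 2, len(tokens))   # j >= i + 2 excludes adjacent pairs
    ]


# Hypothetical, already tokenized input.
tokens = ["a", "b", "c", "d"]
unigrams = token_ngrams(tokens, 1)
bigrams = token_ngrams(tokens, 2)
trigrams = token_ngrams(tokens, 3)
skips = skip_bigrams(tokens)
```

For a tweet of k tokens, the number of skip bigrams grows quadratically, as (k - 1)(k - 2) / 2, which makes this feature type considerably larger than the adjacent n-gram types.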

The age component of the system is described in Nguyen et al. Recognition accuracy as a function of the number of principal components provided to the systems, using normalized character 5-grams.

And actually checking the existence of a proposed URL was computationally infeasible for the amount of text we intended to process. Trigrams: Three adjacent tokens. In scores, too, we see far more variation.

In this section, we want to investigate how strong this dependency may have been. However, as research shows a higher number of female users in all as well (Heil and Piskorski), we do not view this as a problem.

The use of syntax or even higher level features is for now impossible as the language use on Twitter deviates too much from standard Dutch, and we have no tools to provide reliable analyses.

For this reason, we did all classification with SVR and LP twice, once building a male model and once a female model. Another system that predicts the gender for Dutch Twitter users is TweetGenie http:

