This is the old draft version.

Full PDF version can be found here.

Vocabulary / Terms used

  • Mention – As used in this Bachelor’s thesis, a mention is a Twitter user’s username preceded by “@” used inside a Tweet. (@example)
  • Hashtag - A word or phrase preceded by a hash sign (“#”). Used inside messages in many social networks to identify a specific topic.
  • POS – Part of speech – a category to which a word belongs, usually divided on the basis of their meaning, form or syntactic function, such as (in English) noun, pronoun, verb, adjective etc. 1
  • SVM, RF, …


Basic results

TODO move this to conclusion, here add my basic goals? In this Bachelor’s thesis, I attempted to detect the native language of the authors of tweets written in English. I approached this as a classification problem: first, I collected tweets in English written from 5 different geographical areas, then replaced mentions, hashtags and URIs with tags. After this, using a combination of features such as parts of speech, punctuation, POS n-grams and word n-grams, I trained two classifiers – one Random Forest, and the other a Support Vector Machine. Lastly, I trained a meta-classifier (a deep neural network) that used the predictions of both and delivered a final prediction, improving the results by about 1%. Two different feature sets were used:

  • only parts of speech, punctuation and articles, which yielded 46% accuracy over 5 categories (with a 20% trivial baseline accuracy)
  • parts of speech, punctuation, articles and word tokens, with 59% accuracy over the same categories.

I separated the two because of two different possible goals for such a classification. The first is an attempt to classify based on purely grammatical features that are independent of the content and word choice of the tweet. A 46% accuracy over 5 categories means that people with different language backgrounds do use different parts of speech, punctuation and articles, and that some degree of information about an author may be gathered using these features alone. The second goal was to classify using all available linguistic features. Mentions, hashtags and URIs were hidden (while keeping a tag in their place) because they might allow a way to “cheat” during the classification (it doesn’t take machine learning to guess that someone mentioning @timesofindia is probably located in India). There are a number of social-network-specific features that would have improved these results, but using them would not have been pure NLI. Lastly, using actual words improves results by picking up other cues. If there’s an election in India, someone talking about elections is more likely to be from India, even though this cue would not have increased the accuracy on a dataset collected a month later. And someone mentioning Riyadh (the capital of Saudi Arabia) probably speaks Arabic, always. The first goal was an attempt to see how far I would get by avoiding all such cues completely.

Native Language Identification is a complex task even for humans, and usually it’s done on much bigger texts (at least 100 words per instance, instead of the more challenging mean of 18 in this thesis) and datasets with less noise, so a fair comparison to existing results is not possible.


This topic interested me because I noticed that the way people from different linguistic backgrounds use a foreign language differs, and this difference goes deeper than accent or phonetics. A person’s culture shapes the way they see the world (for example, perceptions of time 2), and some argue that even the native language of a person does, insofar as the two can be separated, even though the latter idea (linguistic determinism) is much more controversial. (An often-quoted example of this is how native Spanish speakers were more likely to describe a bridge as “big”, “long” and “strong”, and German speakers as “beautiful”, “elegant” and “fragile”; a bridge’s grammatical gender is feminine in German and masculine in Spanish3).

Even if languages don’t shape how we see the world, different languages use different paradigms to describe it. For example, in English the difference between “cup” and “glass” is mainly based on material, while in Russian shape plays a bigger role. Another example is how in German a dog “beißt” and a fly “sticht”, while in Russian one single verb (“кусать”) describes both (embeddings in machine learning would be a fascinating way to study how close similar concepts are in different languages). If one learns a foreign language that divides the world into categories differently than their mother tongue, the lack of 1-to-1 correspondence between concepts might create errors. (“Mother tongues”4 is an essay exploring these connections in Sanskrit).

I was always interested in the way different people use language, and one of the things I kept noticing was how people of different linguistic backgrounds use similar words and make similar errors. In the case of Germany and the English language, examples range from ones obviously derived from German (“plaything” (=Spielzeug), “we meet us at”, overuse of punctuation) to subtler ones (in my experience, German people are much more likely to call a noon meal “dinner”, as opposed to an evening one, as would have been natural for me). This Bachelor’s thesis is one way to explore these differences using machine learning and a user-generated dataset.

Native language identification

Native-language identification (NLI, NLID), also known as First language identification, is the task of determining an author’s native language (L1) based only on their writings in a second language (L2). This is usually framed as a supervised classification task where the set of L1 is known. It works under the assumption that an author’s linguistic background will dispose them towards particular language production patterns in their L2. 5

Related fields and topics include cross-linguistic interference (CLI, also “language transfer”), which describes the effects of one’s mother tongue on the acquisition of other languages 6 and is part of second language acquisition (itself part of linguistics).

NLI is a relatively recent but rapidly growing area of research with many applications. The identification of typical usage patterns of L1 speakers can influence the way foreign languages are taught and make it possible to create teaching resources tailored to the native language of the learners. Another practical area where NLI is applied is forensic linguistics and authorship identification (for example, in the case of a ransom note and no known suspects, clues about the writer’s linguistic background might be valuable).

  • NLI supervised task
  • Ensemble learning for NLI
  • NLI on user-generated Reddit data
  • TODO state of the art



Random Forest


Neural networks

Ensemble learning

Ensemble learning is the use of multiple learning algorithms to obtain better results than the use of one algorithm alone. There are multiple types of ensembles. TODOM basic types of ensembles, boosting, bagging, tradeoffs
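As a minimal illustration of the idea (not the thesis’s actual ensemble), here is a soft-voting sketch: two hypothetical base classifiers each output per-class probabilities, and the ensemble averages them to pick a class. All numbers are made up for illustration.

```python
def soft_vote(prob_lists):
    """Average per-class probabilities from several classifiers and
    return the index of the winning class."""
    n_classes = len(prob_lists[0])
    avg = [sum(p[i] for p in prob_lists) / len(prob_lists)
           for i in range(n_classes)]
    return max(range(n_classes), key=lambda i: avg[i])

# Hypothetical outputs of an RF and an SVM for one tweet over 5 classes:
rf_probs = [0.10, 0.40, 0.30, 0.10, 0.10]
svm_probs = [0.05, 0.25, 0.45, 0.15, 0.10]

print(soft_vote([rf_probs, svm_probs]))  # → 2 (index of the winning class)
```

A stacking ensemble, as used in this thesis, goes one step further: instead of averaging, a meta-classifier is trained on the base classifiers’ predictions.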


Bag of words


POS tags





Precision, recall


F-score for multi-class classification (micro, macro)
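As an illustration of the difference between the two averaging schemes, a small pure-Python sketch of per-class precision/recall and macro-/micro-averaged F1 (my own example, not code from the thesis; scikit-learn offers the same via f1_score with average="macro" / average="micro"):

```python
def prf(y_true, y_pred, label):
    """Precision, recall, F1 and raw counts for one class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1, tp, fp, fn

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true))
    return sum(prf(y_true, y_pred, l)[2] for l in labels) / len(labels)

def micro_f1(y_true, y_pred):
    """F1 over globally pooled counts; for single-label multi-class
    classification this equals plain accuracy."""
    labels = sorted(set(y_true))
    tp, fp, fn = (sum(prf(y_true, y_pred, l)[i] for l in labels) for i in (3, 4, 5))
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)
```

Macro averaging gives every class equal weight regardless of its size, which matters for unbalanced datasets; micro averaging weights every instance equally.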


Confusion matrix

TODOM Visualization based on Confusion matrix — scikit-learn 0.21.3 documentation

Construction of training data


The dataset contains geo-tagged tweets from Twitter. Twitter7 is a social network with 126 million active users8, on which users interact with each other with short messages (tweets). Tweets originally were restricted to 140 characters, but in 2017 the limit was doubled to 280 for most languages. Users can optionally specify a location in their tweets, and when searching tweets can be filtered by their location. The location can either be specified as exact coordinates or a Twitter “Place”, which is a polygon, and has additional semantic information connected to it, like a city name910.

Additionally, Twitter automatically identifies the language of tweets, and it’s possible to filter tweets by their location (if they are geo-enabled) and their language at the same time.


  • Tweets are ephemeral in nature, and Twitter is not the place for balanced in-depth analysis. The “here-and-now” of the content may mean that it’s less thought-out and less studied, which may influence the way the language is used. NLI is very often done on essays written in a foreign language for the purpose of language testing, where an effort is made to use correct grammar and punctuation. User-generated social media content may have the advantage of having L1-grammar and language patterns more clearly visible, since less effort and attention is dedicated to them.
  • As opposed to the usual NLI datasets, which may be contextually limited (essays are written on certain topics), user-generated content may be seen as a slice-of-life source of material that is closer to what people think and experience in real life – very different people with different experiences and different lives. While it may be argued that Twitter users are not a representative demographic, they are non-representative in a different way than the other corpora (see 11 for one list of NLI corpora; most are essays written either in the context of tests such as TOEFL or by students).


A user-generated dataset like the one used here is not a “clean” dataset.

Twitter-specific issues

  • For the purposes of this thesis, we assume that the tweet location is a good proxy for the first language of the author, which is not always the case: people writing English tweets from a certain country are not necessarily native speakers of that country’s language. This will impact the results of any classification. One way to overcome this would be to use the last N tweets of a user and check from which country most of them were written, to establish that the user lives there and is not, say, on vacation. Another would be to assume that followers of accounts like @timesofindia are Indian, or to classify based on the tweets of users who mostly tweet in their L1 (as detected by Twitter), using their tweets in English for the dataset.

  • Not all tweets are written by real people — Twitter has a big number of bots and tweets posted automatically. By some estimates, as much as 24% of tweets are created by bots 12, and 9% to 15% of active users are bots 13. Some of them are weather bots, financial aggregation bots, or novelty bots 14 with text that is not representative of real life language used by L1 native speakers.
  • In line with the point above, tweets can be automatically generated. For example, some users choose to configure automatic cross-posting of their activity in other social networks. This creates tweets that are very similar to each other (“%@username% just posted a photo on his Instagram account: %link%”). Twitter used to have an API feature to show and/or filter posts made from other clients, but this option was removed15 as part of a policy aimed at focusing on their official website and clients (as opposed to third party apps)16. This means that such content has to be found and filtered manually, which was attempted but should not be considered an absolutely effective solution. In my dataset, which initially contained 341353 Tweets, 5889 (~1.7%) had identical text.

Issues stemming from the type of data used

  • Tweets (in the general case) can be up to 280 characters long, but in practice are usually much shorter. This means that a tweet may not contain enough information to distinguish the L1 of the author. As mentioned above, NLI is usually done on much longer texts, such as essays. Figure X shows a selection of such tweets; note their length and how little material they contain. As seen in the length distribution of TODO Figure 2, texts around 50 characters were the most common.
Figure X. A selection of tweets.
  • Twitter’s language detection is not perfect, especially with shorter texts. I did not quantify this, but visually about 5-7% of the supposedly English tweets were not in English (Brazilian Portuguese seemed to be the language most often misclassified as English by Twitter).

One way to overcome part of these disadvantages would be to get a number of tweets from the same user, thereby getting a synthetic dataset of longer texts, or getting longer texts of different users of the same category.

Additional remarks

Usually, NLI done on transcriptions of spoken text is much less precise than NLI done on essays or other written text; some sources report as much as a 10% difference 17. There may be various causes for this; one may be that transcribed spoken text contains fewer features such as punctuation, and that people, when speaking, tend to use much simpler words and grammatical constructions 17. While equating tweets to spoken text transcription would be questionable, tweets are usually written much closer to how people actually speak. The format is quite informal, and most people would not spend much effort on correct grammar and punctuation for a tweet (this is more typical for essays, and it’s essays that make up typical NLI datasets, such as the TOEFL11 NLI corpus 18).

Mitigating part of the disadvantages mentioned in this chapter was not attempted, but it would create a much better dataset and should not be hard. It would be an extremely interesting route for further research, because combining most of the ideas suggested would overcome part of the issues of this dataset while retaining most advantages.


The dataset was collected in April and August 2019 in a period of 4-5 days. It contains Tweets from regions delimited by the bounding boxes (GPS coordinates) in the English language.


For communication with the Twitter API, the tweepy (“An easy-to-use Python library for accessing the Twitter API”)19 package has been used. The final script used for collecting the tweets from a Twitter Stream 20 was heavily based on GeoSearch-Tweepy21, a script which contained examples of how to filter Twitter Streams by geolocation data.

Countries and bounding boxes

The countries used in this thesis are India, Saudi Arabia, Brazil, Mexico, and Great Britain. These countries were chosen because of their large number of Twitter users (to make data collection easier) and because they represent languages from different language families.

Brazil and Mexico have different languages that nonetheless belong to the same language family, and if our assumptions and process are valid, they should have the highest mutual mis-classification.

The bounding box for Great Britain also contains Ireland, since they are close linguistically (and fit neatly into one bounding box). In India a lot of languages are spoken, but most belong to two unrelated language families: Indo-European->Indo-Iranian and Dravidian, 78% and 20% of speakers respectively.

Country        Bounding box(es)                                               Language              Language family
India          (67.46, 5.43, 90.71, 27.21)                                    122 major languages   78% Indo-European -> Indo-Iranian, 19.64% Dravidian
Saudi Arabia   (34.77, 12.86, 49.84, 30.19), (48.1, 13.93, 60.25, 24.77)      Arabic                Afro-Asiatic -> Semitic
Brazil         (-58.55, -30.11, -35.26, 2.5), (-67.3, -13.03, -34.38, 1.53)   Portuguese            Indo-European -> Italic
Mexico         (-112.59, 17.98, -85.38, 27.75)                                Spanish (de facto)    Indo-European -> Italic
Great Britain  (-10.68, 50.15, 1.41, 59.69)                                   English               Indo-European -> Germanic
Figure 1. Bounding boxes projected on a Mercator map. Background image © Daniel Strebe, 15 August 2011, CC BY-SA 3.0

First collection results

341353 Tweets were collected in April and August 2019.

Cleanup and dataset preparation

To simplify further analyses, we started with a cleaned-up version of the dataset. The tweets were changed or removed, as described below, and some initial features were added. All of this was done using pandas, a data analysis framework/library for Python.

Original dataset

The csv file with the dataset had the columns “userid,username,location,datetime,lat,long,lang,text”, containing the author’s Twitter user ID, @username, location as provided by the users themselves, the date and time of the tweet in UTC, latitude, longitude, language of the tweet as detected by Twitter, and the text of the tweet.

Removing duplicates, short and possibly automatically generated tweets

First, tweets containing the exact same value in “text” were removed. There were 5889 (~1.7%) such tweets.

Tweets containing substrings indicative of automatic posting (“Just posted”, “has just posted”, “Want to work at”, “I’m at”) were also removed (also about 6%).

A number of tweets did not fit neatly inside bounding boxes. There may be various explanations for this, and they (77619, or ~22%) were also removed from the dataset.

Then, Tweets shorter than 40 characters (measured on the clean column, that is, without mentions, hashtags and URIs; see below) were deleted (78439 of the remaining 194351, or ~40%). 22 Figures 3a and 3b show how the length distribution of the tweets changed by country after this. The mode of char_count in this final dataset ranged from 42 characters for Brazil, Saudi Arabia and the UK to 46 characters for India.
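The cleanup steps above could be sketched with pandas roughly as follows. This is an illustration, not the actual script: the marker strings and the 40-character threshold follow the thesis, but the function is my own, and it measures length on the raw text rather than on the clean column.

```python
import pandas as pd

# Substrings indicative of automatic posting, as listed in the thesis.
AUTO_MARKERS = ["Just posted", "has just posted", "Want to work at", "I'm at"]

def clean_dataset(df, min_chars=40):
    # Drop tweets whose text appears more than once (likely automated).
    df = df.drop_duplicates(subset="text", keep=False)
    # Drop tweets containing any of the auto-posting markers.
    auto = df["text"].str.contains("|".join(AUTO_MARKERS))
    df = df[~auto]
    # Drop very short tweets.
    return df[df["text"].str.len() >= min_chars]
```

The same steps could also be chained differently (e.g. keeping one copy of each duplicate); the thesis removed all tweets with identical text.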

Figure 2. Lengths distribution of final unbalanced dataset

This left a dataset of 104001 tweets, or just 30% of the initial number.

Location and native language (L1)

Then, based on the latitude and longitude data, each tweet was labeled with one of our L1 categories. This was done by checking in which bounding box the tweet’s GPS coordinates were located. The results can be seen in Figure 2.
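The bounding-box check can be sketched as follows, using the coordinates from the table above. The (min long, min lat, max long, max lat) interpretation of the tuples and the function itself are my assumptions, not the thesis’s actual code.

```python
# Bounding boxes per country code, assumed to be
# (min_long, min_lat, max_long, max_lat) tuples as in the table.
BOXES = {
    "in": [(67.46, 5.43, 90.71, 27.21)],
    "sa": [(34.77, 12.86, 49.84, 30.19), (48.1, 13.93, 60.25, 24.77)],
    "br": [(-58.55, -30.11, -35.26, 2.5), (-67.3, -13.03, -34.38, 1.53)],
    "mx": [(-112.59, 17.98, -85.38, 27.75)],
    "uk": [(-10.68, 50.15, 1.41, 59.69)],
}

def label(lat, long):
    """Return the country code whose bounding box contains the point,
    or None if the tweet does not fit any box."""
    for country, boxes in BOXES.items():
        for (x1, y1, x2, y2) in boxes:
            if x1 <= long <= x2 and y1 <= lat <= y2:
                return country
    return None

print(label(-2.57, -59.98))  # the example row from the dataset → 'br'
```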

Balancing the dataset

The dataset was initially unbalanced: as one can see in Figures 2 and 4, the languages (=target classes) were not represented uniformly. Two versions of the dataset were created, a balanced and an unbalanced one, to observe their effects on classification.

The balanced and unbalanced datasets will be referred to as TWB and TWUB from now on. TWUB contained 104001 tweets, TWB 49876.

Co    Unbalanced  Balanced
uk    46773       9975
mx    20212       9975
br    13654       9975
sa    13387       9975
in     9975       9975
Figure 4a. Language distribution of final unbalanced dataset

After basic testing, it was concluded that the balanced dataset would be a better choice, and all further tests and classifications were based on it.

Feature engineering

Removing superfluous mentions

On Twitter, it’s possible to “mention” another user by prefixing their username with an “@” (@example). This also happens automatically when a user replies to a tweet or otherwise takes part in a conversation with multiple users; in this case, mentions are added to the beginning of the tweet (though the user is free to override this). These mentions do not count towards a tweet’s limit of 280 characters (the longest tweet in the dataset is 964 characters long). This means that even though the raw character count was sometimes much more than 280 characters, the tweet could still be useless from an NLI perspective: if a tweet is 350 characters long and 300 of those characters are usernames, there’s little data to work with.

Still, we felt that completely removing mentions was not a solution. Semantically, they represent something close to proper nouns, and their location in a tweet/sentence (and their number) might be significant.

It was decided to remove only the mentions at the beginning of a tweet, leaving a maximum of two. If a tweet has one or two mentions at the beginning, nothing changes; otherwise, the leading mentions after the second one are removed (see Figure 5 for an example).

This allowed us to remove superfluous mentions while preserving as much of the information they give as possible. Leaving two mentions instead of one allows us to preserve the grammatical number of that part of the tweet, which might be significant. (TODO Test how changes to this influence the prediction – none/one/two/all)

In this step, line breaks were also removed.
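A possible regex-based sketch of this trimming (my own illustration; the thesis does not show its exact implementation, and the mention pattern "@" plus word characters is an assumption):

```python
import re

# Match a run of three or more mentions at the start of a tweet,
# capturing the first two so they can be kept.
LEADING_MENTIONS = re.compile(r"^((?:@\w+\s+){2})(?:@\w+\s+)+")

def trim_leading_mentions(text):
    text = text.replace("\n", " ")  # line breaks were also removed here
    return LEADING_MENTIONS.sub(r"\1", text)

print(trim_leading_mentions("@a @b @c @d hello"))  # → "@a @b hello"
```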

Figure 5. Results of removing superfluous mentions from the beginning of tweets.

The original tweets were moved to the column otext, and the ones with fewer mentions were added in text.

Masking mentions, hashtags and URIs

A version of the tweets with all the mentions, hashtags and URIs removed was written to clean, and features like char_count and punctuation were calculated based on it. This left the question of what to do with mentions, hashtags and URIs in the actual training data. For further analysis, a column “masked” was created with the changes described below.

For bag of words and n-grams, the actual content of a mention is either irrelevant or counterproductive. Irrelevant because a mention is the username of an account, and there are a lot of accounts, so mentions would create ‘words’/tokens that are all different from each other but mean the same thing semantically. Counterproductive because mentions used as words/tokens might work against the stated purpose of native-language identification based on user-generated data. For example, India’s largest English-language newspaper is the Times of India (@timesofindia), and someone mentioning it has a high likelihood of being from India, regardless of other linguistic features. So while such features would improve our classification results, the results would not generalize, as they would largely be based on social-network-specific features as opposed to linguistic ones.

The same may be said about hashtags – #LokSabhaElections2019 (the Lok Sabha is the lower house of India’s Parliament) was trending (a lot of users were tweeting about it) as I was gathering the dataset, and it would offer an easy way to “cheat”. On the other hand, sometimes words inside sentences are used as hashtags (for example, in the tweet “what are we waiting for #MPN #NowUnited #MPN #Uniters” the first hashtag is part of the sentence itself, while the last three are just hashtags), so completely removing them is not a solution.

For URIs, mostly the same applies – URIs are all different, mean mostly the same thing, and in the cases where they are significant, they are not the features we want to use for classifying based on language.

So, in the column “masked”, all mentions, hashtags and URIs were replaced by the words “REPLMENTION”, “REPLHASHTAG” and “REPLURI”.
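A minimal sketch of such masking (the regexes are illustrative, not the ones actually used; URLs are replaced first so that their "#" fragments and "@" characters are not caught by the other patterns):

```python
import re

def mask(text):
    text = re.sub(r"https?://\S+", "REPLURI", text)
    text = re.sub(r"@\w+", "REPLMENTION", text)
    text = re.sub(r"#\w+", "REPLHASHTAG", text)
    return text

print(mask("what are we waiting for #MPN @timesofindia https://t.co/x"))
# → "what are we waiting for REPLHASHTAG REPLMENTION REPLURI"
```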

Basic features added during cleanup

All of these features were based on the column clean, the one with the Tweets after removing all mentions, hashtags, URIs and newlines.

  • char_count - the number of raw characters in the tweet.
  • word_count - the number of words in the tweet.
  • word_density - the number of characters divided by the number of words.
  • punctuation - number of punctuation characters divided by the number of words.
  • title_words - number of words starting with an uppercase letter divided by number of words.
  • upper_case_words - number of words written in all caps divided by number of words.
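These features can be computed in a few lines of Python, for example (a sketch assuming whitespace-separated words; the thesis does not show its exact implementation):

```python
import string

def basic_features(text):
    words = text.split()
    n = len(words) or 1  # avoid division by zero on empty tweets
    return {
        "char_count": len(text),
        "word_count": len(words),
        "word_density": len(text) / n,
        "punctuation": sum(c in string.punctuation for c in text) / n,
        "title_words": sum(w[0].isupper() for w in words) / n,
        "upper_case_words": sum(w.isupper() for w in words) / n,
    }

feats = basic_features("Hello WORLD, this is a Tweet!")
```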


The next step was to break the tweets into tokens. This was initially done on all columns and tests were run; in the end, only the column “masked” was tokenized.

For this, a custom tokenizer was used. A standard tokenizer, for example one provided by nltk, would not have detected our Twitter entities, such as @mentions or hashtags. A Python regex-based script was used instead, adapted from Marco Bonzanini’s tokenizer23.
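A condensed tokenizer in the same spirit might look like this (a simplified illustration, not the exact script used):

```python
import re

# Alternatives are tried in order, so Twitter entities and URLs are
# matched before plain words.
TOKEN_RE = re.compile(r"""
    @\w+                 # mentions
  | \#\w+                # hashtags
  | https?://\S+         # URLs
  | \w+(?:['\-]\w+)*     # words, incl. contractions and hyphenation
  | \S                   # any other non-space character
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("@user loves #nlp!"))  # → ['@user', 'loves', '#nlp', '!']
```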


I used Python’s string.punctuation definition of punctuation: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ . For each of these characters, a column was created, named “p” followed by the ASCII code of the character (to avoid possible issues with column names).
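A sketch of building these columns, following the description above (the helper function is my own illustration):

```python
import string

def punctuation_columns(text):
    """One count per punctuation character, keyed by "p" + its ASCII
    code, e.g. p33 for '!'."""
    return {"p%d" % ord(c): text.count(c) for c in string.punctuation}

cols = punctuation_columns("Hi!!, ok?")
print(cols["p33"], cols["p44"], cols["p63"])  # counts for '!', ',', '?' → 2 1 1
```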

Function words (a, the, …) counts

  • a – number of occurrences of the word (not the letter) “a” divided by number of words.
  • the – number of occurrences of the word “the” divided by number of words.

Grammatical(?) features

Parts of speech tagging

TODOM: Parts of speech are…. are important for us because ….

POS tagging of social media text has its own complexities, well described in Tobias Horsmann’s dissertation “Robust Part-of-Speech Tagging of Social Media Text” 24, from which the example in Figure 5 is taken. We tagged parts of speech using nltk, whose tagger is not tuned for social media text; this is another source of noise in our data and a possible reason for suboptimal performance.

Figure 5. Complexities of POS tagging user-generated data, as described by Tobias Horsmann. TODO source inside this description.

We used nltk’s POS tagger, which encodes parts of speech as short strings. For each POS tag found in the entire dataset, a column was created containing the number of occurrences of that POS tag in the Tweet.

This left the question of what to do with mentions, hashtags and URIs. We decided to tag them as their own parts of speech. While they don’t exist in classical English grammar and tagging them would have been problematic, they carry important information, and their location and number may also be significant.

The example tweet below shows the steps from a tweet, to the masked text, to tokens, to the tagged POS:

  • A huge thank you to @ROsterley from @CharityRetail for joining us today at the #HUKRetailConf held @ChesfordGrangeQ today. Fantastic insights into the #CharityRetail sector @hospiceuk #conference #hospice
  • A huge thank you to REPLMENT from REPLMENT for joining us today at the REPLHASH held REPLMENT today. Fantastic insights into the REPLHASH sector REPLMENT REPLHASH REPLHASH
  • [‘A’, ‘huge’, ‘thank’, ‘you’, ‘to’, ‘REPLMENT’, ‘from’, ‘REPLMENT’, ‘for’, ‘joining’, ‘us’, ‘today’, ‘at’, ‘the’, ‘REPLHASH’, ‘held’, ‘REPLMENT’, ‘today’, ‘.’, ‘Fantastic’, ‘insights’, ‘into’, ‘the’, ‘REPLHASH’, ‘sector’, ‘REPLMENT’, ‘REPLHASH’, ‘REPLHASH’]
  • [‘DT’, ‘JJ’, ‘NN’, ‘PRP’, ‘TO’, ‘MENTION’, ‘IN’, ‘MENTION’, ‘IN’, ‘VBG’, ‘PRP’, ‘NN’, ‘IN’, ‘DT’, ‘HASHTAG’, ‘VBD’, ‘MENTION’, ‘NN’, ‘.’, ‘JJ’, ‘NNS’, ‘IN’, ‘DT’, ‘HASHTAG’, ‘NN’, ‘MENTION’, ‘HASHTAG’, ‘HASHTAG’]
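The custom tags can be applied by patching the tagger’s output after tagging. The sketch below is my own illustration: it takes a list of (token, tag) pairs such as nltk.pos_tag would return, with the base tags for the placeholder tokens assumed, and overrides them.

```python
# Map the placeholder tokens to our custom POS tags; both the short and
# long placeholder spellings appearing in this thesis are covered.
CUSTOM = {"REPLMENT": "MENTION", "REPLMENTION": "MENTION",
          "REPLHASH": "HASHTAG", "REPLHASHTAG": "HASHTAG",
          "REPLURI": "URL"}

def patch_tags(tagged):
    """tagged: list of (token, pos) pairs, e.g. from nltk.pos_tag."""
    return [(tok, CUSTOM.get(tok, pos)) for tok, pos in tagged]

print(patch_tags([("thank", "NN"), ("REPLMENT", "NN"), ("REPLHASH", "NN")]))
# → [('thank', 'NN'), ('REPLMENT', 'MENTION'), ('REPLHASH', 'HASHTAG')]
```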

The following POS tags are used, list and examples from 25:

  • CC coordinating conjunction
  • CD cardinal digit
  • DT determiner
  • EX existential there (like: “there is” … think of it like “there exists”)
  • FW foreign word
  • IN preposition/subordinating conjunction
  • JJ adjective ‘big’
  • JJR adjective, comparative ‘bigger’
  • JJS adjective, superlative ‘biggest’
  • LS list marker 1)
  • MD modal could, will
  • NN noun, singular ‘desk’
  • NNS noun plural ‘desks’
  • NNP proper noun, singular ‘Harrison’
  • NNPS proper noun, plural ‘Americans’
  • PDT predeterminer ‘all the kids’
  • POS possessive ending parent’s
  • PRP personal pronoun I, he, she
  • PRP$ possessive pronoun my, his, hers
  • RB adverb very, silently,
  • RBR adverb, comparative better
  • RBS adverb, superlative best
  • RP particle give up
  • TO to go ‘to’ the store.
  • UH interjection errrrrrrrm
  • VB verb, base form take
  • VBD verb, past tense took
  • VBG verb, gerund/present participle taking
  • VBN verb, past participle taken
  • VBP verb, sing. present, non-3d take
  • VBZ verb, 3rd person sing. present takes
  • WDT wh-determiner which
  • WP wh-pronoun who, what
  • WP$ possessive wh-pronoun whose
  • WRB wh-adverb where, when

Our additional POS tags:

  • HASHTAG #hashtag
  • URL
  • MENTION @mention

A final row in the dataset looked like this:

co               br                  
Unnamed: 1       43                  
userid           1156444526468325378 
username         AlineBia6           
location         Manaus, Brasil      
datetime         2019-07-31 06:00:35 
lat                            -2.57 
long                          -59.98 
lang             en                  
otext            “I took your words\nAnd I believed\nIn everything\n
                 You said to me\nYeah huh\nThat's right\nIf someone said three years 
                 from now\nYou'd be long gone\nI'd stand up and punch them out\n'Cause
                 they're all wrong\nI know better\n'Cause you said forever\nAnd ever\nWho
text             “I took your words And I believed In everything You 
                 said to me Yeah huh That's right If someone said three years from now 
                 You'd be long gone I'd stand up and punch them out 'Cause they're all 
                 wrong I know better 'Cause you said forever And ever Who knew” 
masked           “I took your words And I believed In everything You said 
                 to me Yeah huh That's right If someone said three years from now You'd
                 be long gone I'd stand up and punch them out 'Cause they're all wrong 
                 I know better 'Cause you said forever And ever Who knew” REPLURI
clean            “I took your words And I believed In everything You said 
                 to me Yeah huh That's right If someone said three years from now You'd 
                 be long gone I'd stand up and punch them out 'Cause they're all wrong I 
                 know better 'Cause you said forever And ever Who knew” 
char_count       278 
word_count       51  
word_density                    5.35
punctuation                     0.22
title_words                     0.24
upper_case_words                0.06
the                             0.00
a                               0.00
tokens           ['“', 'I', 'took', 'your', 'words', 'And', 'I', 'believed', 
                'In', 'everything', 'You', 'said', 'to', 'me', 'Yeah', 'huh', "That's",
                'right', 'If', 'someone', 'said', 'three', 'years', 'from', 'now', "You'd",
                'be', 'long', 'gone', "I'd", 'stand', 'up', 'and', 'punch', 'them', 'out',
                "'", 'Cause', "they're", 'all', 'wrong', 'I', 'know', 'better', "'", 'Cause', 
                'you', 'said', 'forever', 'And', 'ever', 'Who', 'knew', '”', 'REPLURI']
pos              ['NN', 'PRP', 'VBD', 'PRP$', 'NNS', 'CC', 'PRP', 'VBD', 
                'IN', 'NN', 'PRP', 'VBD', 'TO', 'PRP', 'NNP', 'VBD', 'NNP', 'NN', 'IN',
                'NN', 'VBD', 'CD', 'NNS', 'IN', 'RB', 'NNP', 'VB', 'RB', 'VBN', 'NNP',
                'VBP', 'RP', 'CC', 'VB', 'PRP', 'RP', "''", 'NNP', 'IN', 'DT', 'JJ', '
                PRP', 'VBP', 'RB', 'POS', 'NNP', 'PRP', 'VBD', 'RB', 'CC', 'RB', 'NNP',
                'VBD', 'NNP', 'URL']
pos_count        55
URL      1.00   JJS      0.00   ,        0.00   NNS      2.00
POS      1.00   VBD      7.00   LS       0.00   RB       5.00
WP       0.00   ''       1.00   MENTION  0.00   #        0.00
HASHTAG  0.00   NNPS     0.00   UH       0.00   VBG      0.00
DT       1.00   TO       1.00   WRB      0.00   PRP$     1.00
``       0.00   RBS      0.00   CC       3.00   VB       2.00
FW       0.00   MD       0.00   IN       4.00   PDT      0.00
WP$      0.00   VBZ      0.00   VBP      2.00   PRP      7.00
VBN      1.00   )        0.00   RBR      0.00   CD       1.00
EX       0.00   NNP      8.00   NN       4.00   SYM      0.00
$        0.00   JJ       1.00   :        0.00   .        0.00
WDT      0.00   JJR      0.00   RP       2.00   (        0.00
p!  0   p"  0   p#  0   p$  0   p%  0   p&  0   p'  6   p(  0
p)  0   p*  0   p+  0   p,  0   p-  0   p.  1   p/  3   p:  1
p;  0   p<  0   p=  0   p>  0   p?  0   p@  0   p[  0   p\  0
p]  0   p^  0   p_  0   p`  0   p{  0   p|  0   p}  0   p~  0


Classification goals

Additional remarks


Lastly, it could be argued that the issues with hashtags described above also apply to words, albeit to a lesser degree. Since there was an election in India during the period in which the dataset was collected, some of the models might attach additional meaning to the usually neutral word “election” and give biased predictions for tweets containing it. This could be partly fixed by a much larger dataset gathered over a longer period of time. But the issue of some topics or words appearing more often in the tweets of certain countries, simply because the concepts they represent are more likely to be talked about there, is much harder to mitigate this way.

During initial testing, I wanted to see whether there are words that are much more likely to be used by speakers of a certain L1. I first trained an NN on the same dataset using simple bag-of-words features. Then I classified each word seen in the dataset as if it were a tweet, and sorted the words by how confident the NN was in its classification. Unsurprisingly, the result was words like “rikshaw” or city names, not the subtler word-use patterns I had hoped for.

This also illustrates a limitation of word-based features: someone deliberately hiding their L1 will simply avoid words like “rikshaw”, which is why the content-independent, purely grammatical feature set remains important for that goal.


First, baseline results were calculated, both for TWB and TWUB. Baseline results are useful because they show the accuracy a trivial strategy achieves, giving a floor against which the real classifiers can be judged. Sklearn’s sklearn.dummy.DummyClassifier was used. 26

Two strategies were used for this – stratified and most frequent – and both are more interesting in the case of an unbalanced dataset.
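A minimal sketch of both strategies with scikit-learn, on a hypothetical label distribution resembling TWUB (the exact proportions are placeholders):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical unbalanced label set; 'UK' dominates, as in TWUB.
y = np.array(["UK"] * 44 + ["IN"] * 28 + ["SA"] * 12 + ["BR"] * 10 + ["MX"] * 6)
X = np.zeros((len(y), 1))  # features are ignored by dummy classifiers

most_frequent = DummyClassifier(strategy="most_frequent").fit(X, y)
print(most_frequent.score(X, y))  # 0.44 – the share of the majority class

stratified = DummyClassifier(strategy="stratified", random_state=0).fit(X, y)
# Predictions are drawn at random following the class frequencies:
print(sorted(set(stratified.predict(X))))
```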

Stratified dummy classifier

A stratified dummy classifier makes random predictions that follow the class distribution of the training set; on TWUB it therefore returned values distributed like the dataset itself:

Figure 6a. Confusion matrix for stratified dummy classifier on unbalanced dataset

Accuracy was 0.28.

Most-frequent dummy classifier

It predictably returned only UK tweets:

Figure 6b. Confusion matrix for most-frequent dummy classifier on unbalanced dataset

Accuracy was 0.44.

Dummy classifiers for balanced datasets

For TWB, all dummy classifiers returned ~20% accuracy, that is, 100% divided by the number of target categories.

Therefore, 20% will be considered the baseline for all further classification on both datasets.

Classification on basic features and effects of unbalanced datasets

A number of tests were done to classify the tweets based on the number of punctuation marks, a/the counts and POS tags, along with features like character count, number of capitalized words etc. The dataset was shuffled and then divided into a training and a test set as usual, with a 0.7/0.3 training/testing split, resulting in 34912/14963 and 72800/31201 train/test tweets for TWB/TWUB respectively.
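The split itself is a one-liner; the frame below is only a placeholder for the real feature table:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder for the tweet feature table.
df = pd.DataFrame({"pos_count": range(10),
                   "label": ["UK", "IN", "BR", "MX", "SA"] * 2})

# Shuffle and split 0.7/0.3 as described above.
train, test = train_test_split(df, test_size=0.3, shuffle=True, random_state=0)
print(len(train), len(test))  # 7 3
```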

For all tasks described in this chapter, the features described in the previous part were used, except ‘otext’,’userid’,’username’,’location’,’datetime’,’lang’,’text’, ‘tokens’, ‘pos’, ‘clean’, ‘lat’, ‘long’.

Tweet content/text was not used in this part, and latitude and longitude were excluded for obvious reasons. Datetime, location and language were never used either. While date and especially time might have been interesting to take into account, they are too social-network-specific, and I wanted to concentrate on the linguistic features. Social features for NLI have been used and are very well described in 27 and 13.


Basic classification on unbalanced dataset

Having calculated the baselines, it’s possible to start classifying the tweets. Three algorithms were used for this initial classification – a DNN classifier as implemented in Tensorflow’s Estimators API, a Random Forest and an SVM, the latter two as available in scikit-learn.

DNN on unbalanced dataset

I used Tensorflow’s DNNClassifier 28, implemented as part of the Estimators API. Two hidden layers were used, of 30 and 20 neurons. I did not use cross-validation or hyperparameter tuning to choose these parameters. The choice of two layers was partly motivated by 29 and the rule of thumb “fewer neurons than half of the input layer” (89 features in this case), but I tried many combinations, even with hundreds of neurons, and accuracy hit a ceiling (~51% for the unbalanced dataset).
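As a rough equivalent of that architecture outside the Estimators API, scikit-learn’s MLPClassifier can be configured with the same two hidden layers of 30 and 20 neurons; the random data below merely stands in for the 89 basic features:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 89))    # placeholder for the 89 basic features
y = rng.integers(0, 5, size=200)  # 5 target countries

# Two hidden layers of 30 and 20 neurons, as in the thesis setup.
clf = MLPClassifier(hidden_layer_sizes=(30, 20), max_iter=500,
                    random_state=0).fit(X, y)
print(clf.predict(X[:3]))
```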

The result was 52% accuracy; the confusion matrix is shown in Figure 7a.

Figure 7a. Confusion matrix of DNN classifier running on unbalanced dataset

SVM on unbalanced dataset

An SVM (Support Vector Machine) classifies points by finding the hyperplane that separates the classes with the largest margin, optionally after mapping the features into a higher-dimensional space with a kernel.

Figure 7c. Confusion matrix of SVM examples on unbalanced dataset

Accuracy: 42%. Training took a long time, and the SVM is not ideal here, since this is not a sparse dataset.
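A minimal sketch of this step with scikit-learn’s SVC on placeholder data (the kernel and parameters used in the thesis are not specified, so the defaults below are assumptions); standardizing helps SVMs on dense feature matrices:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 89))    # placeholder for the 89 basic features
y = rng.integers(0, 5, size=200)

# Standardize, then fit an RBF-kernel SVM with default parameters.
clf = make_pipeline(StandardScaler(), SVC()).fit(X, y)
print(clf.predict(X[:5]))
```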

RF on unbalanced dataset

A Random Forest is an ensemble of decision trees, each trained on a bootstrap sample of the data with a random subset of the features; the final prediction is a majority vote over the trees. Here, 1000 estimators (trees) were used.
Figure 7b. Confusion matrix of Random Forest classifier with 1000 estimators running on unbalanced dataset
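A sketch of this step with scikit-learn’s RandomForestClassifier and the 1000 estimators mentioned above (the data is again a placeholder):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 89))    # placeholder for the 89 basic features
y = rng.integers(0, 5, size=200)

# 1000 trees, as in the thesis setup; other parameters are defaults.
clf = RandomForestClassifier(n_estimators=1000, random_state=0,
                             n_jobs=-1).fit(X, y)
print(len(clf.estimators_))  # 1000
```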

Classification based on basic features on TWB

Settings and parameters are the same as in the previous subchapter; only the dataset changed.

Deep Neural Network on balanced dataset

Figure 7c. Confusion matrix of DNN on balanced dataset

Accuracy was 0.40.

Random forest on balanced dataset


Figure 7d. Confusion matrix of Random forest on balanced dataset

Accuracy: 46%

SVM on balanced dataset

Figure 7c. Confusion matrix of SVM examples on balanced dataset

Accuracy: 37%

Summary of results and effects of unbalanced datasets

  • Classifiers trained on the unbalanced dataset reach higher raw accuracy, but they are pretty much useless for our classification needs: as the confusion matrices show, they mostly predict the largest classes.
  • The confusion matrices show interesting clusters: confusion between BR–MX and UK–IN is predictable, while the confusion between UK and SA is harder to explain.

Classification with NLP features

After the results of the previous chapter, I decided to work only on the balanced dataset. A comparison was done between classification on masked text (usernames, mentions and URLs replaced by REPLUSER, REPLMENT, REPLURI) and on the original text, both tokenized. Most of the results in this chapter were obtained using only the features mentioned in the section titles; none of the POS counts, a/the counts, punctuation counts etc. described in the previous chapters were used.

POS n-grams

For the POS n-grams, tf-idf weighting was used: an n-gram’s frequency within a tweet is scaled down by how common that n-gram is across the whole corpus, so n-grams that appear everywhere contribute little, while n-grams characteristic of a particular tweet are emphasized.

Token/word n-grams

Figure X.
Figure X.
Figure X.

Note that these results are not perfectly stable: rerunning the training changes the individual accuracies noticeably, which should be kept in mind when comparing them.

Ensemble learning

Task 1

  • The ensemble improved on the individual SVM, NN and RF results for both tasks; as for the individual classifiers, the SVM parameters were chosen via grid search, based on the state-of-the-art paper.

Task 2

  • Results improved further with the RF.
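The ensemble described in the overview (a neural meta-classifier combining RF and SVM predictions) can be sketched with scikit-learn’s StackingClassifier; all sizes and parameters below are placeholders, not the values used in the thesis:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 89))    # placeholder features
y = rng.integers(0, 5, size=200)

# RF and SVM as base learners, a small NN as meta-classifier over
# their cross-validated predicted probabilities.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
                ("svm", SVC(probability=True, random_state=0))],
    final_estimator=MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000,
                                  random_state=0),
).fit(X, y)
print(stack.predict(X[:5]))
```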


  • Created a new user-generated dataset
  • Performed a basic classification, which worked reasonably well
  • Found interesting patterns in which language families get confused with each other


  • Explanation of my results
  • New results
  • Limitations
  • Future research
    • Better dataset (see chapter on Dataset)



  1. Part of speech | Definition of Part of speech at 

  2. (PDF) Time Perspective Profiles of Cultures 

  3. L. Boroditsky et al. “Sex, Syntax, and Semantics,” in D. Gentner and S. Goldin-Meadow, eds., Language in Mind: Advances in the Study of Language and Cognition (Cambridge, MA: MIT Press, 2003), 61–79. 

  4. Mother Tongues 

  5. A Report on the 2017 Native Language Identification Shared Task - file 

  6. Native-language identification - Wikipedia 





  11. L2 corpora | CLARIN ERIC 

  12., data from 2009. 

  13. “Online Human-Bot Interactions: Detection, Estimation, and Characterization”  2




  17., 7.3/7.4  2





  22. This made classifying real tweets shorter than 40 characters complicated, but I believe they would have been complicated to classify regardless. 


  24. “Robust Part-of-Speech Tagging of Social Media Text” 

  25. Tokenization and Parts of Speech(POS) Tagging in Python’s NLTK library 

  26. sklearn.dummy.DummyClassifier — scikit-learn 0.21.3 documentation 

  27. “Native Language Identification with User Generated Content” 

  28. tf.estimator.DNNClassifier  |  TensorFlow Core r1.14  |  TensorFlow 

  29. The Number of Hidden Layers | Heaton Research