Vocabulary / Terms used

Mention – a reference to another Twitter user, made by prefixing their username with “@” (e.g. @example)
Hashtag – a keyword or topic marker prefixed with “#” (e.g. #conference)
Tokenization – the process of splitting a text into individual tokens (words, punctuation marks, and Twitter entities such as mentions and hashtags)

First language identification

basic theory and description

Ensemble learning

basic theory and description

papers and previous results about

Main Part


The dataset contains geo-tagged tweets from Twitter. Twitter1 is a social network with 126 million active users2, on which users interact with each other through short messages (tweets). Tweets were originally restricted to 140 characters, but in 2017 the limit was doubled to 280 for most languages. Users can optionally specify a location in their tweets, and when searching, tweets can be filtered by their location. The location can be specified either as exact coordinates or as a Twitter “Place”, which is a polygon with additional semantic information attached to it, such as a city name34.

Additionally, Twitter automatically identifies the language of each tweet, making it possible to filter tweets by their location (if they are geo-tagged) and their language at the same time.


  • It’s a user-generated dataset, which by itself …. TODO Advantages


A user-generated dataset like the one used here is not a “clean” dataset.

Twitter-specific issues

  • For the purposes of this thesis, we assume that the tweet location is a good proxy for the first language of the author, which is not always the case: the fact that someone writes English tweets while located in a certain country does not mean that they are a native speaker of that country’s language5. This will impact the results of any classification.
  • Not all tweets are written by real people: Twitter has a large number of bots and automatically posted tweets. By some estimates, as much as 24% of tweets are created by bots6, and 9% to 15% of active users are bots7. Some of these are weather bots, financial aggregation bots, or novelty bots8, whose text is not representative of real-life language used by L1 native speakers.
  • In line with the point above, tweets can be automatically generated. For example, some users choose to configure automatic cross-posting of their activity on other social networks. This creates tweets that are very similar to each other (“%@username% just posted a photo on his Instagram account: %link%”). Twitter used to have an API feature to show and/or filter posts made from other clients, but this option was removed9 as part of a policy of focusing on the official website and clients (as opposed to third-party apps)10. This means that such content has to be found and filtered manually, which was attempted but should not be considered a fully effective solution. (TODO – add statistics about how many bots I removed manually).

Issues stemming from the type of data used

  • Tweets can (in the general case) be up to 280 characters long, but are in fact usually much shorter. This means that a tweet may not contain enough information to distinguish the L1 of its author. Usually, NLI is done on much longer texts, such as essays. (TODO – examples and sources)
  • Twitter’s language detection is not perfect, especially with shorter texts.
  • A way to overcome some of these issues is to collect multiple tweets from the same user and classify them together; this will be attempted if time permits.

Additional remarks

Usually, NLI done on transcriptions of spoken text is much less precise than NLI done on essays or other written text; some sources report as much as a 10% difference11. There may be various causes for this. One may be that transcribed speech contains fewer features such as punctuation, and that people, when speaking, tend to use much simpler words and grammatical constructions11. This may be an issue for this dataset: tweets are not always complete sentences, and are usually written much closer to how people actually speak. The format is quite informal, and most people would not spend much effort on correct grammar and punctuation for a tweet (this is more typical of essays, and it is essays that make up typical NLI datasets, such as the TOEFL11 NLI corpus12).


The dataset was collected in April 2019. It contains English-language tweets from regions delimited by bounding boxes (GPS coordinates).


For communication with the Twitter API, the tweepy package (“An easy-to-use Python library for accessing the Twitter API”)13 was used. The final script for collecting tweets from a Twitter Stream14 was heavily based on GeoSearch-Tweepy15, a script containing examples of how to filter Twitter Streams by geolocation data.

Countries and bounding boxes

The countries used in this thesis are India, Saudi Arabia, Brazil, Mexico, and Great Britain. These countries were chosen because of their large number of Twitter users (to make data collection easier) and because they represent languages from different language families.

Brazil and Mexico have different languages that nonetheless belong to the same language family; if our assumptions and process are valid, these two countries should show the highest mutual misclassification.

The bounding box for Great Britain also contains Ireland, since the two are close linguistically (and fit neatly into one bounding box). In India, many languages are spoken, but most belong to two unrelated language families: Indo-European (Indo-Iranian branch) and Dravidian, with 78% and 20% of speakers respectively.

Country        Bounding box                                                   Language             Language family
India          (67.46, 5.43, 90.71, 27.21)                                    122 major languages  78% Indo-European -> Indo-Iranian, 19.64% Dravidian
Saudi Arabia   (34.77, 12.86, 49.84, 30.19), (48.1, 13.93, 60.25, 24.77)      Arabic               Afro-Asiatic -> Semitic
Brazil         (-58.55, -30.11, -35.26, 2.5), (-67.3, -13.03, -34.38, 1.53)   Portuguese           Indo-European -> Italic
Mexico         (-112.59, 17.98, -85.38, 27.75)                                Spanish (de facto)   Indo-European -> Italic
Great Britain  (-10.68, 50.15, 1.41, 59.69)                                   English              Indo-European -> Germanic
Figure 1. Bounding boxes projected on a Mercator map. Background image © Daniel Strebe, 15 August 2011, CC BY-SA 3.0

First collection results

107791 tweets were collected, distributed as follows: Brazil 12033, India 13449, Mexico 6331, Saudi Arabia 11359, UK 64619.

Figure 2. Language of the collected tweets.

Cleanup and dataset preparation

To simplify further analysis, we started with a cleaned-up version of the dataset. Tweets were changed or removed as described below, and some initial features were added. All of this was done using pandas, a data analysis library for Python.

Original dataset

The csv file with the dataset had the columns “userid,username,location,datetime,lat,long,lang,text”, containing the author’s Twitter user ID, @username, the location as provided by the user themselves, the date and time of the tweet in UTC16, latitude, longitude, the language of the tweet as detected by Twitter, and the text of the tweet.

Figure 3. The initial dataset after collection.
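As an illustration (the sample row here is hypothetical, and in the real script the csv would be read from a file), the dataset can be loaded with pandas like this:

```python
import io

import pandas as pd

# Hypothetical sample in the same format as the collected csv file.
raw = io.StringIO(
    "userid,username,location,datetime,lat,long,lang,text\n"
    '1,example_user,"London, UK",2019-04-11 09:27:15,51.5,-0.04,en,Hello world\n'
)

# Parse the datetime column directly while reading.
df = pd.read_csv(raw, parse_dates=["datetime"])
```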

Location and native language (L1)

Then, based on the latitude and longitude data, each tweet was labeled with one of our L1 categories. This was done by checking in which bounding box the tweet’s GPS coordinates were located. The results can be seen in Figure 2.
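A minimal sketch of this labeling step (the dictionary and function names are ours; the boxes are taken from the table above, as (min long, min lat, max long, max lat)):

```python
# Bounding boxes as (min long, min lat, max long, max lat), matching the
# table above; countries covered by two boxes simply list both.
BOXES = {
    "india": [(67.46, 5.43, 90.71, 27.21)],
    "saudiarabia": [(34.77, 12.86, 49.84, 30.19), (48.1, 13.93, 60.25, 24.77)],
    "brazil": [(-58.55, -30.11, -35.26, 2.5), (-67.3, -13.03, -34.38, 1.53)],
    "mexico": [(-112.59, 17.98, -85.38, 27.75)],
    "uk": [(-10.68, 50.15, 1.41, 59.69)],
}

def label_country(lat, long):
    """Return the L1 label whose bounding box contains the point, or None."""
    for country, boxes in BOXES.items():
        for (min_lo, min_la, max_lo, max_la) in boxes:
            if min_la <= lat <= max_la and min_lo <= long <= max_lo:
                return country
    return None
```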

Removing superfluous mentions

On Twitter, it’s possible to “mention” another user by prefixing their username with an “@” (@example). This is also done automatically when a user replies to a tweet or otherwise takes part in a conversation with multiple users; in this case, the mentions are added to the beginning of the tweet (though the user is free to override this). The number of mentions does not count towards a tweet’s limit of 280 characters (the longest tweet in the dataset is 964 characters long). This meant that even though the raw character count was sometimes much more than 280 characters, the tweet was still useless from an NLI perspective. If a tweet is 350 characters long and 300 of those characters are usernames, there is little data to work with.

Still, we felt that completely removing mentions was not a solution. Semantically, they represent something close to proper nouns, and their location in a tweet/sentence (and their number) might be significant.

It was decided to remove only the mentions at the beginning of a tweet, leaving a maximum of two. For example, if a tweet has one or two mentions at the beginning, nothing changes; if there are more, only the first two are kept (see Figure 5 for an example).

This allowed us to remove superfluous mentions while preserving as much of the information they carry as possible. Leaving two mentions instead of one preserves the grammatical number of that part of the tweet, which might be significant. (TODO Test how changes to this influence the prediction – none/one/two/all)

In this step, line breaks were also removed.
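A sketch of the trimming rule, assuming mentions match `@\w+` (the function name is ours):

```python
import re

# One or more whitespace-separated mentions at the very start of a tweet.
LEADING_MENTIONS = re.compile(r'^((?:@\w+\s+)+)')

def trim_leading_mentions(text, keep=2):
    """Keep at most `keep` mentions at the beginning of a tweet, drop the rest."""
    text = text.replace('\n', ' ')  # line breaks are also removed in this step
    m = LEADING_MENTIONS.match(text)
    if not m:
        return text
    mentions = m.group(1).split()
    rest = text[m.end():]
    return ' '.join(mentions[:keep] + [rest]).strip()
```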

Figure 5. Results of removing superfluous mentions from the beginning of tweets.

The original tweets were moved to the column otext, and the versions with fewer mentions were stored in text.

Removing short and possibly automatically generated tweets

Tweets shorter than 20 characters (not counting removed mentions) were removed (6000 of them, about 5.5%)17. Tweets containing substrings indicative of automatic posting (“Just posted”, “has just posted”, “Want to work at”, “I’m at”) were also removed (about 6%).

This left a dataset of 95355 tweets, or 88% of the initial number.
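A pandas sketch of this filtering step (the function name is ours; the marker strings follow the list above, written with a straight apostrophe):

```python
import re

import pandas as pd

# Substrings indicative of automatic posting, per the cleanup step above.
AUTO_MARKERS = ("Just posted", "has just posted", "Want to work at", "I'm at")
AUTO_RE = "|".join(re.escape(m) for m in AUTO_MARKERS)

def filter_tweets(df, min_chars=20):
    """Drop tweets that are too short or look automatically generated."""
    long_enough = df["text"].str.len() >= min_chars
    not_auto = ~df["text"].str.contains(AUTO_RE, regex=True)
    return df[long_enough & not_auto]

sample = pd.DataFrame({"text": [
    "short",                                # dropped: under 20 characters
    "This is a perfectly ordinary tweet",   # kept
    "Just posted a photo @ somewhere",      # dropped: auto-posting marker
]})
kept = filter_tweets(sample)
```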

Feature engineering

Basic features added during cleanup

All of these features were computed from the column text, i.e. the tweets after removing superfluous mentions and newlines; the mentions that remained were included in the counts. (TODO – make a new dataset)

  • char_count - the number of raw characters in the tweet.
  • word_count - the number of words in the tweet.
  • word_density - the number of characters divided by the number of words.
  • punctuation - number of punctuation characters (!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~’) divided by the number of words.
  • title_words - number of words starting with an uppercase letter divided by number of words.
  • upper_case_words - number of words written in all caps divided by number of words.
  • a – number of occurrences of the word (not the letter) “a” divided by number of words.
  • the – number of occurrences of the word “the” divided by number of words.

  • co - country/L1 from which the tweet was written, our target function for classification.
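As a sketch (the function name is ours, and “words” are approximated by whitespace-separated tokens), these features could be computed per tweet as follows:

```python
import string

def basic_features(text):
    """Character- and word-level features as described above."""
    words = text.split()
    n = len(words) or 1  # avoid division by zero for empty tweets
    punct = sum(ch in string.punctuation for ch in text)
    return {
        "char_count": len(text),
        "word_count": len(words),
        "word_density": len(text) / n,
        "punctuation": punct / n,
        "title_words": sum(w[:1].isupper() for w in words) / n,
        "upper_case_words": sum(w.isupper() for w in words) / n,
        "a": sum(w.lower() == "a" for w in words) / n,
        "the": sum(w.lower() == "the" for w in words) / n,
    }
```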


The next step was to break the tweets into tokens. For this, a custom tokenizer was used: a standard tokenizer, for example the one provided by nltk, would not have detected Twitter entities such as @mentions or hashtags. A Python regex-based script was used, adapted from Marco Bonzanini’s tokenizer18.
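A minimal regex tokenizer in the same spirit (the exact patterns here are our simplified version, not Bonzanini’s original):

```python
import re

# Twitter entities are matched first, then ordinary tokens; alternation
# order matters (e.g. URLs must come before plain words).
patterns = [
    r'@\w+',                           # @mentions
    r'#\w+',                           # hashtags
    r'https?://\S+',                   # URLs
    r"[a-zA-Z]+(?:['\-][a-zA-Z]+)*",   # words, possibly with ' or - inside
    r'\d+',                            # numbers
    r'\S',                             # any other non-space character
]
tokens_re = re.compile('|'.join(f'(?:{p})' for p in patterns))

def tokenize(text):
    return tokens_re.findall(text)
```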

Grammatical features

Parts of speech

Parts of speech (POS) were tagged using nltk. TODO: Parts of speech are important for us because ….

nltk’s tagger encodes each part of speech as a short string (e.g. ‘NN’ for a singular noun).

This left the question of what to do with mentions, hashtags and URIs. We decided to tag them as their own parts of speech, because TODO.

The example tweet below shows the steps from a tweet, to tokens, to the tagged POS:

A huge thank you to @ROsterley from @CharityRetail for joining us today at the #HUKRetailConf held @ChesfordGrangeQ today. Fantastic insights into the #CharityRetail sector @hospiceuk #conference #hospice  
[‘A’, ‘huge’, ‘thank’, ‘you’, ‘to’, ‘@ROsterley’, ‘from’, ‘@CharityRetail’, ‘for’, ‘joining’, ‘us’, ‘today’, ‘at’, ‘the’, ‘#HUKRetailConf’, ‘held’, ‘@ChesfordGrangeQ’, ‘today’, ‘.’, ‘Fantastic’, ‘insights’, ‘into’, ‘the’, ‘#CharityRetail’, ‘sector’, ‘@hospiceuk’, ‘#conference’, ‘#hospice’]

[‘DT’, ‘JJ’, ‘NN’, ‘PRP’, ‘TO’, ‘MENTION’, ‘IN’, ‘MENTION’, ‘IN’, ‘VBG’, ‘PRP’, ‘NN’, ‘IN’, ‘DT’, ‘HASHTAG’, ‘VBD’, ‘MENTION’, ‘NN’, ‘.’, ‘JJ’, ‘NNS’, ‘IN’, ‘DT’, ‘HASHTAG’, ‘NN’, ‘MENTION’, ‘HASHTAG’, ‘HASHTAG’]
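A sketch of this entity-aware tagging (the function name is ours): Twitter entities are tagged directly, and the remaining tokens are passed to an ordinary POS tagger such as nltk.pos_tag (the default path assumes nltk and its tagger model are installed; a different tagger can be injected for testing):

```python
import re

def tag_tokens(tokens, base_tagger=None):
    """Tag Twitter entities with custom labels; pass everything else
    through an ordinary POS tagger (nltk.pos_tag by default)."""
    if base_tagger is None:
        # Assumes nltk and its averaged perceptron tagger model are installed.
        from nltk import pos_tag
        base_tagger = pos_tag
    tags = [None] * len(tokens)
    plain, positions = [], []
    for i, tok in enumerate(tokens):
        if tok.startswith('@'):
            tags[i] = 'MENTION'
        elif tok.startswith('#'):
            tags[i] = 'HASHTAG'
        elif re.match(r'https?://', tok):
            tags[i] = 'URL'
        else:
            plain.append(tok)
            positions.append(i)
    # Tag the non-entity tokens in one batch and put the tags back in place.
    for (tok, tag), i in zip(base_tagger(plain), positions):
        tags[i] = tag
    return tags
```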

We deemed the following the most important:

  • NN – noun, singular (‘desk’)
  • JJ – adjective (‘big’)
  • NNP – proper noun, singular (‘Harrison’)
  • IN – preposition / subordinating conjunction
  • CC – coordinating conjunction
  • VB – verb, base form (‘take’)
  • DT – determiner
  • CD – cardinal digit
  • PRP – personal pronoun (‘I’, ‘he’, ‘she’)

  • HASHTAG – #hashtag
  • URL – a link
  • MENTION – @mention

The relative frequency of each of them was added as a separate column / feature.
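A sketch of computing these per-tag feature columns (the function and constant names are ours):

```python
from collections import Counter

# The tags deemed most important, as listed above.
IMPORTANT = ['NN', 'JJ', 'NNP', 'IN', 'CC', 'VB', 'DT', 'CD', 'PRP',
             'HASHTAG', 'URL', 'MENTION']

def pos_features(pos_tags, keep=IMPORTANT):
    """Relative frequency of each selected POS tag within one tweet."""
    counts = Counter(pos_tags)
    n = len(pos_tags) or 1  # guard against empty tag lists
    return {tag: counts.get(tag, 0) / n for tag in keep}
```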

TODO: Improve the dataset by using the entire array of POS tags, after somehow deciding which ones are the important ones; also change the examples.

A final row in the dataset looked like this:

userid              1116271482848464897
username            TowerHamTennis
location            East London, UK
datetime            2019-04-11 09:27:15
long                              -0.04
lang                en
otext               Great to see energy levels up this morning @TowerHamTennis #EastLondonTennis \n\nJust a few spaces left next week #IsleofDogs #Shadwell #Wapping
char_count          190
text                Great to see energy levels up this morning @TowerHamTennis #EastLondonTennis Just a few spaces left next week #IsleofDogs #Shadwell #Wapping
co                  uk
word_count          22
word_density                       8.26
punctuation                        0.68
title_words                        0.18
upper_case_words                   0.00
the                                0.00
a                                  0.05
tokens              ['Great', 'to', 'see', 'energy', 'levels', 'up', 'this', 'morning', '@TowerHamTennis', '#EastLondonTennis', 'Just', 'a', 'few', 'spaces', 'left', 'next', 'week', '#IsleofDogs', '#Shadwell', '#Wapping', '', '']
pos                 ['NN', 'TO', 'VB', 'NN', 'NNS', 'RP', 'DT', 'NN', 'MENTION', 'HASHTAG', 'NNP', 'DT', 'JJ', 'NNS', 'VBD', 'JJ', 'NN', 'HASHTAG', 'HASHTAG', 'HASHTAG', 'URL', 'URL']
pos_count           22
RBS                                0.00
``                                 0.00
WRB                                0.00
NN                                 4.00
WDT                                0.00
JJR                                0.00
(                                  0.00
SYM                                0.00
UH                                 0.00
MD                                 0.00
VBZ                                0.00
POS                                0.00
)                                  0.00
DT                                 2.00
CC                                 0.00
JJS                                0.00
EX                                 0.00
VB                                 1.00
CD                                 0.00
VBP                                0.00
VBD                                1.00
RBR                                0.00
RB                                 0.00
NNP                                1.00
JJ                                 2.00
PRP                                0.00
WP                                 0.00
HASHTAG                            4.00
FW                                 0.00
NNS                                2.00
''                                 0.00
NNPS                               0.00
TO                                 1.00
LS                                 0.00
$                                  0.00
MENTION                            1.00
PDT                                0.00
RP                                 1.00
:                                  0.00
IN                                 0.00
,                                  0.00
URL                                2.00
PRP$                               0.00
.                                  0.00
WP$                                0.00
VBN                                0.00
#                                  0.00
VBG                                0.00

Bag of words

Function words (a, the, …) counts
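A sketch of function word counting over the tokenized tweets (the word list here is an illustrative subset, not the final feature set):

```python
# Illustrative subset of English function words.
FUNCTION_WORDS = ['a', 'an', 'the', 'of', 'in', 'on', 'to', 'and', 'or']

def function_word_counts(tokens):
    """Relative frequency of each function word among the tweet's tokens."""
    lowered = [t.lower() for t in tokens]
    n = len(lowered) or 1
    return {w: lowered.count(w) / n for w in FUNCTION_WORDS}
```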

Initial analysis







  • Explanation of my results
  • New results
  • Limitations
  • Future research







  5. TODO? it would be possible to filter usernames by the amount of tweets they have in L1 and the amount of geolocated tweets they have in the L1 country 

  6., data from 2009. 

  7. “Online Human-Bot Interactions: Detection, Estimation, and Characterization” 




  11., 7.3/7.4  2





  16. TODO check this 

  17. This would make it much harder to classify real tweets shorter than 20 characters, TODO test by how much. Also, 20 characters is probably still not enough; 50 would be better, but that requires a bigger dataset.