Here I will be jotting down the basic structure of the thesis and the basic results. I may or may not convert it to LaTeX at some point.

// This example in English is exactly what I need.

Introduction

Theoretical part

Literature

Main part

First language identification

With a working footnote.[^one]

Dataset

The dataset was created from geo-tagged tweets. Twitter [1] is a social network with 126 million daily active users [2], on which users interact with each other through short messages (tweets). Tweets were originally restricted to 140 characters, but in 2017 the limit was doubled to 280 for most languages. Users can optionally attach a location to their tweets, and searches can be filtered by that location. The location can be either exact GPS coordinates or a Twitter “Place”, which is a polygon with additional semantic information attached to it, such as a city name [3][4].

Additionally, Twitter automatically identifies the language of each tweet. It is therefore possible to filter tweets both by their GPS location (if they are geo-tagged) and by their language.
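
As a minimal sketch of what this filtering looks like for a single tweet object (assuming parsed JSON from the v1.1 API; the field names coordinates, place and lang follow the public data dictionary, while the target language code is a placeholder):

```python
# Sketch: decide whether a parsed v1.1 tweet object is usable for the dataset.
# It must carry a location (exact coordinates or a Place) and match the
# target language; "en" here is a placeholder, not the thesis configuration.
TARGET_LANG = "en"

def usable(tweet: dict) -> bool:
    has_coordinates = tweet.get("coordinates") is not None  # exact GPS point
    has_place = tweet.get("place") is not None              # Place polygon
    return (has_coordinates or has_place) and tweet.get("lang") == TARGET_LANG
```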

Motivation

Advantages

Drawbacks

A user-generated dataset like the one used here is not a “clean” dataset.

Twitter-specific issues
  • For the purposes of this thesis, we assume that the tweet location is a good proxy for the first language of the author, which is not always the case: someone writing English tweets while located in a certain country is not necessarily a native speaker of that country’s language. {It would be possible to filter usernames by the amount of tweets they have in L1 and the amount of geolocated tweets they have in the L1 country; see the first sketch after this list.} This will impact the results of any classification.
  • Not all tweets are written by real people: Twitter has a large number of bots and of automatically posted tweets. By some estimates, as much as 24% of tweets are created by bots [5], and 9% to 15% of active users are bots [6]. While some bots are operated by news agencies, hotel chains and similar organisations, and so produce language patterns relevant to this thesis, many are weather bots, financial aggregation bots, or novelty bots [7] whose text is not representative of the real-life language used by L1 native speakers.
  • In line with the point above, tweets can be automatically generated. For example, some users configure automatic cross-posting of their activity from other social networks, which creates highly formulaic tweets (“%username% just posted a photo on his Instagram account: %link%”). Twitter used to have an API feature to show and/or filter posts made from other clients, but removed it [8] as part of a policy aimed at shifting the focus from third-party apps to its official website and clients [9]. This means that such content has to be found and filtered manually, which was attempted (see the second sketch after this list) but should not be considered a fully effective solution. (See TODO).
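
The user-level filtering heuristic from the note above could look roughly as follows; this is only a sketch, and the per-user counts and thresholds are assumptions, not values used in the thesis:

```python
# Sketch of the proposed user-level heuristic: keep a user only if most of
# their tweets are in the target L1 and most of their geolocated tweets fall
# inside the L1 country. Thresholds are illustrative assumptions.
L1_SHARE_THRESHOLD = 0.8
GEO_SHARE_THRESHOLD = 0.8

def plausible_l1_speaker(n_tweets_total: int,
                         n_tweets_in_l1: int,
                         n_geo_tweets: int,
                         n_geo_tweets_in_l1_country: int) -> bool:
    if n_tweets_total == 0 or n_geo_tweets == 0:
        return False  # not enough information about this user
    return (n_tweets_in_l1 / n_tweets_total >= L1_SHARE_THRESHOLD
            and n_geo_tweets_in_l1_country / n_geo_tweets >= GEO_SHARE_THRESHOLD)
```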
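
Formulaic cross-posted tweets can only be detected on the text level. A sketch of such a pattern filter is given below; the patterns are illustrative examples of common cross-posting templates, not the exact list used here:

```python
import re

# Illustrative templates of automatically cross-posted content.
CROSSPOST_PATTERNS = [
    re.compile(r"just posted a (photo|video)", re.IGNORECASE),             # Instagram
    re.compile(r"^I liked a @YouTube video", re.IGNORECASE),               # YouTube likes
    re.compile(r"^I added a video to a @YouTube playlist", re.IGNORECASE),
]

def looks_cross_posted(text: str) -> bool:
    """Return True if the tweet text matches a known cross-posting template."""
    return any(pattern.search(text) for pattern in CROSSPOST_PATTERNS)

# Usage: tweets = [t for t in tweets if not looks_cross_posted(t["text"])]
```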

Issues stemming from the type of data used
  • Tweets can (in the general case) be up to 280 characters long, but in practice are usually much shorter. This means, among other things, that only a limited number of grammatical (or other) features can be inferred from a single tweet.

    • uk @LillyMaryPinto It will be either uncle Quotrochi;s ghost who is not getting the Scotch as the Govt stopped arms brokers. Supreme court should be notified about this incident as they also will be affected in the cross fire in Rafale
    • uk Farmers are really affected by climate change, so can we build on that & use this to encourage producers to start acting for their own interest? - a great point by Nusa Arbancic from @ChangingMarkets #GrowGreen2019
    • sa “MIS day 📈💻” \n\n3 weeks left https://t.co/h5IKRKV6dv
    • uk Good luck to the St Ita’s Netball Squad competing today in the Belfast Primary Netball finals event in Lisburn Racquets #nischools
    • uk Why is #Facebook asking me about politics and if I’m registered to vote? https://t.co/7t064AefBh
    • uk @bill_easterly In the end, both black holes and economics are about information, so why not? However, physicists know far more about information in their profession than economists know about information in theirs.
  • Twitter’s language detection is not perfect.
  • In NLI tasks, usually, voice+

Collection

https://developer.twitter.com/en/docs/tweets/filter-realtime/guides/basic-stream-parameters.html
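
The collection itself goes through the statuses/filter streaming endpoint documented at the link above. A minimal sketch, assuming the tweepy library (3.x API); the credentials, bounding box, language and output path are placeholders rather than the actual values used:

```python
import json
import tweepy

# Placeholder credentials for the Twitter API (OAuth 1.0a).
CONSUMER_KEY = "..."
CONSUMER_SECRET = "..."
ACCESS_TOKEN = "..."
ACCESS_TOKEN_SECRET = "..."

# Bounding box as (SW longitude, SW latitude, NE longitude, NE latitude);
# this one roughly covers the British Isles and is only an example.
BOUNDING_BOX = [-10.8, 49.8, 1.8, 59.4]

class CollectListener(tweepy.StreamListener):
    def on_status(self, status):
        # Append every received tweet as one JSON line.
        with open("tweets.jsonl", "a", encoding="utf-8") as out:
            out.write(json.dumps(status._json) + "\n")

    def on_error(self, status_code):
        # Disconnect on HTTP 420 (rate limiting) to avoid longer bans.
        return status_code != 420

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
stream = tweepy.Stream(auth=auth, listener=CollectListener())
stream.filter(locations=BOUNDING_BOX, languages=["en"])
```

Filtering by locations in the stream already restricts the results to geo-tagged tweets, since only tweets with coordinates or a Place can match a bounding box; the language check is applied again during cleanup.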

Cleanup

Features

Approaches

Bayes

SVM

Approach 3

Results

Discussion

  • Explanation of my results
  • New results
  • Limitations
  • Future research

Summary

References

  1. http://twitter.com 

  2. https://www.washingtonpost.com/technology/2019/02/07/twitter-reveals-its-daily-active-user-numbers-first-time/?utm_term=.4f7f009e07fa 

  3. https://developer.twitter.com/en/docs/tutorials/filtering-tweets-by-location.html 

  4. https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/geo-objects 

  5. https://sysomos.com/inside-twitter/most-active-twitter-user-data/, data from 2009. 

  6. https://arxiv.org/pdf/1703.03107.pdf “Online Human-Bot Interactions: Detection, Estimation, and Characterization” 

  7. https://blog.mozilla.org/internetcitizen/2018/01/19/10-twitter-bots-actually-make-internet-better-place/ 

  8. https://thenextweb.com/twitter/2012/08/28/twitter-longer-displays-client-tweet-posted-web-emphasizing-first-party-reading-experience/ 

  9. https://www.theverge.com/2018/5/16/17362138/twitter-api-third-party-apps-changes-explained