Current results

I looked again at the confusion matrix, after making a copy. It’s quite interesting:

array([[29, 14, 28, 26],
       [38, 57, 36, 27],
       [52, 18, 58, 28],
       [18, 14, 18, 39]])

This is a simple SVM, using extremely simple features, and 2000 examples per class. The columns/rows are, in order: ar, jp, lib, it. The first source of error is that Arabic and the countries around Libya are, in my dataset, linguistically quite similar, and we can see that they get confused quite often, in both directions. Italy and Japan do much better.
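A quick sanity check on the matrix above, assuming rows are the true classes and columns the predicted ones (a minimal sketch, just numpy):

```python
import numpy as np

# The confusion matrix from above; rows assumed to be true classes,
# columns predicted classes, in the order ar, jp, lib, it.
cm = np.array([[29, 14, 28, 26],
               [38, 57, 36, 27],
               [52, 18, 58, 28],
               [18, 14, 18, 39]])

labels = ["ar", "jp", "lib", "it"]

# Per-class recall: correct predictions divided by the row total.
recall = np.diag(cm) / cm.sum(axis=1)
for label, r in zip(labels, recall):
    print(f"{label}: {r:.2f}")

# Overall accuracy, to compare against the 25% chance baseline.
accuracy = np.trace(cm) / cm.sum()
print(f"accuracy: {accuracy:.2f}")
```

This puts the overall accuracy at about 0.37, clearly above the 0.25 chance level for four balanced classes.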

  • Get more and better (linguistically more different) data.
  • Work with more interesting features.

Still, I find this very promising, and definitely better than chance. And logically it makes sense. I’ll continue.

Countries with the most Twitter users

The list. I’ll stick to Japan, the UK, SA, Brazil and India – quite different from each other, both geographically and linguistically. I’ll leave the US out, it’s too mixed.

Bounding boxes

This is the picker. The DublinCore format uses the identical coordinate order to what Twitter wants!
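For reference, a minimal sketch of turning a picked bounding box into the string the Twitter streaming API’s `locations` filter expects: longitude,latitude pairs, southwest corner first, then northeast (the helper name is my own):

```python
# Hypothetical helper: format a bounding box for Twitter's `locations`
# filter, which takes lon,lat pairs, southwest corner first.
def format_locations(sw_lon, sw_lat, ne_lon, ne_lat):
    return ",".join(str(c) for c in (sw_lon, sw_lat, ne_lon, ne_lat))

# Example: a rough box around San Francisco.
print(format_locations(-122.75, 36.8, -121.75, 37.8))
# -122.75,36.8,-121.75,37.8
```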

Probably the plan would be

  • Getting the dataset
    • Besides the 5 languages I already have, add one similar to those already available, to see how much confusion between the two I get at the end.
      • Added Mexico!
  • Preprocessing
    • Replace URLs and @mentions by tags.
    • Replace the actual words with their POS Tags
      • Leaving the emoticons alone, since their use is probably quite geographically distinctive
      • Leaving the usual punctuation and stop-words alone, since they are probably exactly what I need
    • Remove all users whose username contains ‘bot’
    • Find all tweets that are too similar to each other by some metric and remove them as well
      • This would work much better than anything I could do manually, I can’t think of all possible robotic tweets
    • Then tokenize the resulting thing the usual way
  • Ensemble learning
    • I can get a number of classifiers and use some kind of voting procedure
    • BoW is counterproductive in my case, because of too many geographical and topic names. BUT it would be fascinating to get tweets from the same authors a number of years back and compare whether BoW gets less effective on these old tweets. I think it would be too focused on the ephemeral Twitter universe: if there’s an election in Brazil, it will happily decide that all tweets containing ‘election’ are Brazilian – a comparison with old tweets would help me test this hypothesis. And give the user a choice at the end whether the prediction should use everything or everything except BoW.
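A minimal sketch of the preprocessing steps above, leaving out POS tagging; the placeholder tag names and the similarity threshold are my own assumptions, and the greedy near-duplicate pass is just one possible metric:

```python
import re
from difflib import SequenceMatcher

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")

def preprocess(text):
    """Replace URLs and @mentions by placeholder tags (tag names assumed)."""
    text = URL_RE.sub("<URL>", text)
    text = MENTION_RE.sub("<MENTION>", text)
    return text

def is_bot(username):
    """Drop users whose username contains 'bot'."""
    return "bot" in username.lower()

def drop_near_duplicates(tweets, threshold=0.9):
    """Greedy near-duplicate removal; O(n^2), fine for a prototype.
    The 0.9 threshold is an assumption to be tuned."""
    kept = []
    for t in tweets:
        if all(SequenceMatcher(None, t, k).ratio() < threshold for k in kept):
            kept.append(t)
    return kept

tweets = [
    "Check this out https://example.com @friend",
    "Check this out https://example.org @friend",  # near-duplicate
    "Totally different tweet about the weather",
]
cleaned = drop_near_duplicates([preprocess(t) for t in tweets])
```

After tag replacement the first two tweets become identical, so the similarity pass catches robotic near-duplicates that differ only in their links or mentions.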

To research

  • Author profiling
    • By what markers is this usually done? Can I use some of them?

For tomorrow/later

  • Finish doing the preprocessing script
    • In: the .csv
    • Out: Whatever I can import in Jupyter, with all the features etc


Leave rows with values from a certain list

df[df['co'].isin(['uk', 'in'])] keeps the rows where co == 'uk' or co == 'in'.
For multiple conditions: df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]
TODO: Why is .loc used here?
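A quick runnable check (the toy data is made up here). On the .loc question: a plain boolean mask works with `df[mask]` too, but `.loc` makes the row-selection intent explicit and also lets you pick columns in the same step:

```python
import pandas as pd

df = pd.DataFrame({
    "co": ["uk", "in", "jp", "uk"],
    "n":  [1, 2, 3, 4],
})

# Keep rows whose country code is in a list.
subset = df[df["co"].isin(["uk", "in"])]

# Multiple conditions need element-wise & / | with parentheses.
mid = df.loc[(df["n"] >= 2) & (df["n"] <= 3)]

# .loc is not strictly required for a boolean mask, but it lets you
# select rows and columns at the same time:
codes = df.loc[df["n"] >= 2, "co"]
```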


  • Would putting an uninterrupted block of learning at the very beginning of my day help me?
  • This might become a very nice experiment – do it for 30 days and see what happens. If I sleep well, I’m at my best in the mornings, apparently.
  • Publishing papers with markdown


Has a config file! This opened a new universe for me too.

Nearlyfreespeech ssh via public key

The key needs to be added via the panel; adding it to the user folder as usual does not work.