DNB and Typing

d4b 56% Wed 10 Apr 2019 10:18:00 AM CEST
d4b 33% Wed 10 Apr 2019 10:19:54 AM CEST
d4b 39% Wed 10 Apr 2019 10:21:51 AM CEST
d4b 17% Wed 10 Apr 2019 10:23:47 AM CEST
d3b 86% Wed 10 Apr 2019 10:25:24 AM CEST !
d4b 56% Wed 10 Apr 2019 10:27:22 AM CEST

d4b 56% Wed 10 Apr 2019 03:46:47 PM CEST
d4b 28% Wed 10 Apr 2019 03:48:45 PM CEST
d4b 39% Wed 10 Apr 2019 03:50:51 PM CEST

d4b 50% Wed 10 Apr 2019 05:47:24 PM CEST
d4b 61% Wed 10 Apr 2019 05:49:20 PM CEST
d4b 44% Wed 10 Apr 2019 05:51:17 PM CEST
d4b 22% Wed 10 Apr 2019 05:53:11 PM CEST
d4b 44% Wed 10 Apr 2019 05:55:03 PM CEST
d3b 71% Wed 10 Apr 2019 05:56:36 PM CEST !
d3b 71% Wed 10 Apr 2019 05:58:09 PM CEST !
d3b 86% Wed 10 Apr 2019 05:59:42 PM CEST !
d4b 39% Wed 10 Apr 2019 06:01:41 PM CEST

Random

Day 100, how nice! I’m glad this diary happened :)

Thesis

Today I’ll try to extract as many non-NLP features as possible (counts of things, capitals, punctuation, etc.) and see if they predict anything.

Pandas balancing datasets

One way to balance a dataset, to leave an equal number of points in all classes, is the third answer here:

g = df.groupby('class')
g.apply(lambda x: x.sample(g.size().min()).reset_index(drop=True))

For each group, we draw a random sample the size of the smallest group. Quite beautiful actually.
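A minimal, self-contained sketch of the same idea, for trying it out on toy data. It is a variant, not the quoted answer: it uses DataFrameGroupBy.sample (available in pandas 1.1 and later) instead of apply. The helper name balance_by_class, the toy DataFrame and the fixed random_state are assumptions made for illustration.

```python
import pandas as pd


def balance_by_class(df, label_col="class", random_state=0):
    """Downsample every class to the size of the smallest one.

    Hypothetical helper; label_col and random_state are illustrative defaults.
    """
    # Size of the smallest class decides how many rows each class keeps.
    n_min = df[label_col].value_counts().min()
    # Draw n_min rows from each group, then flatten the index.
    sampled = df.groupby(label_col).sample(n=n_min, random_state=random_state)
    return sampled.reset_index(drop=True)


# Toy data: 5 rows of "a", 3 of "b", 2 of "c".
df = pd.DataFrame({
    "class": ["a"] * 5 + ["b"] * 3 + ["c"] * 2,
    "value": range(10),
})

balanced = balance_by_class(df)
print(balanced["class"].value_counts())
```

Every class ends up with as many rows as the smallest one had, so the balanced frame is smaller than the original; that is the price of downsampling rather than oversampling.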

Basic non-NLP classification

I added the following columns:

  • char_count, word_count, word_density
  • punctuation_count, title_word_count, upper_case_word_count, all relative, that is, divided by the number of words
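A rough sketch of how such columns could be computed for a single text, using only the standard library. The exact definitions are assumptions, not necessarily the ones used here: words are split on whitespace, punctuation means string.punctuation, and word_density is taken as characters per word (with +1 in the denominator to avoid dividing by zero). The function name basic_features is hypothetical.

```python
import string


def basic_features(text):
    """Compute simple surface features for one text (illustrative definitions)."""
    words = text.split()
    n_words = len(words)
    # Guard against empty texts when normalising by word count.
    denom = max(n_words, 1)
    char_count = len(text)
    return {
        "char_count": char_count,
        "word_count": n_words,
        # Assumed definition: characters per word.
        "word_density": char_count / (n_words + 1),
        # The three counts below are relative, i.e. divided by the number of words.
        "punctuation_count": sum(ch in string.punctuation for ch in text) / denom,
        "title_word_count": sum(w.istitle() for w in words) / denom,
        "upper_case_word_count": sum(w.isupper() for w in words) / denom,
    }


features = basic_features("Hello World, this is NLP!")
print(features)
```

With a DataFrame, the same function could be applied row by row to the text column and the resulting dictionaries expanded into columns.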

Basic classification using scikit-learn and the very basic features above. With four languages and the vanilla sklearn.svm.SVC I get the following confusion matrix:

array([[34, 22, 30, 24],
       [26, 49, 24, 29],
       [30, 33, 45, 25],
       [32, 22, 34, 41]])

which is still more than chance! (169 of 500 correct, about 34%, against 25% for random guessing among four classes.)
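To put a number on "more than chance", accuracy can be read straight off the matrix: the diagonal holds the correct predictions. The sketch below assumes the scikit-learn convention that rows are true labels and columns are predictions, which matters for the per-class recall but not for overall accuracy.

```python
# Confusion matrix from above (assumed: rows = true class, columns = predicted class).
confusion = [
    [34, 22, 30, 24],
    [26, 49, 24, 29],
    [30, 33, 45, 25],
    [32, 22, 34, 41],
]

n_classes = len(confusion)
correct = sum(confusion[i][i] for i in range(n_classes))  # diagonal entries
total = sum(sum(row) for row in confusion)                # all test samples
accuracy = correct / total
chance = 1 / n_classes                                    # uniform random guessing

# Recall per class: correct predictions divided by the row total.
per_class_recall = [row[i] / sum(row) for i, row in enumerate(confusion)]

print(f"accuracy: {accuracy:.3f} (chance: {chance:.2f})")
# prints "accuracy: 0.338 (chance: 0.25)"
```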

To print:

Papers with code

Papers with Code is probably the best resource I’ve come across.

Language interference

Reddit about this topic – I think I need to narrow the field of the thesis and focus on two languages.

Ensemble learning

…would be very interesting to use. “Native language identification using ensemble learning” is a nice title. :) And gives me the opportunity to play with really a lot of algorithms.

Recording microphone and speaker

With OBS Studio (Open Broadcaster Software). For now it creates a video; to extract the audio:

ffmpeg -i 2019-04-10\ 18-27-43.flv -map 0:1 -vn output.ac3

Todo: find out how to do this without OBS, or how to make OBS record without video.

Taking screenshots at regular intervals

The command is pretty predictable. For a screenshot every 5 seconds:

while true; do scrot -d 5 '%Y-%m-%d-%H:%M:%S.png' -e 'mv $f ~/Pictures/'; done

Stack

  • Remove delay when inserting sudo password