Subfigures / subfloats, pictures side by side

        \caption{Picture 1}
        \caption{Picture 2}
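
The two `\caption` lines above are the remnants of a subfigure skeleton; a minimal sketch using the `subcaption` package (the `pic1`/`pic2` filenames are placeholders):

```latex
% Requires \usepackage{subcaption} and \usepackage{graphicx} in the preamble.
\begin{figure}
    \begin{subfigure}{0.45\textwidth}
        \includegraphics[width=\linewidth]{pic1}
        \caption{Picture 1}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.45\textwidth}
        \includegraphics[width=\linewidth]{pic2}
        \caption{Picture 2}
    \end{subfigure}
    \caption{Two pictures side by side}
\end{figure}
```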

DNB and typing

505 cpm 98.8%
535 cpm 98.1%


“When I look into the future, it’s so bright it burns my eyes.” — Oprah Winfrey (as quoted here)


This HN thread has an interesting consensus that Colemak > Dvorak. Also:

I firmly believe that any differences or gain that people attribute to Dvorak is attributed to finally learning how to properly type.



I need a plan, which I will formulate later today. Meanwhile, all of the papers by Volkova seem very relevant to what I’m trying to do, though I think after a number of them I’m not getting any additional value. We’ll see.

Papers will be numbered in green as [1]; parts within a paper will be numbered in green, circled, and marked (1).

Main points

  • Shorter texts are harder to analyze – try to download more tweets from the users? And mix them together to remove personal style.
  • Grammar checks and spell-checker replacements are fascinating to do and analyze – do them.
  • Establish a baseline for accuracy.
  • Download and look at the Reddit dataset.


NLI with User-generated Content [1]

Main points
  • The dataset and the L2 project of the Uni of Haifa
  • It has interesting baselines which I can use in my thesis as examples.
  • (1) is an example of why geography != language.
  • Filtering data from multi-language countries.
  • They have 230 million sentences, much more than I have. Is this the number of examples I need?
  • n-grams etc. are very context-dependent – what I mean is, with Twitter and political events happening, they should yield high accuracy on the training set but generalize poorly.
  • Spelling and grammar
    • original word and correction offered by spell checker
    • Grammar checked by LanguageTool (2)
  • Frequencies of function words
  • Trivial baseline for classification tasks
  • Nice table of how much the features were relevant, p.3596 (3)
  • “the personal style of the user may dominate the subtler signal of his or her native language”
  • Substitutions as suggested by spell checker should be fascinating.
  • Shorter texts are much harder (6)
  • Download more tweets by the same users to get a bigger corpus to compare against. Or, to protect myself from the influence of stylistics, just get many, many more tweets.
    • If that goes badly – just categorize users with Twitter metadata as features, downloading their last X tweets in a given language.
  • Much easier to do native/non-native and language-family classification than full language identification.

Interesting language:
  • To the best of my knowledge,
  • Related but different
  • accurate, albeit not perfect, proxy for the NL of the author.
  • Reasonably robust to
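
The function-word frequencies mentioned above are easy to compute; a minimal sketch with a tiny hand-picked function-word list (real feature sets are much larger – this one is just illustrative):

```python
from collections import Counter

# Tiny illustrative function-word list; papers typically use hundreds of words.
FUNCTION_WORDS = ["the", "a", "of", "to", "in", "and", "that", "with"]

def function_word_freqs(text):
    """Relative frequency of each function word in a whitespace-tokenized text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = max(len(tokens), 1)  # avoid division by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]

vec = function_word_freqs("The cat sat on the mat with a hat")
```

Because these features ignore content words, they should be fairly topic-independent – which is exactly why they are popular for authorship and NLI work.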

NATLID: Native language identification [2]

  • g authorship profiles
  • NLI is challenging even for humans.
  • In a CNN, smaller (2-3) filter sizes work better (3)
  • Ensemble model, with a voting scheme proportional to the accuracy of the models (4)
  • Spoken responses are easier for NLI because written ones are supposed to be more formal and thought-out.
    • Which might be a win-win for my Twitter data. Should be much less thought out than essays, and much more of the L1 should “shine” through them!
  • Highest misclassification rates are between close languages – this is why I added BR to MX.
  • g arc length, downtoners and intensifiers, production rules, subject agreement – as features. (2)
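
The accuracy-weighted voting scheme noted for [2] can be sketched as follows (weighting each model by its held-out accuracy is my reading of the idea; names and shapes are illustrative, not from the paper):

```python
import numpy as np

def weighted_vote(probas, accuracies):
    """Combine per-model class probabilities, weighting each model by its
    (held-out) accuracy, then pick the argmax class per sample.

    probas: list of (n_samples, n_classes) arrays, one per model
    accuracies: one accuracy score per model, used as a vote weight
    """
    w = np.array(accuracies, dtype=float)
    w = w / w.sum()                          # normalize weights to sum to 1
    stacked = np.stack(probas)               # (n_models, n_samples, n_classes)
    combined = np.einsum("m,msc->sc", w, stacked)
    return combined.argmax(axis=1)
```

A stronger model can thus be outvoted only if several weaker models disagree with it confidently.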

A brief survey of text mining: Classification, clustering and extraction techniques [3]

  • Hard vs soft classification: hard gives you a single predicted label, soft gives you a probability per class.
    • SVMs do hard classification, but you can modify them to output probabilities.
  • Naive Bayes assumes independent input variables.
  • SVMs rarely need feature engineering because they select support vectors themselves, and they’re robust to high dimensionality. Text classification data is mostly sparse – a really good use case for SVMs. (3) is a description of why SVMs work so well for NLI.
  • (4) “Text representation has a very large dimensionality, but the underlying data is sparse.” - ESPECIALLY SO FOR MY USE CASE.

  • In topic models, docs are a mixture of topics, where a topic is a probability distribution over words
  • Read through again, understand and grok (1) all vector space models and math.
  • Grok Naive bayes.
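
The hard-vs-soft distinction and the sparse-text point above can be seen in a toy sketch (the corpus and labels are made up; `LinearSVC` and `MultinomialNB` stand in for the classifier families discussed):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Toy corpus with two spelling "varieties" -- purely illustrative labels.
texts = ["colour behaviour centre", "color behavior center",
         "colour theatre", "color theater"]
labels = [0, 1, 0, 1]

X = TfidfVectorizer().fit_transform(texts)  # sparse, high-dimensional matrix

svm = LinearSVC().fit(X, labels)            # hard classifier: labels only
nb = MultinomialNB().fit(X, labels)         # soft classifier: probabilities

preds = svm.predict(X)              # hard labels
margins = svm.decision_function(X)  # signed distances; can be calibrated into probabilities
probs = nb.predict_proba(X)         # shape (4, 2), each row sums to 1
```

Note that `X` stays sparse end to end – both classifiers accept it directly, which is part of why they suit text so well.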

Text classification: A recent overview [4]

  • Text classification is not too different from general ML; the main problem is text representation.
  • Binary BoW is not less effective than counting, because words are unlikely to repeat
    • Especially so in my use case!
  • Imbalanced data problem - cost sensitive learning is needed (3)
  • Accuracy is not a good metric for an imbalanced dataset.
    • Both of the above point to a source [4]
  • g Latent Semantic Indexing
  • g source [4] in this paper
  • Ensemble learning, with two versions: one with text-dependent features (BoW, n-grams) and one without; give results for both (or give the user a choice if I make it an app at the end)
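
The imbalanced-data points above can be sketched on synthetic data (`class_weight="balanced"` is one standard cost-sensitive option; the data and numbers are made up for illustration):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(0)
# Synthetic imbalanced data: 95 majority samples, 5 minority samples.
X = np.vstack([rng.normal(0.0, 1, (95, 2)), rng.normal(2.5, 1, (5, 2))])
y = np.array([0] * 95 + [1] * 5)

plain = LinearSVC().fit(X, y)
weighted = LinearSVC(class_weight="balanced").fit(X, y)  # cost-sensitive weights

# Accuracy can look good while the minority class is largely ignored;
# macro-averaged F1 treats both classes equally and exposes that.
for name, clf in [("plain", plain), ("weighted", weighted)]:
    pred = clf.predict(X)
    print(name, accuracy_score(y, pred), f1_score(y, pred, average="macro"))
```

`class_weight="balanced"` scales each class’s misclassification cost inversely to its frequency, so the minority class stops being cheap to ignore.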

Worst case scenario plan

Just do native vs non-native, which should be pretty easy. Define native as “UK”.

For tomorrow:

  • g adaptor grammar collocations, Stanford dependencies, CFG rules, Tree Substitution Grammar fragments. (Reddit paper, (2))
  • g weaker but more robust (3)