Subfigures / subfloats, pictures side by side
DNB and typing
505 cpm at 98.8% accuracy; 535 cpm at 98.1% accuracy.
“When I look into the future, it’s so bright it burns my eyes.” — Oprah Winfrey (as quoted here)
This HN thread has an interesting consensus that Colemak > Dvorak. Also:
I firmly believe that any differences or gain that people attribute to Dvorak is attributed to finally learning how to properly type.
I need a plan, which I will formulate later today. Meanwhile, the papers by Volkova seem very relevant to what I’m trying to do, though after reading a number of them I’m not getting any additional value out of them. We’ll see.
Papers will be numbered in green; parts within a paper will be numbered in green inside a circle and marked (1).
- Shorter texts are harder to analyze – try downloading more tweets from the same users? And mix them to remove individual style.
- Grammar checks and replacements are fascinating to do and analyze; do them
- Baseline for accuracy
- Download and look at the Reddit dataset
NLI with User-generated Content 
- The dataset and the L2 project of the University of Haifa
- It has interesting baselines which I can use in my thesis as examples.
- (1) is an example of why geography != language
- Filtering data from multi-language countries.
- They have 230 million sentences, many more than I have. Is this the number of examples I need?
- n-grams etc. are very context-dependent – what I mean is, with Twitter and the political stuff happening, they should yield high accuracy on the training set but generalize poorly.
- Spelling and grammar (a sketch of these features follows this list)
- original word and correction offered by spell checker
- Grammar checked by LanguageTool (2)
- Frequencies of function words
- Trivial baseline for classification tasks
- Nice table of how relevant each of the features was, p. 3596 (3)
- "”the personal style of whe user may dominate thesubtler signal of his or her native language”
- Substitutions as suggested by spell checker should be fascinating.
- Shorter texts are much harder (6)
- Download more tweets by the same users to get a bigger corpus to compare against. Or, to guard against the influence of individual style, just get many, many more tweets.
- If it goes badly – just categorize users using Twitter metadata as features, and by downloading their last X tweets in a given language.
- Much easier to do native/nonnative and language family than language
- To the best of my knowledge,
- Related but different
- accurate, albeit not perfect, proxy for the NL of the author.
- Reasonably robust to
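To make the spelling/grammar and function-word features above concrete, here is a minimal sketch. It assumes the `pyspellchecker` package and scikit-learn; the function-word list and tweets are toy placeholders, not anything from the papers.

```python
# Two feature ideas from the notes above, sketched with toy data.
from spellchecker import SpellChecker
from sklearn.feature_extraction.text import CountVectorizer

# Toy function-word list (a real one would be much longer).
FUNCTION_WORDS = ["the", "a", "of", "in", "to", "and", "that", "is"]

# 1) Frequencies of function words: a CountVectorizer restricted to a
#    fixed vocabulary, so the features are stylistic, not topical.
fw_vectorizer = CountVectorizer(vocabulary=FUNCTION_WORDS)

# 2) (original word, suggested correction) pairs from a spell checker,
#    usable as categorical features, e.g. "wich->which".
spell = SpellChecker()

def substitution_features(text):
    tokens = text.lower().split()
    pairs = []
    for tok in spell.unknown(tokens):
        correction = spell.correction(tok)
        if correction and correction != tok:
            pairs.append(f"{tok}->{correction}")
    return pairs

tweets = ["I no wich way to go", "the cat sat on the mat"]
print(fw_vectorizer.fit_transform(tweets).toarray())
print([substitution_features(t) for t in tweets])
```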
NATLID: Native language identification 
- g authorship profiles
- NLI is challenging even for humans.
- in a CNN, smaller (2–3) filter sizes work better (3)
- Ensemble model, with a voting scheme weighted proportionally to the accuracy of the individual models (4); see the sketch after this list
- Spoken responses are easier for NLI, because written ones tend to be more formal and thought-out.
- Which might be a win for my Twitter data: tweets should be much less thought-out than essays, and much more of the L1 should “shine” through them!
- Highest misclassification is between close languages – this is why I added BR to MX.
- g arc length, downtoners and intensifiers, production rules, subject agreement – as features. (2)
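A minimal sketch of the accuracy-weighted voting idea from (4), assuming scikit-learn’s VotingClassifier; the models and data are toy placeholders.

```python
# Ensemble whose votes are weighted by each model's held-out accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

models = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("nb", GaussianNB()),
    ("svm", SVC(probability=True)),  # probability needed for soft voting
]

# Weight each model proportionally to its validation accuracy.
weights = [clf.fit(X_tr, y_tr).score(X_val, y_val) for _, clf in models]

ensemble = VotingClassifier(models, voting="soft", weights=weights)
ensemble.fit(X_tr, y_tr)
print(ensemble.score(X_val, y_val))
```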
A brief survey of text mining: Classification, clustering and extraction techniques 
- hard vs. soft classification: hard is when you get a single predicted label, soft is when you get a probability for each label.
- SVMs do hard classification, but you can modify them to give a probability (see the sketch after this list)
- Naive Bayes works for independent input variables.
- SVMs rarely need feature engineering because they select support vectors themselves. They’re robust to high dimensionality. Text classification data is mostly sparse – a really good use case for SVMs. (3) is a description of why SVMs work so well for NLI.
(4) “Text representation has a very large dimensionality, but the underlying data is sparse.” - ESPECIALLY SO FOR MY USE CASE.
- In topic models, documents are a mixture of topics, where a topic is a probability distribution over words
- Read through again, understand and grok (1) all the vector space models and math.
- Grok Naive Bayes.
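A minimal sketch of hard vs. soft classification with an SVM, assuming scikit-learn: plain SVC gives labels (“hard”), while probability=True adds Platt scaling so predict_proba gives a distribution (“soft”). Toy data only.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

hard = SVC().fit(X, y)
print(hard.predict(X[:3]))        # hard: class labels only

soft = SVC(probability=True).fit(X, y)
print(soft.predict_proba(X[:3]))  # soft: P(class | x) per class
```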
Text classification: A recent overview 
- Text classification is not too different from general ML; the main problem is text representation.
- BoW (word presence) is not less effective than counting occurrences, because words are unlikely to repeat
- Especially so in my use case!
- Imbalanced data problem – cost-sensitive learning is needed (3); see the sketch after this list
- Accuracy is not a good metric for an imbalanced dataset.
- Both of the above point to a source 
- g Latent Semantic Indexing
- g source  in this paper
- Ensemble learning, with two versions: one with text-dependent features (BoW, n-grams) and one without; give results for both (or give the user a choice if I make it an app at the end)
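A minimal sketch of the two imbalance points above, assuming scikit-learn: cost-sensitive learning via class_weight="balanced", and macro-F1 instead of accuracy. The 95/5 dataset is a toy placeholder, not the thesis setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# 95/5 class imbalance: accuracy alone rewards always predicting
# the majority class, which is why macro-F1 is reported too.
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000,
                              class_weight="balanced").fit(X_tr, y_tr)

for name, clf in [("plain", plain), ("cost-sensitive", weighted)]:
    acc = clf.score(X_te, y_te)
    f1 = f1_score(y_te, clf.predict(X_te), average="macro")
    print(f"{name}: accuracy={acc:.2f} macro-F1={f1:.2f}")
```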
Worst case scenario plan
Just do native vs. non-native, which should be pretty easy. Define native as “UK” (sketched below).
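A minimal sketch of the label collapse; the country codes are toy placeholders.

```python
# Collapse country labels into a binary native/non-native target,
# with "UK" treated as native (worst-case plan above).
countries = ["UK", "MX", "BR", "UK", "DE"]
labels = ["native" if c == "UK" else "nonnative" for c in countries]
print(labels)  # ['native', 'nonnative', 'nonnative', 'native', 'nonnative']
```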
- g adaptor grammar collocations, Stanford dependencies, CFG rules, Tree Substitution Grammar fragments. (Reddit paper, (2))
- g weaker but more robust (3)