BA/app

Primary sources

Computer linguistics: CL intro

Genetic Algorithms: An introduction to genetic algorithms

Linguistics: Dissertation partly about interference. Has a nice error classification, error taxonomy, borrowing, transfer, etc. Seems like a nice intro to "What exists"

CL/ML resources

Text classification

Natural language classification with Python: Book, especially learning to classify text

With machine learning:

With TensorFlow and neural networks in general: [1]
Machine Learning, NLP: Text Classification using scikit-learn, python and NLTK
Working with scikit and text data

Error Detection

Error detection using local word bigram and trigram + some others
Automatic error analysis of machine translation output -- more about possible errors and ways to classify them
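A minimal sketch of what bigram-based error detection could look like, assuming a reference corpus of correct English is available. The toy corpus and the function names (`train_counts`, `suspicious_ngrams`) are made up for illustration and are not taken from the linked paper; the idea is only to flag word pairs never seen in the reference text.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def train_counts(sentences, n=2):
    """Count n-grams over a reference corpus of tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        counts.update(ngrams(tokens, n))
    return counts

def suspicious_ngrams(tokens, counts, n=2, min_count=1):
    """Flag n-grams seen fewer than min_count times in the reference corpus."""
    return [g for g in ngrams(tokens, n) if counts[g] < min_count]

# Toy reference corpus (hypothetical, stands in for a large native-English corpus).
reference = [
    "we will meet tomorrow".split(),
    "we will meet at noon".split(),
    "they will meet us tomorrow".split(),
]
bigram_counts = train_counts(reference, n=2)

# Learner sentence with the typical German-influenced "meet us".
flagged = suspicious_ngrams("we will meet us at noon".split(), bigram_counts)
print(flagged)
```

Note that "meet us" itself is not flagged here because it occurs legitimately in the reference corpus; only the following bigram is. This is probably why the paper combines local n-grams with other signals.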

Somewhat similar problems being solved

Cross-cultural Deception Detection. It uses unigrams + LIWC (which is more psychological and less relevant)

Deception detection -- has examples of extracted features which I might use
[2] -- lie detector
Linguistic Cues to Deception Assessed by Computer Programs: A Meta-Analysis -- also ideas of possible features that might be interesting to look into.

Think about sentiment detection etc

Most of those things are solved via Bag of Words which won't be enough for me, I think
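A small sketch of why plain Bag of Words may not be enough here: it throws away word order, so a sentence with non-native word order gets exactly the same representation as the correct one. The example sentences are my own, not from any of the sources.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count each token."""
    return Counter(text.lower().split())

native = bag_of_words("I have seen him yesterday")
# Same words in a German-like order.
learner = bag_of_words("I have him yesterday seen")

# Both sentences map to the same bag, so a BoW classifier cannot tell them apart.
same_bag = native == learner
print(same_bag)
```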

Linguistics

Typical errors

Russian

Similar-sounding and semantically non-identical words + idioms
Grammar. Articles, connecting verbs, future tense, negative sentences, commas, etc. -- really nice.

German

list of sentences
also examples, hard to generalize
examples, a bit better ones?
German reflexive verbs list which could be used to see differences between English and German reflexive verbs.

Indian

Wikipedia - Indian English. I think this could be done just statistically?

Italian

??? Try Italian or French, depending on what works better?

Attack plan

Analyze errors in the languages manually
Try to see if I can use them to formulate the best "features"?
Manually collect some very typical errors ("we'll meet us" for German, for example) and also use them as features with a bigger weight?
1. Bonus points if I make it expandable as a framework -- manually adding such features should be easy
See if automatic error detection could be used, and the type of errors also as feature?
Get sources
1. Scrape English language learning forums from the respective countries
2. Ask some English teachers & Auslandsamt for (anonymized?) examples of students' essays?
3. I think quite a lot of such things are needed
ML, CL, classifier magic
See what happens
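The "manually collected errors as weighted features" idea from the plan could be sketched roughly like this. Everything here is hypothetical (the `PatternFeature` class, the feature names, the weights); only the "we'll meet us" example comes from the notes above. The point is that adding a new manual feature is one line.

```python
import re
from dataclasses import dataclass

@dataclass
class PatternFeature:
    """A hand-written error pattern with a manually chosen weight."""
    name: str
    pattern: str
    weight: float = 1.0

    def score(self, text):
        # Number of case-insensitive matches, scaled by the weight.
        hits = len(re.findall(self.pattern, text, flags=re.IGNORECASE))
        return hits * self.weight

# Registry of manual features; extend by appending new entries.
FEATURES = [
    # From the notes: typical German reflexive-verb transfer.
    PatternFeature("de_meet_us", r"\bwe(?:'ll| will) meet us\b", weight=3.0),
    # Hypothetical placeholder, only to show a second entry.
    PatternFeature("generic_informations", r"\binformations\b", weight=1.0),
]

def extract_features(text, features=FEATURES):
    """Return a feature-name -> weighted-count dict for one text."""
    return {f.name: f.score(text) for f in features}

vector = extract_features("I think we'll meet us at the station.")
print(vector)
```

The resulting dict could then be merged with whatever automatic features the classifier uses.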

Possible issues:

"Bag of words" and similar would be very biased towards my specific sources. If I scrape German English from a language learning forum and Russian English from students' essays about nature, any text about nature will be classified as Russian and any text about learning languages as German
Age, topics, etc etc etc?
Not enough source material for a corpus
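One possible way around the topic bias (my assumption, not something from the sources above) would be to look only at function words, which should depend much less on the topic than content words. A rough sketch with a tiny hand-picked word list:

```python
from collections import Counter

# Tiny hand-picked list for illustration; a real one would be much longer.
FUNCTION_WORDS = ("the", "a", "an", "of", "in", "on", "at", "to", "not", "will")

def function_word_profile(text):
    """Relative frequency of each function word; content words are ignored."""
    tokens = text.lower().split()
    counts = Counter(t for t in tokens if t in FUNCTION_WORDS)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {w: counts[w] / total for w in FUNCTION_WORDS}

# Two made-up sentences on different topics with the same structure.
nature = function_word_profile("the forest is full of birds in the spring")
forum = function_word_profile("the course is full of tips in the forum")

# Same profile despite the different topics.
print(nature == forum)
```

Whether function words alone carry enough signal about the native language is something I would have to test.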

Random

Natural Language Annotation for Machine Learning ebook, seems to cover quite a lot

downloads and demos -- datasets for CL lying detection -- generally interesting

Word embeddings in classification: a very interesting method, another variant

Classification-as-a-service with free examples. Gender, MBTI, etc etc etc, pretty nice
