BA/app

From Fiamma

Primary sources

Computer linguistics: CL intro

Genetic Algorithms: An introduction to genetic algorithms

Linguistics: Dissertation partly about interference. Has a nice error classification and error taxonomy, and covers borrowing, transfer, etc. Seems like a nice intro to "what exists".

CL/ML resources

Text classification

Natural language classification with Python: book, especially the part on learning to classify text

With machine learning:

Error Detection

Error detection using local word bigrams and trigrams + some others

Automatic error analysis of machine translation output -- more about possible errors and ways to classify them
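A minimal sketch of the bigram/trigram idea, as I understand it: collect the word n-grams seen in known-good English and flag any n-gram in a new sentence that never occurs there. This is not the method from the linked resources, just a toy illustration; the function names and the two-sentence reference "corpus" are made up, and a real version would need a large corpus and frequency thresholds instead of a plain seen/unseen check.

```python
def ngrams(tokens, n):
    """Return all consecutive n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_reference(sentences, n):
    """Collect every n-gram that occurs in the (assumed correct) reference sentences."""
    seen = set()
    for sentence in sentences:
        seen.update(ngrams(sentence.lower().split(), n))
    return seen

def flag_unseen(sentence, reference, n):
    """Return the n-grams of `sentence` that never occur in the reference set."""
    return [g for g in ngrams(sentence.lower().split(), n) if g not in reference]

# Hypothetical toy reference data; far too small for real use.
reference_sentences = ["we will meet tomorrow", "we will meet at six"]
bigram_reference = build_reference(reference_sentences, 2)
trigram_reference = build_reference(reference_sentences, 3)

learner_sentence = "we will meet us tomorrow"
flagged_bigrams = flag_unseen(learner_sentence, bigram_reference, 2)
flagged_trigrams = flag_unseen(learner_sentence, trigram_reference, 3)
print(flagged_bigrams)
print(flagged_trigrams)
```

The flagged n-grams (or their counts) could then serve as input features for a classifier rather than as a final verdict on correctness.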

Somewhat similar problems being solved

Cross-cultural Deception Detection. It uses unigrams + LIWC (which is more psychological and less relevant)


Think about sentiment detection etc

Most of those problems are solved via bag of words, which I think won't be enough for me
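A quick sketch of why plain bag of words may not be enough here: it throws away word order, so a sentence and a scrambled copy of it look identical, while word n-grams keep at least the local order. The tokenizer is deliberately naive and the example sentences are only for illustration.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()

def bag_of_words(text):
    """Unordered word counts: all information about word order is dropped."""
    return Counter(tokenize(text))

def word_ngrams(text, n):
    """Counts of consecutive n-word sequences, which keep local word order."""
    tokens = tokenize(text)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

a = "we will meet us tomorrow"
b = "tomorrow us we will meet"   # same words, different order

same_bag = bag_of_words(a) == bag_of_words(b)
same_bigrams = word_ngrams(a, 2) == word_ngrams(b, 2)
print(same_bag, same_bigrams)
```

The bags are equal but the bigram counts are not, which is the kind of difference a word-order error would need.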

Linguistics

Typical errors

Russian

German

Indian

Italian

??? Try Italian or French, depending on what works better?

Attack plan

  1. Analyze errors in the languages, manually
  2. Try to see if I can use them to formulate the best "features"?
  3. Manually collect some very typical errors ("we'll meet us" for German, for example) and also use them as features with a bigger weight?
    1. Bonus points if I make it expandable as a framework -- manually adding such features should be easy
  4. See if automatic error detection could be used, and the type of errors also as feature?
  5. Get sources
    1. Scrape English language learning forums from the respective countries
    2. Ask some English teachers & Auslandsamt for (anonymized?) examples of students' essays?
    3. I think quite a lot of such things are needed
  6. ML, CL, classifier magic
  7. See what happens
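Step 3 and its bonus point could be sketched roughly like this: a small registry where each hand-collected error is a regex with a weight, so adding a new typical error is one `register(...)` call. Everything here (`ErrorPattern`, `register`, `extract_features`, the regex, the weight of 3.0) is hypothetical and only meant to show the shape of such a framework; the only pattern included is the "we'll meet us" example from the list.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class ErrorPattern:
    name: str            # feature name used in the output dict
    l1: str              # native language this error is assumed to hint at
    regex: str           # pattern matching the typical error
    weight: float = 1.0  # manual boost for very typical errors

PATTERNS = []

def register(name, l1, regex, weight=1.0):
    """Add one hand-written error pattern to the registry."""
    PATTERNS.append(ErrorPattern(name, l1, regex, weight))

# "we'll meet us": only matches when the subject is "we",
# since "they will meet us" is correct English.
register("reflexive_meet", "German",
         r"\bwe(?:'ll| will| shall)? meet us\b", weight=3.0)

def extract_features(text):
    """Return {feature name: weighted hit count} for all patterns that match."""
    features = {}
    for pattern in PATTERNS:
        hits = sum(1 for _ in re.finditer(pattern.regex, text, flags=re.IGNORECASE))
        if hits:
            features[pattern.name] = hits * pattern.weight
    return features

print(extract_features("Ok, we'll meet us at six."))
print(extract_features("They will meet us at six."))
```

The resulting dict could be merged with automatically extracted features before training; how much the manual weight should count is something to find out experimentally.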

Possible issues:

  1. "Bag of words" and similar would be very biased towards my specific sources. If I scrape German English from a language learning forum and Russian English from students' essays about nature, any text about nature will be classified as Russian and any text about learning languages as German
  2. Age, topics, etc etc etc?
  3. Not enough source material for a corpus
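One possible way to reduce the topic bias from issue 1, sketched under my own assumptions (this is not from any of the sources above): mask every word that is not a function word, so that features can only see grammatical structure and not the topic. The function-word list below is a tiny hand-picked placeholder, and the whitespace tokenizer would leave punctuation attached to words.

```python
# Hypothetical, deliberately short function-word list; a real one would be much longer.
FUNCTION_WORDS = {
    "the", "a", "an", "of", "in", "on", "at", "to", "for", "with",
    "and", "or", "but", "is", "are", "was", "were", "be", "not", "will",
    "i", "we", "you", "he", "she", "it", "they", "us", "me",
}

def mask_content_words(text, placeholder="X"):
    """Keep function words, replace every other token with a placeholder."""
    return [t if t in FUNCTION_WORDS else placeholder for t in text.lower().split()]

nature = mask_content_words("The forest is full of birds")
language = mask_content_words("The grammar is full of exceptions")
print(nature)
print(language)
```

Both sentences collapse to the same masked sequence, so a classifier trained on such input cannot simply learn "nature" versus "language learning". The price is that lexical errors (wrong word choice, borrowings) become invisible too.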

Random

Natural Language Annotation for Machine Learning ebook, seems to cover quite a lot

downloads and demos -- datasets for CL lying detection -- generally interesting

Word embeddings in classification: a very interesting method; another variant

Classification-as-a-service with free examples. Gender, MBTI, etc etc etc, pretty nice