Difference between revisions of "BA/app"
Latest revision as of 14:31, 26 November 2017
Primary sources
Computational linguistics: CL intro
Genetic Algorithms: An introduction to genetic algorithms
Linguistics: Dissertation partly about interference. Has a nice error classification, error taxonomy, borrowing, transfer, etc. Seems like a nice intro to "What exists"
CL/ML resources
Text classification
Natural language classification with Python: Book, especially learning to classify text
With machine learning:
- with TensorFlow and neural networks generally: [1]
- Machine Learning, NLP: Text Classification using scikit-learn, python and NLTK
- Working with scikit and text data
Error Detection
- Error detection using local word bigrams and trigrams + some others
- Automatic error analysis of machine translation output -- more about possible errors and ways to classify them
Somewhat similar problems being solved
Cross-cultural Deception Detection. It uses unigrams + LIWC (which is more psychological and less relevant)
- Deception detection -- has examples of extracted features which I might use
- [2] -- lie detector
- Linguistic Cues to Deception Assessed by Computer Programs: A Meta-Analysis -- also ideas of possible features that might be interesting to look into.
Think about sentiment detection etc
Most of those things are solved via Bag of Words which won't be enough for me, I think
Linguistics
Typical errors
Russian
- Similar-sounding and semantically non-identical words + idioms
- Grammar. Articles, connecting verbs, future tenses, negative sentences, commas, etc. -- really nice.
German
- list of sentences
- also examples, hard to generalize
- examples, a bit better ones?
- German reflexive verbs list which could be used to see differences between English and German reflexive verbs.
Indian
- Wikipedia - Indian English. I think this could be done just statistically?
Italian
??? Try Italian or French, depending on what works better?
Attack plan
- Analyze errors in the languages, manually
- Try to see if I can use them to formulate the best "features"?
- Manually collect some very typical errors ("we'll meet us" for German, for example) and also use them as features with a bigger weight?
- Bonus points if I make it expandable as a framework -- manually adding such features should be easy
- See if automatic error detection could be used, and the type of errors also as feature?
- Get sources
- Scrape English language learning forums from the respective countries
- Ask some English teachers & Auslandsamt for (anonymized?) examples of students' essays?
- I think quite a lot of such things are needed
- ML, CL, classifier magic
- See what happens
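One way the "manually added features with a bigger weight" idea could look, as a minimal sketch only. The `ErrorFeature` class, the `FEATURES` registry, the regular expressions and the weights are all made up for illustration; the second pattern is just a toy stand-in for the Russian article errors mentioned under "Typical errors". Adding a new hand-collected error is one extra line in the list.

```python
import re
from dataclasses import dataclass


@dataclass
class ErrorFeature:
    name: str            # label used as the feature key
    pattern: str         # regular expression that flags the error
    weight: float = 1.0  # hand-picked; very typical errors get a bigger weight


# Hypothetical registry: patterns and weights are illustrative assumptions.
FEATURES = [
    # German-style reflexive: "we'll meet us" (from "wir treffen uns")
    ErrorFeature("de_reflexive_meet", r"\bwe(?:'ll| will) meet us\b", weight=3.0),
    # Toy example of a dropped article, e.g. "it is good idea"
    ErrorFeature("ru_missing_article",
                 r"\b(?:is|was) (?:good|big|nice) (?:idea|problem)\b", weight=2.0),
]


def extract(text):
    """Return {feature name: weighted match count} for one text."""
    lowered = text.lower()
    return {
        f.name: f.weight * len(re.findall(f.pattern, lowered))
        for f in FEATURES
    }
```

Calling `extract("We'll meet us at the station.")` gives a non-zero value only for `de_reflexive_meet`; the resulting dictionary could then be fed to whatever classifier is chosen later, next to the automatically extracted features.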
Possible issues:
- "Bag of words" and similar would be very biased towards my specific sources. If I scrape German English from a language learning forum and Russian English from students' essays about nature, any text about nature will be classified as Russian and any text about learning languages as German
- Age, topics, etc etc etc?
- Not enough source material for a corpus
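The source bias can be reproduced with a toy example. This is only a sketch: the two "corpora" are invented one-liners, and the overlap-counting `classify` function is a deliberately naive stand-in for a real bag-of-words classifier.

```python
from collections import Counter


def bag(text):
    """Bag of words: lowercase whitespace tokens with their counts."""
    return Counter(text.lower().split())


# Invented toy corpora mirroring the scenario above:
# German English from a learning forum, Russian English from nature essays.
train = {
    "German": "how can i learn english grammar faster in this forum",
    "Russian": "the forest and the river are beautiful nature in spring",
}
profiles = {label: bag(text) for label, text in train.items()}


def classify(text):
    """Pick the label whose profile shares the most word occurrences."""
    words = bag(text)
    scores = {
        label: sum((words & profile).values())  # multiset intersection size
        for label, profile in profiles.items()
    }
    return max(scores, key=scores.get)
```

Here `classify("i love nature and the river")` returns "Russian" and `classify("which forum helps to learn grammar")` returns "German", although neither sentence contains a learner error: the decision is driven by topic words alone.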
Random
Natural Language Annotation for Machine Learning ebook, seems to cover quite a lot
downloads and demos -- datasets for CL lie detection -- generally interesting
Word embeddings in classification: a very interesting method; another variant
Classification-as-a-service with free examples. Gender, MBTI, etc etc etc, pretty nice