DNB and Typing

d4b 33% Sun 07 Apr 2019 04:24:36 PM CEST
d4b 33% Sun 07 Apr 2019 04:26:35 PM CEST
d4b 56% Sun 07 Apr 2019 04:28:28 PM CEST
d4b 61% Sun 07 Apr 2019 04:30:24 PM CEST
d4b 28% Sun 07 Apr 2019 04:32:21 PM CEST
d4b 44% Sun 07 Apr 2019 04:34:27 PM CEST
d4b 22% Sun 07 Apr 2019 04:36:19 PM CEST
d4b 39% Sun 07 Apr 2019 04:38:14 PM CEST

Quotes

“Wherever you are, make sure you’re there.” — Dan Sullivan

Diploma

Classifying by parts of speech

nltk.download() opens the NLTK downloader, which fetches everything needed. nltk.word_tokenize('aoethnsu') returns the tokens, and nltk.pos_tag(tokens) tags them with their parts of speech. From https://medium.com/@gianpaul.r/tokenization-and-parts-of-speech-pos-tagging-in-pythons-nltk-library-2d30f70af13b.

The tokenizer for Twitter works better for URLs (of course). Interestingly, URLs then get tagged as NN. And - this is actually fascinating - smileys get tokenized differently!

 ('morning', 'NN'),
 ('✋', 'NN'),
 ('🏻', 'NNP'),
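
One plausible explanation for the two tuples above: the hand with a skin tone is not a single character but two Unicode code points (the hand itself and a skin tone modifier), so a tokenizer that does not treat them as a unit can split them. A minimal sketch with the standard library only; the helper name describe_code_points is made up for illustration:

```python
import unicodedata


def describe_code_points(text):
    """Return (character, code point, Unicode name) for each code point in text."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
    ]


# The hand emoji followed by a light skin tone modifier, as in the output above.
hand = "\u270B\U0001F3FB"

for ch, code_point, name in describe_code_points(hand):
    print(code_point, name)
```

If the modifier should stay attached to its emoji, the tokens would have to be merged again after tokenizing, or the modifier kept as a separate feature.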

EDIT: nltk.tokenize.casual might be just like the Twitter tokenizer above, but better!

EDIT: I have a column with the POS of the tweets! How do I classify it with its varying length? How can I use the particular emojis as another feature?
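
One possible answer to both questions, as a rough sketch rather than a tested approach: count how often each tag occurs, which turns a tag sequence of any length into a vector of fixed length, and append counts for a chosen list of emojis. The example tweets, the emoji list and the function names below are all invented for illustration:

```python
from collections import Counter


def pos_count_vector(tags, tagset):
    """Map a variable-length tag sequence to a fixed-length vector of counts."""
    counts = Counter(tags)
    return [counts[tag] for tag in tagset]


def emoji_count_vector(tokens, emojis):
    """Count how often each emoji of interest occurs among the tokens."""
    counts = Counter(tokens)
    return [counts[emoji] for emoji in emojis]


# Hypothetical tagged tweets in the (token, tag) format returned by nltk.pos_tag.
tagged_tweets = [
    [("good", "JJ"), ("morning", "NN"), ("✋", "NN")],
    [("I", "PRP"), ("love", "VBP"), ("this", "DT"), ("✋", "NN"), ("✋", "NN")],
]

# Fixed column order: every tag seen in the data, then the emojis of interest.
tagset = sorted({tag for tweet in tagged_tweets for _, tag in tweet})
emojis = ["✋"]

features = [
    pos_count_vector([tag for _, tag in tweet], tagset)
    + emoji_count_vector([token for token, _ in tweet], emojis)
    for tweet in tagged_tweets
]

for row in features:
    print(row)
```

Counting throws away the order of the tags. If order turns out to matter, counting pairs of adjacent tags in the same way would be a natural next step.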

Ideas

POS + individual smileys might be enough for it to generalize! TODO: test. TODO: Maybe first do some much more basic feature engineering with capitalization and the other features mentioned here:

    Word Count of the documents – total number of words in the documents
    Character Count of the documents – total number of characters in the documents
    Average Word Density of the documents – average length of the words used in the documents
    Punctuation Count in the Complete Essay – total number of punctuation marks in the documents
    Upper Case Count in the Complete Essay – total number of upper case words in the documents
    Title Word Count in the Complete Essay – total number of proper case (title) words in the documents
    Frequency distribution of Part of Speech Tags:
        Noun Count
        Verb Count
        Adjective Count
        Adverb Count
        Pronoun Count
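
The features listed above could be sketched roughly like this, using only the standard library. The function names, the whitespace splitting and the grouping of Penn Treebank tags into noun/verb/adjective/adverb/pronoun are my own assumptions, not taken from the linked list:

```python
import string


def basic_text_features(text):
    """Compute simple surface features for one document (here: one tweet)."""
    words = text.split()  # assumption: naive whitespace splitting
    word_count = len(words)
    return {
        "word_count": word_count,
        "char_count": len(text),  # includes spaces
        "avg_word_length": (
            sum(len(w) for w in words) / word_count if word_count else 0.0
        ),
        "punctuation_count": sum(ch in string.punctuation for ch in text),
        "upper_case_word_count": sum(w.isupper() for w in words),
        "title_word_count": sum(w.istitle() for w in words),
    }


# Assumed grouping of Penn Treebank tag prefixes into coarse classes.
POS_FAMILIES = {
    "noun": ("NN",),
    "verb": ("VB",),
    "adjective": ("JJ",),
    "adverb": ("RB",),
    "pronoun": ("PRP", "WP"),
}


def pos_family_counts(tags):
    """Count tags per coarse class, given tags such as those from nltk.pos_tag."""
    return {
        family: sum(tag.startswith(prefixes) for tag in tags)
        for family, prefixes in POS_FAMILIES.items()
    }


print(basic_text_features("Good MORNING, world!"))
print(pos_family_counts(["NN", "NNP", "VBP", "PRP$", "RB", "JJ"]))
```

The tags are passed in already computed, so the tagging step itself stays out of this sketch.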

Resources

textminingonline.com has nice resources on the topic which would be very interesting to skim through! Additionally, flair is a very interesting library that would save me from reinventing the wheel, even though reinventing the wheel would be the entire point of a bachelor’s thesis.

This could work as a general high-level intro to NLP? Also this.