Most of these notes were taken while reading the “Attention Is All You Need” paper. The most important resources are The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time and 9.3. Transformer — Dive into Deep Learning 0.7 documentation.


  • BLEU is a metric for how good a machine translation is.
  • Gentle Introduction to Transduction in Machine Learning

    Induction: deriving the function from the given data. Deduction: deriving the values of the given function for points of interest. Transduction: deriving the values of the unknown function for points of interest from the given data.

  • Positional encoding in the Transformer is very well described at 9.3. Transformer — Dive into Deep Learning 0.7 documentation, with a visualization. It’s needed because the architecture itself has no notion of word order. We can’t simply number words n = 1..10, because sentences have different lengths and word 3 out of 10 is not the same as word 3 out of 3.
    • “The intuition here is that adding these values to the embeddings provides meaningful distances between the embedding vectors once they’re projected into Q/K/V vectors and during dot-product attention”
  • Subword algorithms represent words with units bigger than characters but smaller than a whole word, for example prefixes and suffixes, to better handle unseen words. Byte-pair and word-piece encodings are used by the Transformer.[^swa]
  • In essence, label smoothing helps the model train around mislabeled data and consequently improves its robustness and performance.
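A minimal sketch of the sinusoidal positional encoding described above, in plain Python (the function name is mine; the formula is the paper’s PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))):

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from "Attention Is All You Need"."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # Each pair of dimensions shares one wavelength; wavelengths
            # grow geometrically from 2*pi up to 10000*2*pi across dimensions
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

enc = positional_encoding(seq_len=10, d_model=4)
print(enc[0])  # position 0 is alternating sin(0)/cos(0): [0.0, 1.0, 0.0, 1.0]
```

Because the values are periodic functions of the position rather than raw indices, the same encoding works for sentences of any length, which is exactly why plain n = 1..10 numbering wouldn’t.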

[^swa]: 3 subword algorithms help to improve your NLP model performance
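Byte-pair encoding boils down to repeatedly merging the most frequent adjacent pair of symbols, starting from characters. A toy sketch of one merge step (helper names and the corpus are mine, not from any library):

```python
from collections import Counter

def most_frequent_pair(vocab):
    """vocab maps a word, as a tuple of symbols, to its corpus frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(vocab, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy character-level corpus with word counts
vocab = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
         ("n", "e", "w", "e", "s", "t"): 6}
pair = most_frequent_pair(vocab)  # ("w", "e") occurs 2 + 6 = 8 times
vocab = merge_pair(vocab, pair)   # "we" is now a single subword symbol
```

After enough merges, an unseen word like “lowest” can be covered by learned subwords instead of becoming an out-of-vocabulary token.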

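Label smoothing as used in the paper (ε_ls = 0.1) mixes the one-hot target with a uniform distribution over all classes. A sketch, assuming the uniform-over-K formulation:

```python
def smooth_labels(one_hot, epsilon=0.1):
    """q'(k) = (1 - epsilon) * q(k) + epsilon / K: the true class gets
    slightly less than 1, every wrong class a small positive mass, so the
    model is never pushed toward fully confident (possibly wrong) targets."""
    k = len(one_hot)
    return [(1 - epsilon) * y + epsilon / k for y in one_hot]

smoothed = smooth_labels([0, 0, 1, 0])
# true class becomes ~0.925, each wrong class ~0.025; mass still sums to 1
```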


  • I should make a better Bash timer that counts down to the end of the hour, so I don’t have to do this in my head
  • I should make a vim keybinding or script that automagically creates Markdown references. (I’d be surprised if this hasn’t been done)