A clear, high-school-level walkthrough of the Transformer paper, following the same story arc as the original: motivation -> architecture -> training -> results.
The paper introduces the Transformer: a sequence-to-sequence model that uses attention only. It removes recurrence and convolution, so training can run in parallel and finish faster, while translation quality improves.
Machine translation turns a sentence in one language into a sentence in another. Earlier systems based on RNNs process tokens one after another, which makes training slow, and both RNNs and CNNs need many steps or layers to connect distant words.
Attention was already helpful, but it usually sat on top of a recurrent network. This paper asks: what if attention is the only building block?
The model keeps the classic encoder-decoder layout but swaps the guts. Both encoder and decoder are stacks of identical layers (six layers each in the base model).
Each token's vector is turned into three vectors: a Query (Q), a Key (K), and a Value (V). The model compares each Q with all the K's (a dot product, scaled down by the square root of the key size), turns those scores into weights with a softmax, and uses the weights to mix the V's.
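To make this concrete, here is a minimal NumPy sketch of the paper's scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. The toy shapes and random inputs are illustrative assumptions, not anything from the paper's code.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal sketch: weights = softmax(Q K^T / sqrt(d_k)), output = weights @ V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # compare each query with every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                         # mix the values by those weights

# Toy example: 4 tokens, 8-dimensional vectors.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)             # (4, 8)
```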
Instead of a single attention map, the Transformer uses multiple heads (8 in the base model). Each head can focus on a different pattern, such as syntax or long-range meaning.
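A rough sketch of how the heads can be computed by splitting the model dimension into h slices and attending in each slice independently. The weight names (Wq, Wk, Wv, Wo) and the toy sizes are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h=8):
    """Split the model dimension into h heads, attend per head, concatenate, project."""
    n, d_model = X.shape
    d_head = d_model // h
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                           # (n, d_model) each
    # Reshape to (h, n, d_head) so every head attends independently.
    Qh = Q.reshape(n, h, d_head).transpose(1, 0, 2)
    Kh = K.reshape(n, h, d_head).transpose(1, 0, 2)
    Vh = V.reshape(n, h, d_head).transpose(1, 0, 2)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)      # (h, n, n)
    heads = softmax(scores) @ Vh                               # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)      # glue the heads back together
    return concat @ Wo                                         # final output projection

n, d_model = 5, 512
rng = np.random.default_rng(1)
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo).shape)           # (5, 512)
```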
After attention, each position goes through the same two-layer MLP (a small neural network with a ReLU in the middle). This adds non-linearity and extra capacity.
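A sketch of this position-wise feed-forward network, FFN(x) = max(0, xW1 + b1)W2 + b2, using the base model's sizes (d_model = 512, inner dimension 2048); the random weights are placeholders.

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    """Same two-layer MLP applied to every position: FFN(x) = max(0, x W1 + b1) W2 + b2."""
    hidden = np.maximum(0, X @ W1 + b1)                # ReLU non-linearity
    return hidden @ W2 + b2

d_model, d_ff = 512, 2048                              # base-model sizes from the paper
rng = np.random.default_rng(2)
X = rng.normal(size=(5, d_model))                      # 5 token positions
W1 = rng.normal(size=(d_model, d_ff)) * 0.02; b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.02; b2 = np.zeros(d_model)
print(position_wise_ffn(X, W1, b1, W2, b2).shape)      # (5, 512)
```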
Because there is no recurrence, the model adds positional encodings to word embeddings. The paper uses sine/cosine waves to represent position, helping the model know word order.
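A sketch of the sinusoidal positional encodings from the paper: even dimensions use sine, odd dimensions use cosine, each at a different frequency, so every position gets a unique pattern.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1) position indices
    i = np.arange(0, d_model, 2)[None, :]              # even embedding dimensions
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                        # even dims get sine
    pe[:, 1::2] = np.cos(angle)                        # odd dims get cosine
    return pe

pe = positional_encoding(max_len=50, d_model=512)
# These vectors are simply added to the word embeddings before the first layer.
print(pe.shape)                                        # (50, 512)
```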
Self-attention connects every token to every other token in one hop, so long-distance relationships are easy to learn.
The authors train on WMT14 English-German (4.5M sentence pairs) and WMT14 English-French (36M pairs). Sentences are split into subword tokens using byte-pair encoding.
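As a rough illustration of the idea behind byte-pair encoding (a generic sketch of one merge step, not the exact tool the authors used): BPE repeatedly finds the most frequent adjacent pair of symbols in the corpus and merges it into a new subword unit.

```python
from collections import Counter

def most_frequent_pair(corpus):
    """One BPE step: count adjacent symbol pairs and pick the most frequent one to merge."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

# Toy corpus: words pre-split into characters, with an end-of-word marker "</w>".
corpus = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6}
print(most_frequent_pair(corpus))   # ('w', 'e') -- merged into a new symbol on this step
```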
Optimization uses Adam with a learning rate that warms up over the first 4,000 steps and then decays, plus dropout and label smoothing (both 0.1) for regularization.
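A sketch of the paper's learning-rate schedule, lrate = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5): it rises linearly for the first 4,000 steps, then decays with the inverse square root of the step number. The function name here is just for illustration.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate from the paper: d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    step = max(step, 1)                                # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Rises until step 4000, then slowly decays.
for s in (100, 4000, 100000):
    print(s, round(transformer_lr(s), 6))
```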
The Transformer (big) reaches 28.4 BLEU on English-German and 41.8 BLEU on English-French, beating prior systems while training faster.
The paper also applies the Transformer to English constituency parsing and shows it generalizes beyond translation.
Limitation: self-attention is O(n^2) in sequence length. The authors suggest exploring restricted attention for very long sequences and applying the model to other modalities.
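The paper only suggests this direction; as a hypothetical illustration, a restricted ("local") attention could mask out scores outside a small window around each position, so each token attends to a handful of neighbors instead of all n tokens.

```python
import numpy as np

def local_attention_mask(n, window=2):
    """Hypothetical restricted attention: each position may only attend to neighbors
    within `window` steps, shrinking the n x n cost toward n * window."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window   # boolean (n, n) mask

mask = local_attention_mask(n=6, window=2)
# Positions outside the band would get a score of -inf before the softmax,
# so their attention weight becomes zero.
print(mask.astype(int))
```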
@inproceedings{vaswani2017attention,
title={Attention Is All You Need},
author={Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N. and Kaiser, Lukasz and Polosukhin, Illia},
booktitle={Advances in Neural Information Processing Systems},
year={2017}
}