A graduate-level walkthrough that follows the original paper's structure, covering the core equations, hyperparameters, and empirical results.
The Transformer removes recurrence and convolution entirely, relying solely on self-attention for sequence modeling. This yields higher translation quality and much faster training, since the computation parallelizes across sequence positions.
Prior sequence-to-sequence models used RNNs or CNNs to encode sequences and decode outputs. These architectures force sequential computation along the time dimension and create long paths between distant positions, which hinders both parallelization and the learning of long-range dependencies.
The paper proposes that attention alone can model global dependencies with shorter paths, enabling faster training while preserving or improving accuracy.
The Transformer keeps the encoder-decoder framework. Each encoder layer has two sublayers, multi-head self-attention and a position-wise feed-forward network; each decoder layer adds a third sublayer that performs multi-head attention over the encoder output. Residual connections wrap every sublayer, followed by layer normalization.
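A minimal PyTorch sketch of one encoder layer under the post-norm residual arrangement described above; the class, argument names, and use of nn.MultiheadAttention are our choices, not the paper's reference code.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: self-attention + feed-forward, each wrapped in a
    residual connection followed by LayerNorm. Defaults follow the base
    model (d_model=512, h=8, d_ff=2048)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, pad_mask=None):
        # Sublayer 1: multi-head self-attention, residual, then LayerNorm.
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=pad_mask)
        x = self.norm1(x + self.drop(attn_out))
        # Sublayer 2: position-wise feed-forward, residual, then LayerNorm.
        x = self.norm2(x + self.drop(self.ffn(x)))
        return x
```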
Scaled dot-product attention computes softmax(QK^T / sqrt(d_k)) V; dividing by sqrt(d_k) keeps the dot products from growing large and pushing the softmax into regions with vanishing gradients. Decoder self-attention is masked so each position attends only to earlier positions, preserving autoregressive generation.
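A short sketch of scaled dot-product attention with an optional mask, assuming PyTorch tensors; the function name and the mask convention (True = position blocked) are ours.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.
    q, k, v: (..., seq_len, d_k); mask: True where attention is disallowed."""
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Masked positions get -inf so they receive zero weight after softmax;
        # the decoder uses a causal mask to block attention to future tokens.
        scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, v)

# Causal (look-ahead) mask for decoder self-attention over a length-n sequence.
n = 5
causal_mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
```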
Multiple heads (h = 8 in the base model, with d_k = d_v = d_model / h = 64) attend in parallel to different representation subspaces; their outputs are concatenated and projected back to d_model.
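A sketch of the multi-head mechanism: split the projected Q, K, V into h subspaces, attend in each head, concatenate, and apply the output projection. Dimensions follow the base model; the class and weight names are ours.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Project Q, K, V into h heads of size d_k = d_model / h, attend in each
    head in parallel, then concatenate and project back to d_model."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)  # output projection W^O

    def forward(self, q, k, v, mask=None):
        b, n, _ = q.shape
        # (batch, seq, d_model) -> (batch, heads, seq, d_k)
        def split(x):
            return x.view(b, -1, self.n_heads, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v))
        # Scaled dot-product attention in every head, as in the previous sketch.
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask, float("-inf"))
        heads = torch.matmul(torch.softmax(scores, dim=-1), v)
        # Concatenate heads: (batch, heads, seq, d_k) -> (batch, seq, d_model)
        concat = heads.transpose(1, 2).contiguous().view(b, n, -1)
        return self.w_o(concat)
```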
Each position passes through the same two-layer MLP with a ReLU in between: Linear(d_model -> d_ff) -> ReLU -> Linear(d_ff -> d_model), with d_model = 512 and d_ff = 2048 in the base model.
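The feed-forward sublayer is small enough to write directly; a sketch with the base-model dimensions, again using PyTorch layer names.

```python
import torch.nn as nn

# Position-wise feed-forward network (base model: d_model=512, d_ff=2048).
# Applied identically and independently at every position.
def feed_forward(d_model=512, d_ff=2048):
    return nn.Sequential(nn.Linear(d_model, d_ff),
                         nn.ReLU(),
                         nn.Linear(d_ff, d_model))
```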
The model adds sinusoidal positional encodings (sine and cosine functions of different frequencies) to the token embeddings, injecting order information without recurrence or convolution.
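A sketch of the sinusoidal encoding table following the paper's sin/cos formulation; the tensor layout (positions along rows) and function name are our choices.

```python
import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    Returns a (max_len, d_model) tensor added to the token embeddings."""
    pos = torch.arange(max_len).unsqueeze(1)                 # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe
```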
Self-attention connects any two positions with a constant path length (a single hop) and parallelizes across positions. Per layer it costs O(n^2 * d) versus O(n * d^2) for recurrence, which favors self-attention whenever the sequence length n is smaller than the representation dimension d; combined with the shorter dependency paths, this gives faster training and easier learning of long-range dependencies than RNNs or CNNs.
Training data comes from WMT14 English-German (about 4.5M sentence pairs, with a shared BPE vocabulary of roughly 37k tokens) and WMT14 English-French (36M sentence pairs, 32k word-piece vocabulary).
Optimization uses Adam (beta1=0.9, beta2=0.98, epsilon=1e-9). The learning rate increases linearly for the first 4,000 warmup steps and then decays proportional to the inverse square root of the step number: lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5). Regularization includes residual dropout P_drop=0.1 and label smoothing eps_ls=0.1.
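The warmup-then-decay schedule can be written as a single expression; a sketch, with the function name ours and the constants taken from the paper.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5):
    linear warmup for the first warmup_steps, then decay ~ step^-0.5."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

One way to use this in PyTorch is to set Adam's base learning rate to 1.0 and pass the function to torch.optim.lr_scheduler.LambdaLR, so the returned value is applied as a multiplier each step.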
Transformer (big) reaches 28.4 BLEU on WMT14 EN-DE and 41.8 BLEU on EN-FR, surpassing the best previously published models (including ensembles on EN-DE) while training for 3.5 days on eight P100 GPUs.
The model also generalizes to English constituency parsing, indicating broader applicability beyond MT.
The Transformer demonstrates that attention-only architectures can match or surpass recurrent and convolutional models on sequence transduction. Future work targets restricted, local attention for handling very long sequences and extending the model to modalities other than text, such as images, audio, and video.
@inproceedings{vaswani2017attention,
  title     = {Attention Is All You Need},
  author    = {Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N. and Kaiser, Lukasz and Polosukhin, Illia},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2017}
}