
Attention Is All You Need

A graduate-level walkthrough that follows the structure of the original paper, covering core equations, hyperparameters, and empirical results.

Paper Facts
Authors
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
Venue
NeurIPS (NIPS) 2017 / arXiv 1706.03762
Model
Transformer: encoder-decoder with multi-head self-attention.
Claim
An attention-only architecture outperforms RNN- and CNN-based translation models while permitting far more parallel computation.

Abstract (technical)

The Transformer removes recurrence and convolution entirely, relying on self-attention for sequence modeling. This yields stronger translation quality and large speedups due to parallelizable computation.

Introduction & background

Prior sequence-to-sequence models used RNNs or CNNs to encode the input and generate the output. Recurrent models compute hidden states sequentially, which limits parallelization within a training example, and in both families the path between distant positions grows with their distance, making long-range dependencies harder to learn.

The paper proposes that attention alone can model global dependencies with shorter paths, enabling faster training while preserving or improving accuracy.

Model architecture

The Transformer keeps the encoder-decoder framework. Each encoder layer stacks two sublayers, multi-head self-attention and a position-wise feed-forward network; each decoder layer inserts a third sublayer that attends over the encoder output. A residual connection wraps every sublayer, followed by layer normalization (a minimal sketch follows the parameter list below).

LayerNorm(x + Sublayer(x))
Depth
N = 6 encoder layers and 6 decoder layers (base model).
Dimensions
d_model = 512, d_ff = 2048.
Heads
h = 8, so d_k = d_v = 64 per head.
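
As a concrete illustration, here is a minimal NumPy sketch of the post-norm residual wrapper described above. The learned gain and bias of layer normalization and the dropout applied to the sublayer output are omitted for brevity, and the function names are illustrative rather than taken from any reference implementation.

import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position over the feature dimension; learned gain/bias omitted.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_sublayer(x, sublayer):
    # Post-norm residual wrapper: LayerNorm(x + Sublayer(x)); dropout omitted.
    return layer_norm(x + sublayer(x))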

Scaled dot-product attention

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

Dividing by sqrt(d_k) keeps the dot products from growing with d_k and pushing the softmax into regions with extremely small gradients. In the decoder, self-attention is masked so that each position attends only to earlier positions, preserving autoregressive generation.
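
The following NumPy sketch implements the formula above, with an optional mask argument for the decoder's causal masking; the function name and the -1e9 masking constant are illustrative choices, not taken from the paper's code.

import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q: (..., n_q, d_k); K: (..., n_k, d_k); V: (..., n_k, d_v).
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)    # (..., n_q, n_k)
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)    # block masked (e.g. future) positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the key dimension
    return weights @ V, weights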

Multi-head attention

Multiple heads attend in parallel to different subspaces; outputs are concatenated and projected back to d_model.
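
A sketch of multi-head attention, reusing the scaled_dot_product_attention function from the previous snippet. The projection layout here (single (d_model, d_model) matrices split into heads) is one common convention and is assumed for brevity rather than prescribed by the paper.

def multi_head_attention(x_q, x_kv, W_q, W_k, W_v, W_o, h=8, mask=None):
    # x_q: (n_q, d_model) query-side inputs; x_kv: (n_kv, d_model) key/value-side inputs.
    # W_q, W_k, W_v, W_o: (d_model, d_model) learned projections.
    d_model = x_q.shape[-1]
    d_k = d_model // h                                # 512 / 8 = 64 in the base model

    def split_heads(x):                               # (n, d_model) -> (h, n, d_k)
        return x.reshape(x.shape[0], h, d_k).transpose(1, 0, 2)

    Q, K, V = split_heads(x_q @ W_q), split_heads(x_kv @ W_k), split_heads(x_kv @ W_v)
    heads, _ = scaled_dot_product_attention(Q, K, V, mask)            # (h, n_q, d_k)
    concat = heads.transpose(1, 0, 2).reshape(x_q.shape[0], d_model)  # concatenate heads
    return concat @ W_o                               # project back to d_model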

Position-wise feed-forward network

Each token passes through the same two-layer MLP with ReLU: Linear(d_model -> d_ff) -> ReLU -> Linear(d_ff -> d_model).
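
A short NumPy sketch of the position-wise feed-forward network, following FFN(x) = max(0, x W1 + b1) W2 + b2 from the paper; the parameter names are illustrative.

import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # x: (seq_len, d_model); W1: (d_model, d_ff); W2: (d_ff, d_model); d_ff = 2048 in the base model.
    # The same weights are applied independently at every position.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2     # Linear -> ReLU -> Linear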

Embeddings & positional encoding

The model adds sinusoidal positional encodings to token embeddings, enabling order information without recurrence.
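
A NumPy sketch of the sinusoidal encoding, following PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). It assumes an even d_model and leaves out the sqrt(d_model) scaling the paper applies to the embeddings before the addition.

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]             # even feature indices
    angles = pos / np.power(10000.0, i / d_model)     # (max_len, d_model // 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe                                         # added to the token embeddings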

Why self-attention?

Self-attention has a constant path length between any two positions (1 hop) and can be computed in parallel. Compared with RNNs or CNNs, this offers faster training and better long-range dependency modeling.

Tradeoff: self-attention has O(n^2) time and memory in sequence length, motivating restricted attention for very long inputs.
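
To make the tradeoff concrete, a back-of-the-envelope comparison using the paper's per-layer complexity figures (constant factors ignored; the particular n and d below are chosen only for illustration):

# Per-layer operation counts from the paper's complexity table:
#   self-attention: O(n^2 * d)     recurrent: O(n * d^2)
n, d = 512, 512                          # illustrative sequence length vs. d_model
self_attention_ops = n ** 2 * d
recurrent_ops = n * d ** 2
print(self_attention_ops, recurrent_ops) # equal at n == d; self-attention is cheaper when n < d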

Training setup

Training data comes from WMT 2014 English-German (about 4.5M sentence pairs, shared byte-pair-encoding vocabulary of roughly 37k tokens) and WMT 2014 English-French (36M pairs, 32k word-piece vocabulary).

Batching
About 25k source + 25k target tokens per batch.
Hardware
8x NVIDIA P100 GPUs.
Schedule
Base: 100k steps (~12h). Big: 300k steps (~3.5 days).

Optimization uses Adam (beta1 = 0.9, beta2 = 0.98, epsilon = 1e-9). The learning rate increases linearly over the first 4,000 warmup steps, then decays proportionally to the inverse square root of the step number, with an overall scale of d_model^-0.5. Regularization includes residual dropout of 0.1 (base model) and label smoothing of 0.1.
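
A small Python sketch of that schedule, written directly from the paper's formula lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5); the function name is illustrative.

def transformer_lr(step, d_model=512, warmup_steps=4000):
    # lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    step = max(step, 1)                               # guard against step = 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Linear warmup to a peak of roughly 7e-4 at step 4,000, then inverse-square-root decay.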

Results

Transformer (big) reaches 28.4 BLEU on WMT14 EN-DE and 41.8 BLEU on EN-FR, surpassing previous systems while training in a few days.

EN-DE: 28.4 BLEU
EN-FR: 41.8 BLEU
Time: 3.5 days (8 P100)

The model also generalizes to English constituency parsing, indicating broader applicability beyond MT.

Conclusion & outlook

The Transformer demonstrates that attention-only architectures can match or surpass recurrent and convolutional models on sequence transduction. The authors point to restricted, local attention for handling very long sequences and to extending the model to modalities beyond text, such as images, audio, and video.

Resources

arXiv abstract: 1706.03762
Paper PDF
NeurIPS proceedings: NIPS 2017
Reference code: Tensor2Tensor

Citation

@inproceedings{vaswani2017attention,
  title={Attention Is All You Need},
  author={Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N. and Kaiser, Lukasz and Polosukhin, Illia},
  booktitle={Advances in Neural Information Processing Systems},
  year={2017}
}