
Attention Is All You Need

A graduate-level walkthrough that follows the structure of the original paper, covering core equations, hyperparameters, and empirical results.

Paper Facts
Authors
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
Venue
NeurIPS (NIPS) 2017 / arXiv 1706.03762
Model
Transformer: encoder-decoder with multi-head self-attention.
Claim
An attention-only architecture outperforms RNN- and CNN-based translation models while permitting far more parallel computation.

Abstract (technical)

The Transformer removes recurrence and convolution entirely, relying on self-attention for sequence modeling. This yields stronger translation quality and large speedups due to parallelizable computation.

Introduction & background

Prior sequence-to-sequence models used RNNs or CNNs to encode the input and generate the output. Recurrent models compute hidden states sequentially, which limits parallelization within a training example, and in both families the path between distant positions grows with their distance, making long-range dependencies harder to learn.

The paper proposes that attention alone can model global dependencies with shorter paths, enabling faster training while preserving or improving accuracy.

Model architecture

The Transformer keeps the encoder-decoder framework. Each encoder layer stacks two sublayers, multi-head self-attention and a position-wise feed-forward network; each decoder layer inserts a third sublayer that attends over the encoder output. A residual connection wraps every sublayer, followed by layer normalization (a minimal sketch follows the parameter list below).

LayerNorm(x + Sublayer(x))
Depth
N = 6 encoder layers and 6 decoder layers (base model).
Dimensions
d_model = 512, d_ff = 2048.
Heads
h = 8, so d_k = d_v = 64 per head.
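
As a concrete illustration, here is a minimal NumPy sketch of the post-norm residual wrapper described above. The learned gain and bias of layer normalization and the dropout applied to the sublayer output are omitted for brevity, and the function names are illustrative rather than taken from any reference implementation.

import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position over the feature dimension; learned gain/bias omitted.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_sublayer(x, sublayer):
    # Post-norm residual wrapper: LayerNorm(x + Sublayer(x)); dropout omitted.
    return layer_norm(x + sublayer(x))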

Scaled dot-product attention

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

Dividing by sqrt(d_k) keeps the dot products from growing with d_k and pushing the softmax into regions with extremely small gradients. In the decoder, self-attention is masked so that each position attends only to earlier positions, preserving autoregressive generation.
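
The following NumPy sketch implements the formula above, with an optional mask argument for the decoder's causal masking; the function name and the -1e9 masking constant are illustrative choices, not taken from the paper's code.

import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q: (..., n_q, d_k); K: (..., n_k, d_k); V: (..., n_k, d_v).
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)    # (..., n_q, n_k)
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)    # block masked (e.g. future) positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the key dimension
    return weights @ V, weights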

Multi-head attention

Multiple heads attend in parallel to different subspaces; outputs are concatenated and projected back to d_model.
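
A sketch of multi-head attention, reusing the scaled_dot_product_attention function from the previous snippet. The projection layout here (single (d_model, d_model) matrices split into heads) is one common convention and is assumed for brevity rather than prescribed by the paper.

def multi_head_attention(x_q, x_kv, W_q, W_k, W_v, W_o, h=8, mask=None):
    # x_q: (n_q, d_model) query-side inputs; x_kv: (n_kv, d_model) key/value-side inputs.
    # W_q, W_k, W_v, W_o: (d_model, d_model) learned projections.
    d_model = x_q.shape[-1]
    d_k = d_model // h                                # 512 / 8 = 64 in the base model

    def split_heads(x):                               # (n, d_model) -> (h, n, d_k)
        return x.reshape(x.shape[0], h, d_k).transpose(1, 0, 2)

    Q, K, V = split_heads(x_q @ W_q), split_heads(x_kv @ W_k), split_heads(x_kv @ W_v)
    heads, _ = scaled_dot_product_attention(Q, K, V, mask)            # (h, n_q, d_k)
    concat = heads.transpose(1, 0, 2).reshape(x_q.shape[0], d_model)  # concatenate heads
    return concat @ W_o                               # project back to d_model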

Position-wise feed-forward network

Each token passes through the same two-layer MLP with ReLU: Linear(d_model -> d_ff) -> ReLU -> Linear(d_ff -> d_model).
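
A short NumPy sketch of the position-wise feed-forward network, following FFN(x) = max(0, x W1 + b1) W2 + b2 from the paper; the parameter names are illustrative.

import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # x: (seq_len, d_model); W1: (d_model, d_ff); W2: (d_ff, d_model); d_ff = 2048 in the base model.
    # The same weights are applied independently at every position.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2     # Linear -> ReLU -> Linear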

Embeddings & positional encoding

The model adds sinusoidal positional encodings to token embeddings, enabling order information without recurrence.
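
A NumPy sketch of the sinusoidal encoding, following PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). It assumes an even d_model and leaves out the sqrt(d_model) scaling the paper applies to the embeddings before the addition.

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]             # even feature indices
    angles = pos / np.power(10000.0, i / d_model)     # (max_len, d_model // 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe                                         # added to the token embeddings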

Why self-attention?

Self-attention has a constant path length between any two positions (1 hop) and can be computed in parallel. Compared with RNNs or CNNs, this offers faster training and better long-range dependency modeling.

Tradeoff: self-attention has O(n^2) time and memory in sequence length, motivating restricted attention for very long inputs.
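
To make the tradeoff concrete, a back-of-the-envelope comparison using the paper's per-layer complexity figures (constant factors ignored; the particular n and d below are chosen only for illustration):

# Per-layer operation counts from the paper's complexity table:
#   self-attention: O(n^2 * d)     recurrent: O(n * d^2)
n, d = 512, 512                          # illustrative sequence length vs. d_model
self_attention_ops = n ** 2 * d
recurrent_ops = n * d ** 2
print(self_attention_ops, recurrent_ops) # equal at n == d; self-attention is cheaper when n < d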

Training setup

Training data comes from WMT 2014 English-German (about 4.5M sentence pairs, shared byte-pair-encoding vocabulary of roughly 37k tokens) and WMT 2014 English-French (36M pairs, 32k word-piece vocabulary).

Batching
About 25k source + 25k target tokens per batch.
Hardware
8x NVIDIA P100 GPUs.
Schedule
Base: 100k steps (~12h). Big: 300k steps (~3.5 days).

Optimization uses Adam (beta1 = 0.9, beta2 = 0.98, epsilon = 1e-9). The learning rate increases linearly over the first 4,000 warmup steps, then decays proportionally to the inverse square root of the step number, with an overall scale of d_model^-0.5. Regularization includes residual dropout of 0.1 (base model) and label smoothing of 0.1.
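
A small Python sketch of that schedule, written directly from the paper's formula lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5); the function name is illustrative.

def transformer_lr(step, d_model=512, warmup_steps=4000):
    # lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    step = max(step, 1)                               # guard against step = 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Linear warmup to a peak of roughly 7e-4 at step 4,000, then inverse-square-root decay.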

Results

Transformer (big) reaches 28.4 BLEU on WMT14 EN-DE and 41.8 BLEU on EN-FR, surpassing previous systems while training in a few days.

EN-DE: 28.4 BLEU
EN-FR: 41.8 BLEU
Time: 3.5 days (8 P100)

The model also generalizes to English constituency parsing, indicating broader applicability beyond MT.

Conclusion & outlook

The Transformer demonstrates that attention-only architectures can match or surpass recurrent and convolutional models on sequence transduction. The authors point to restricted, local attention for handling very long sequences and to extending the model to modalities beyond text, such as images, audio, and video.

Resources

arXiv abstract: 1706.03762
Paper PDF
NeurIPS proceedings: NIPS 2017
Reference code: Tensor2Tensor

Citation

@inproceedings{vaswani2017attention,
  title={Attention Is All You Need},
  author={Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N. and Kaiser, Lukasz and Polosukhin, Illia},
  booktitle={Advances in Neural Information Processing Systems},
  year={2017}
}