A clear, high-school-level walkthrough of the Transformer paper, following the same story arc as the original: motivation -> architecture -> training -> results.
The paper introduces the Transformer: a sequence-to-sequence model that uses attention only. It removes recurrence and convolution, so training can run in parallel and finish faster, while translation quality improves.
Machine translation turns a sentence in one language into a sentence in another. Earlier systems based on RNNs process tokens one after another, which makes training slow, and both RNNs and CNNs need many steps or layers to connect distant words.
Attention was already helpful, but it usually sat on top of a recurrent network. This paper asks: what if attention is the only building block?
The model keeps the classic encoder-decoder layout but swaps the guts. Both encoder and decoder are stacks of identical layers (six layers each in the base model).
Each token's vector is turned into three vectors: a Query (Q), a Key (K), and a Value (V). The model compares each Q with all the K's (a dot product, scaled down by the square root of the key size), turns those scores into weights with a softmax, and uses the weights to mix the V's.
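To make this concrete, here is a minimal NumPy sketch of the paper's scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. The toy shapes and random inputs are illustrative assumptions, not anything from the paper's code.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal sketch: weights = softmax(Q K^T / sqrt(d_k)), output = weights @ V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # compare each query with every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                         # mix the values by those weights

# Toy example: 4 tokens, 8-dimensional vectors.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)             # (4, 8)
```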
Instead of a single attention map, the Transformer uses multiple heads (8 in the base model). Each head can focus on a different pattern, such as syntax or long-range meaning.
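A rough sketch of how the heads can be computed by splitting the model dimension into h slices and attending in each slice independently. The weight names (Wq, Wk, Wv, Wo) and the toy sizes are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h=8):
    """Split the model dimension into h heads, attend per head, concatenate, project."""
    n, d_model = X.shape
    d_head = d_model // h
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                           # (n, d_model) each
    # Reshape to (h, n, d_head) so every head attends independently.
    Qh = Q.reshape(n, h, d_head).transpose(1, 0, 2)
    Kh = K.reshape(n, h, d_head).transpose(1, 0, 2)
    Vh = V.reshape(n, h, d_head).transpose(1, 0, 2)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)      # (h, n, n)
    heads = softmax(scores) @ Vh                               # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)      # glue the heads back together
    return concat @ Wo                                         # final output projection

n, d_model = 5, 512
rng = np.random.default_rng(1)
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo).shape)           # (5, 512)
```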
After attention, each position goes through the same two-layer MLP (a small neural network with a ReLU in the middle). This adds non-linearity and extra capacity.
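A sketch of this position-wise feed-forward network, FFN(x) = max(0, xW1 + b1)W2 + b2, using the base model's sizes (d_model = 512, inner dimension 2048); the random weights are placeholders.

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    """Same two-layer MLP applied to every position: FFN(x) = max(0, x W1 + b1) W2 + b2."""
    hidden = np.maximum(0, X @ W1 + b1)                # ReLU non-linearity
    return hidden @ W2 + b2

d_model, d_ff = 512, 2048                              # base-model sizes from the paper
rng = np.random.default_rng(2)
X = rng.normal(size=(5, d_model))                      # 5 token positions
W1 = rng.normal(size=(d_model, d_ff)) * 0.02; b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.02; b2 = np.zeros(d_model)
print(position_wise_ffn(X, W1, b1, W2, b2).shape)      # (5, 512)
```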
Because there is no recurrence, the model adds positional encodings to word embeddings. The paper uses sine/cosine waves to represent position, helping the model know word order.
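A sketch of the sinusoidal positional encodings from the paper: even dimensions use sine, odd dimensions use cosine, each at a different frequency, so every position gets a unique pattern.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1) position indices
    i = np.arange(0, d_model, 2)[None, :]              # even embedding dimensions
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                        # even dims get sine
    pe[:, 1::2] = np.cos(angle)                        # odd dims get cosine
    return pe

pe = positional_encoding(max_len=50, d_model=512)
# These vectors are simply added to the word embeddings before the first layer.
print(pe.shape)                                        # (50, 512)
```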
Self-attention connects every token to every other token in one hop, so long-distance relationships are easy to learn.
The authors train on WMT14 English-German (4.5M sentence pairs) and WMT14 English-French (36M pairs). Sentences are split into subword tokens using byte-pair encoding.
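As a rough illustration of the idea behind byte-pair encoding (a generic sketch of one merge step, not the exact tool the authors used): BPE repeatedly finds the most frequent adjacent pair of symbols in the corpus and merges it into a new subword unit.

```python
from collections import Counter

def most_frequent_pair(corpus):
    """One BPE step: count adjacent symbol pairs and pick the most frequent one to merge."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

# Toy corpus: words pre-split into characters, with an end-of-word marker "</w>".
corpus = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6}
print(most_frequent_pair(corpus))   # ('w', 'e') -- merged into a new symbol on this step
```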
Optimization uses Adam with a learning rate that warms up over the first 4,000 steps and then decays, plus dropout and label smoothing (both 0.1) for regularization.
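A sketch of the paper's learning-rate schedule, lrate = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5): it rises linearly for the first 4,000 steps, then decays with the inverse square root of the step number. The function name here is just for illustration.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate from the paper: d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    step = max(step, 1)                                # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Rises until step 4000, then slowly decays.
for s in (100, 4000, 100000):
    print(s, round(transformer_lr(s), 6))
```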
The Transformer (big) reaches 28.4 BLEU on English-German and 41.8 BLEU on English-French, beating prior systems while training faster.
The paper also applies the Transformer to English constituency parsing and shows it generalizes beyond translation.
Limitation: self-attention is O(n^2) in sequence length. The authors suggest exploring restricted attention for very long sequences and applying the model to other modalities.
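The paper only suggests this direction; as a hypothetical illustration, a restricted ("local") attention could mask out scores outside a small window around each position, so each token attends to a handful of neighbors instead of all n tokens.

```python
import numpy as np

def local_attention_mask(n, window=2):
    """Hypothetical restricted attention: each position may only attend to neighbors
    within `window` steps, shrinking the n x n cost toward n * window."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window   # boolean (n, n) mask

mask = local_attention_mask(n=6, window=2)
# Positions outside the band would get a score of -inf before the softmax,
# so their attention weight becomes zero.
print(mask.astype(int))
```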
@inproceedings{vaswani2017attention,
title={Attention Is All You Need},
author={Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N. and Kaiser, Lukasz and Polosukhin, Illia},
booktitle={Advances in Neural Information Processing Systems},
year={2017}
}