
Attention Is All You Need

A clear, high-school-level walkthrough of the Transformer paper, following the same story arc as the original: motivation -> architecture -> training -> results.

Paper Facts
Authors: Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin
Venue: NeurIPS (NIPS) 2017 / arXiv 1706.03762
Problem: Machine translation and sequence-to-sequence modeling.
Core claim: Pure attention can outperform RNN/CNN translation while training faster.

Abstract (plain version)

The paper introduces the Transformer: a sequence-to-sequence model that uses attention only. It removes recurrence and convolution, so training can run in parallel and finish faster, while translation quality improves.

Big result: strong BLEU scores on WMT14 English-German and English-French with much shorter training time.

Introduction & background

Machine translation turns a sentence in one language into a sentence in another. Earlier systems built on recurrent networks (RNNs) process tokens one after another, which makes training slow, and both RNNs and convolutional models (CNNs) need many steps to connect words that are far apart.

Attention was already helpful, but it usually sat on top of a recurrent network. This paper asks: what if attention is the only building block?

Goal: Remove sequential computation so training can parallelize.
Idea: Let every token directly look at every other token using self-attention.
Outcome: Faster training + better translation quality.

Transformer architecture

The model keeps the classic encoder-decoder layout but swaps the guts. Both encoder and decoder are stacks of identical layers (six layers each in the base model).

Encoder layer: self-attention + a small feed-forward network.
Decoder layer: masked self-attention + encoder-decoder attention + feed-forward.
Stability: residual connections + layer normalization around every sub-layer (sketched below).
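
To make the residual "Add & Norm" structure concrete, here is a minimal NumPy sketch of one encoder layer. It is not the authors' code: self_attn and ffn stand for the two sub-layers described in the next sections, and the layer normalization is simplified (the real layer has learned scale and shift parameters).

import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(x, self_attn, ffn):
    # Each sub-layer is wrapped as LayerNorm(x + Sublayer(x)):
    # a residual connection followed by layer normalization.
    x = layer_norm(x + self_attn(x))
    x = layer_norm(x + ffn(x))
    return x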

Scaled dot-product attention

Each token creates three vectors: Query (Q), Key (K), and Value (V). The model compares Q with all K's, turns those scores into weights, and uses them to mix the V's.

Dividing the scores by sqrt(d_k) keeps them in a stable range so the softmax doesn't saturate.
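
As a rough sketch (not the paper's reference implementation), the whole operation fits in a few lines of NumPy; Q, K, and V are matrices with one row per token.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d_k)               # compare each query with every key
    scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax turns scores into weights
    return weights @ V                              # weighted mix of the values

In the decoder's masked self-attention, scores for later positions are set to a very large negative number before the softmax, so a token cannot peek at words that come after it.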

Multi-head attention

Instead of a single attention map, the Transformer uses multiple heads (8 in the base model). Each head can focus on a different pattern, such as syntax or long-range meaning.
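
Continuing the NumPy sketch above (and simplifying: the real model applies learned projection matrices to Q, K, and V for each head, plus a final output projection W_O), multi-head attention splits the vectors into pieces, attends within each piece, and concatenates the results. In the base model there are 8 heads of size 64, with d_model = 512.

def multi_head_attention(Q, K, V, num_heads=8):
    # Split the model dimension into num_heads subspaces, run attention
    # in each subspace, then concatenate the outputs back together.
    d_model = Q.shape[-1]
    d_k = d_model // num_heads
    heads = []
    for h in range(num_heads):
        cols = slice(h * d_k, (h + 1) * d_k)
        heads.append(scaled_dot_product_attention(Q[:, cols], K[:, cols], V[:, cols]))
    return np.concatenate(heads, axis=-1)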

Position-wise feed-forward

After attention, each position goes through the same two-layer MLP (a small neural network). This adds non-linearity and extra capacity.
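
Continuing the same sketch, the block computes FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every position independently with shared weights. In the base model the input/output size is 512 and the hidden layer has 2048 units; here the weights W1, b1, W2, b2 are simply passed in as arguments.

def feed_forward(x, W1, b1, W2, b2):
    # Two linear layers with a ReLU in between, applied position-wise.
    hidden = np.maximum(0.0, x @ W1 + b1)  # ReLU
    return hidden @ W2 + b2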

Embeddings & positional encoding

Because there is no recurrence, the model adds positional encodings to word embeddings. The paper uses sine/cosine waves to represent position, helping the model know word order.
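
A small NumPy sketch of the sinusoidal encoding (each pair of dimensions uses a sine and a cosine of a different wavelength; d_model is assumed to be even):

import numpy as np

def positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    positions = np.arange(seq_len)[:, None]           # 0, 1, 2, ...
    dims = np.arange(0, d_model, 2)[None, :]          # 0, 2, 4, ...
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

The resulting matrix is simply added to the token embeddings before the first layer.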

Why self-attention?

Self-attention connects every token to every other token in one hop, so long-distance relationships are easy to learn. It also has no step-by-step dependence within a layer, so every position can be processed in parallel.

Training setup

The authors train on WMT14 English-German (4.5M sentence pairs) and WMT14 English-French (36M pairs). Sentences are split into subword tokens using byte-pair encoding.

Batching: about 25k source + 25k target tokens per batch.
Hardware: 8x NVIDIA P100 GPUs.
Speed: base model, 100k steps in about 12 hours; big model, 300k steps in about 3.5 days.

Optimization uses Adam with a learning-rate warmup over the first 4,000 steps followed by a gradual decay, plus dropout and label smoothing (both 0.1) for regularization.
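
The schedule itself is a one-line formula from the paper; here is a small sketch of it, using the base settings d_model = 512 and warmup_steps = 4000.

def learning_rate(step, d_model=512, warmup_steps=4000):
    # Linear warmup for the first warmup_steps, then decay
    # proportional to 1 / sqrt(step).
    step = max(step, 1)
    return (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)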

Results

The Transformer (big) reaches 28.4 BLEU on English-German and 41.8 BLEU on English-French, beating previously published models (including ensembles on English-German) at a fraction of their training cost.

EN-DE: 28.4 BLEU
EN-FR: 41.8 BLEU
Training: 3.5 days on 8 GPUs

Generalization & outlook

The paper also applies the Transformer to English constituency parsing and shows it generalizes beyond translation.

Limitation: self-attention compares every pair of tokens, so its cost grows as O(n^2) with sequence length. The authors suggest exploring restricted attention for very long sequences and applying the model to other modalities.

Resources

arXiv abstract: 1706.03762
Paper PDF: available via arXiv
NeurIPS proceedings: NIPS 2017
Reference code: Tensor2Tensor

Citation

@inproceedings{vaswani2017attention,
  title={Attention Is All You Need},
  author={Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N. and Kaiser, Lukasz and Polosukhin, Illia},
  booktitle={Advances in Neural Information Processing Systems},
  year={2017}
}