How Transformers Work
The transformer is the architecture behind GPT, Claude, BERT, and nearly every other major language model. The interactive diagram below shows the full pipeline, from input tokens through stacked attention blocks to an output distribution. The sections below explain each component.
Each block in the diagram runs the same two operations: multi-head attention (letting tokens communicate) and a feed-forward network (processing each token independently). The purple Context strips are the actual attention outputs for each token: a weighted sum of the Value vectors, softmax(QK^T / sqrt(d_k)) V, where d_k is the dimension of each key vector. Those context vectors feed the feed-forward network token by token, and the last token's output flows to the language-model head to predict the next word.
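The weighted sum described above can be sketched for a single attention head in plain Python. This is a minimal illustration, not the visualizer's code: the function names and toy numbers are made up, and it omits the causal mask and the multi-head split that GPT-2 uses.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    Each argument is a list of per-token vectors (lists of floats).
    Returns one context vector per query token.
    """
    d_k = len(keys[0])
    dim_v = len(values[0])
    contexts = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Context vector = weighted sum of the value vectors.
        contexts.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
        )
    return contexts

# Toy example: 3 tokens with 2-dimensional vectors (made-up numbers).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
contexts = attention(Q, K, V)
```

Because the softmax weights are positive and sum to 1, each context vector is a blend of the value vectors and always lies between their smallest and largest entries.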
What is a Transformer?
A transformer is a neural network that processes sequences — sentences, code, audio frames — by letting every element attend directly to every other element. Unlike earlier RNNs that processed tokens one at a time, a transformer processes the entire sequence in parallel.
The architecture has three stages:
- 1. Embedding. Each input token is converted to a dense vector (768 or 1024 dimensions in small models such as BERT and GPT-2; larger models use several thousand). A positional encoding is added so the model knows each token's position.
- 2. Transformer blocks. The embeddings pass through N stacked blocks (12 for BERT-base, 96 for GPT-3). Each block runs two operations in sequence, with a residual connection and layer normalization around each:
  - First, multi-head attention. Every token attends to every other token simultaneously, weighted by relevance. In decoder models such as GPT, a causal mask limits each token to itself and earlier tokens.
  - Then, a feed-forward network. A small neural network is applied independently to each token to refine its representation.
- 3. Output head. For language modelling, a final linear layer followed by a softmax maps the last hidden state to a probability distribution over the vocabulary. The next token is sampled from this distribution.
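The first and last stages can be sketched in a few lines of plain Python. This is a toy under stated assumptions: the helper names, the 5-token vocabulary, and the random weights are hypothetical; the positional encoding is the fixed sinusoidal one from the original transformer paper (GPT-2 learns its position vectors instead); and the output head reuses the embedding table as its weight matrix (weight tying). Stage 2 is skipped entirely.

```python
import math
import random

def sinusoidal_position(pos, d_model):
    """Fixed sinusoidal position encoding: sine on even indices, cosine on odd."""
    vec = []
    for i in range(d_model):
        angle = pos / (10000 ** (2 * (i // 2) / d_model))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec

def embed(token_ids, table):
    """Stage 1: look up each token's vector and add its position encoding."""
    d_model = len(table[0])
    return [
        [e + p for e, p in zip(table[tok], sinusoidal_position(pos, d_model))]
        for pos, tok in enumerate(token_ids)
    ]

def output_head(hidden, unembed):
    """Stage 3: one logit per vocabulary entry, then softmax."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
vocab_size, d_model = 5, 4  # toy sizes, not a real model's
table = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(vocab_size)]

x = embed([2, 0, 3], table)
# Stage 2 (the stacked blocks) is omitted: the last token's vector goes straight to the head.
probs = output_head(x[-1], table)  # assumption: weight tying with the embedding table
next_token = rng.choices(range(vocab_size), weights=probs, k=1)[0]
```

Sampling with `rng.choices` draws the next token in proportion to its probability; always picking the largest probability instead would give greedy decoding.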
Why Stack Blocks?
Each block refines the token representations it receives. Early layers tend to capture surface-level patterns — punctuation, adjacent word relationships. Later layers encode higher-level structure: syntactic roles, coreference, semantic relationships.
This hierarchical processing — from surface to deep semantics — is why stacking many blocks is so powerful. Each block has its own set of learned attention weights, so different blocks can specialize in different types of relationships.
Residual Connections & Layer Norm
Each block in the visualizer above contains two sublayers: first multi-head attention, then the feed-forward network. Each is wrapped in the same two stabilizing operations: the sublayer's input is added directly to its output, then layer normalization rescales the result before it moves on. This add-then-normalize order is the post-norm layout of the original transformer; GPT-2 and most later models instead normalize the input before each sublayer (pre-norm).
Residual connection
Adding x to the sublayer's output creates a gradient highway: during backpropagation, part of the error signal flows straight through the addition node without passing through the sublayer's weights. This lets gradients reach the earliest layers without vanishing, and it is one of the main reasons transformers can stack 96 or more blocks while deep networks without skip connections are much harder to train.
Layer normalization
Normalizes each token's activation vector to zero mean and unit variance, then applies learned scale and shift parameters. Keeps values in a stable range as they pass through many layers, preventing activations from exploding or collapsing to zero.
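A minimal sketch of the add-then-normalize wrapper for a single token vector follows. The names `post_norm_sublayer` and `toy_ffn` are hypothetical, and `toy_ffn` is only a stand-in for a real attention or feed-forward sublayer.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b for v, g, b in zip(x, gamma, beta)]

def post_norm_sublayer(x, sublayer, gamma, beta):
    """Residual add followed by layer norm: LayerNorm(x + sublayer(x))."""
    out = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, out)], gamma, beta)

def toy_ffn(x):
    # Hypothetical stand-in for a real sublayer: double each value, then ReLU.
    return [max(0.0, 2.0 * v) for v in x]

x = [1.0, -2.0, 3.0, 0.5]
gamma = [1.0] * len(x)  # learned scale, shown at a typical initial value
beta = [0.0] * len(x)   # learned shift, shown at a typical initial value
y = post_norm_sublayer(x, toy_ffn, gamma, beta)
```

With `gamma` at 1 and `beta` at 0, `y` has mean close to 0 and variance close to 1 regardless of how large the sublayer's output was, which is the stabilizing effect described above.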
How Transformers Learn
All the weights (W_Q, W_K, W_V, W_O, the FFN matrices, the embeddings) start random and are learned by gradient descent on a self-supervised objective. No human-labeled data is required: the training signal comes from the text itself.
Decoder — next-token prediction
“The cat sat on the ___” Predict the next token from all previous tokens. Every position in every sentence is a training example, so one document yields roughly as many training signals as it has tokens. GPT, Claude, and Llama all use this objective.
Encoder — masked token prediction
“The cat [MASK] on the mat” Randomly mask 15% of tokens and predict them from full bidirectional context. This produces richer per-token representations, but the model can't generate text autoregressively. BERT uses this objective.
Both objectives scale with data: more text means more training examples with zero annotation cost. This is why pretraining on internet-scale corpora produces models that generalize to tasks the designers never explicitly targeted.
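Both objectives turn raw text into (input, target) pairs with no labeling step. The sketch below shows one hedged way to build those pairs, using whole words in place of real subword tokens. The function names are hypothetical, and the masking is simplified: BERT's actual recipe also replaces some selected tokens with random tokens or leaves them unchanged.

```python
import math
import random

def next_token_pairs(tokens):
    """Decoder objective: every prefix predicts the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def mask_tokens(tokens, rate=0.15, seed=0):
    """Encoder objective: hide a random subset of tokens and keep them as targets."""
    rng = random.Random(seed)
    n_mask = max(1, round(rate * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = ["[MASK]" if i in positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in positions}
    return masked, targets

def cross_entropy(probs, target_index):
    """Loss for one prediction: negative log-probability of the correct token."""
    return -math.log(probs[target_index])

sentence = ["The", "cat", "sat", "on", "the", "mat"]
pairs = next_token_pairs(sentence)       # 5 prefix/target pairs from 6 tokens
masked, targets = mask_tokens(sentence)  # 1 of 6 tokens hidden at a 15% rate
```

Training minimizes `cross_entropy` averaged over all such targets: the loss is 0 when the model puts probability 1 on the correct token and grows as that probability shrinks.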
What Changes With Scale
The architecture stays constant. What changes is size: more layers, wider dimensions, more attention heads, larger context windows. The same training objective applied to more compute and more data consistently produces better models.
| Model | Params | Layers | Heads | Context (tokens) |
|---|---|---|---|---|
| GPT-2 (small) | 117M | 12 | 12 | 1,024 |
| GPT-3 | 175B | 96 | 96 | 2,048 |
| Llama 3 8B | 8B | 32 | 32 | 8,192 |
| GPT-4 | ~1.8T | ~120 | — | 128k |
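The parameter counts in the table follow roughly from depth and width. A common rule of thumb is about 12 · layers · d_model² for the blocks, plus the embedding tables. The sketch below applies it under assumptions that are not in the table: model widths of 768 (GPT-2 small) and 12,288 (GPT-3), a 50,257-token vocabulary, a feed-forward hidden width of 4 · d_model, and biases and layer-norm parameters ignored.

```python
def approx_params(n_layers, d_model, vocab_size=0, context=0):
    """Rule-of-thumb parameter count for a GPT-style decoder.

    Per block: 4*d^2 for the attention projections (W_Q, W_K, W_V, W_O)
    plus 8*d^2 for a feed-forward network with hidden width 4*d.
    Biases and layer-norm parameters are ignored.
    """
    blocks = 12 * n_layers * d_model ** 2
    embeddings = (vocab_size + context) * d_model  # token + learned position tables
    return blocks + embeddings

# Widths and vocabulary size are assumptions taken from the published model configs.
gpt2_small = approx_params(12, 768, vocab_size=50257, context=1024)
gpt3 = approx_params(96, 12288, vocab_size=50257, context=2048)
```

The estimate lands near 124M for GPT-2 small, slightly above the 117M figure originally reported for that model, and near 175B for GPT-3. In the small model the embedding tables are about a third of the total; at GPT-3 scale they are well under 1%.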
[Figure: training data size in tokens for each model, shown as bars on a log₁₀ scale. Values marked * are estimates.]
GPT-4 architecture details are not publicly confirmed — figures are estimates from public research. Llama 3 uses grouped-query attention (GQA), which reduces the KV cache at the cost of some expressiveness.
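The KV-cache saving from grouped-query attention can be estimated with simple arithmetic. The sketch below assumes 16-bit values, a head dimension of 128, and 8 key/value heads for Llama 3 8B; those numbers come from the published model configuration rather than from the table above, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context, bytes_per_value=2):
    """Memory for the cached keys and values of one sequence.

    The leading 2 counts keys and values; bytes_per_value=2 assumes 16-bit floats.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context * bytes_per_value

GIB = 1024 ** 3

# Llama 3 8B shape: 32 layers, 32 query heads of dimension 128, 8 KV heads (GQA).
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context=8192)
# Same shape with one KV head per query head (standard multi-head attention).
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, context=8192)

print(gqa / GIB, mha / GIB)  # 1.0 4.0
```

Under these assumptions, sharing each key/value head across four query heads cuts the cache for a full 8,192-token context from 4 GiB to 1 GiB per sequence.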
Further reading
- Vaswani et al., Attention Is All You Need (2017). The original transformer paper.
- Jay Alammar, The Illustrated Transformer. An excellent visual walkthrough.
- Attention deep dive. Real BERT weights, QKV breakdown, multi-head analysis.
Continue learning
Return to the neural network visualizer or explore how attention works in detail.