# Attention Is All You Need — Explained
A deep dive into the transformer architecture that revolutionized NLP, computer vision, and everything in between.
The transformer architecture, introduced in the 2017 paper *Attention Is All You Need*, fundamentally changed how we build sequence models. Let's break it down.
## The Core Idea
Traditional sequence models like RNNs and LSTMs process tokens one at a time. Transformers process the entire sequence in parallel using self-attention.
The self-attention mechanism computes a weighted sum of all positions in a sequence:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Where $Q$, $K$, $V$ are the query, key, and value matrices projected from the input embeddings, and $d_k$ is the dimension of the keys.
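This computation can be sketched directly in PyTorch in a few lines (a minimal illustration with made-up sizes, not the library's optimized kernel):

```python
import torch

def attention(q, k, v):
    # softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (T, T) similarity scores
    weights = torch.softmax(scores, dim=-1)        # each row sums to 1
    return weights @ v                             # weighted sum of values

T, d_k = 4, 8                 # toy sizes: 4 tokens, 8-dim projections
q = torch.randn(T, d_k)
k = torch.randn(T, d_k)
v = torch.randn(T, d_k)
out = attention(q, k, v)
print(out.shape)              # torch.Size([4, 8])
```

Note the scaling by `sqrt(d_k)`: without it, dot products grow with dimension and push the softmax into regions with vanishing gradients.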
## Multi-Head Attention
Instead of a single attention function, the transformer uses multi-head attention — running attention functions in parallel:
```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must divide evenly across heads"
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, C = x.shape  # batch, sequence length, d_model
        # Project, then split the model dimension into n_heads heads of size d_k
        q = self.W_q(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        k = self.W_k(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        v = self.W_v(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        # Scaled dot-product attention, computed independently per head
        attn = (q @ k.transpose(-2, -1)) / (self.d_k ** 0.5)
        attn = torch.softmax(attn, dim=-1)
        # Recombine the heads and apply the output projection
        out = (attn @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.W_o(out)
```

Each head learns to attend to different aspects of the input — position, syntax, semantics.
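PyTorch ships an equivalent building block, `torch.nn.MultiheadAttention`. A quick shape check with hypothetical sizes shows the defining property of attention layers: the output has the same shape as the input.

```python
import torch
import torch.nn as nn

# batch_first=True gives (batch, seq, embed) ordering
mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 64)   # batch of 2 sequences, 10 tokens each
out, weights = mha(x, x, x)  # self-attention: query = key = value = x
print(out.shape)             # torch.Size([2, 10, 64]) — same shape in, same shape out
print(weights.shape)         # torch.Size([2, 10, 10]) — averaged over heads by default
```

Because the shape is preserved, attention blocks can be stacked arbitrarily deep, which is exactly what the full transformer does.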
## Why It Matters
The transformer's parallelism enables training on massive datasets. This directly led to:
- BERT (2018) — bidirectional pre-training
- GPT series (2018–present) — autoregressive language modeling
- Vision Transformers (2020) — applying attention to image patches
The key insight is that attention is a general-purpose computation primitive. It's not specific to language — it's a way to route information dynamically.
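To make that concrete, the same scaled dot-product attention runs unchanged over image patches instead of word embeddings (a toy sketch; a real Vision Transformer adds a learned patch projection and position embeddings):

```python
import torch

# A toy 32x32 grayscale "image" cut into a 4x4 grid of 8x8 patches
img = torch.randn(32, 32)
patches = img.unfold(0, 8, 8).unfold(1, 8, 8).reshape(16, 64)  # 16 tokens of 64 dims

# Plain scaled dot-product attention over the patch tokens
scores = patches @ patches.T / 64 ** 0.5
weights = torch.softmax(scores, dim=-1)
out = weights @ patches
print(out.shape)  # torch.Size([16, 64]) — one updated vector per patch
```

Nothing in the mechanism knows these tokens came from pixels; it just routes information between them based on similarity.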
The field hasn't looked back since.