# Attention Is All You Need — Explained
A deep dive into the transformer architecture that revolutionized NLP, computer vision, and everything in between.
The transformer architecture, introduced in the 2017 paper *Attention Is All You Need*, fundamentally changed how we build sequence models. Let's break it down.
## The Core Idea
Traditional sequence models like RNNs and LSTMs process tokens one at a time. Transformers process the entire sequence in parallel using self-attention.
The self-attention mechanism computes a weighted sum of all positions in a sequence:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Where $Q$, $K$, $V$ are the query, key, and value matrices projected from the input embeddings, and $d_k$ is the dimension of the keys.
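This computation can be sketched directly in PyTorch in a few lines (a minimal illustration with made-up sizes, not the library's optimized kernel):

```python
import torch

def attention(q, k, v):
    # softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (T, T) similarity scores
    weights = torch.softmax(scores, dim=-1)        # each row sums to 1
    return weights @ v                             # weighted sum of values

T, d_k = 4, 8                 # toy sizes: 4 tokens, 8-dim projections
q = torch.randn(T, d_k)
k = torch.randn(T, d_k)
v = torch.randn(T, d_k)
out = attention(q, k, v)
print(out.shape)              # torch.Size([4, 8])
```

Note the scaling by `sqrt(d_k)`: without it, dot products grow with dimension and push the softmax into regions with vanishing gradients.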
## Multi-Head Attention
Instead of a single attention function, the transformer uses multi-head attention — running attention functions in parallel:
```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must divide evenly across heads"
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, C = x.shape  # batch, sequence length, d_model
        # Project, then split the model dimension into n_heads heads of size d_k
        q = self.W_q(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        k = self.W_k(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        v = self.W_v(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        # Scaled dot-product attention, computed independently per head
        attn = (q @ k.transpose(-2, -1)) / (self.d_k ** 0.5)
        attn = torch.softmax(attn, dim=-1)
        # Recombine the heads and apply the output projection
        out = (attn @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.W_o(out)
```

Each head learns to attend to different aspects of the input — position, syntax, semantics.
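PyTorch ships an equivalent building block, `torch.nn.MultiheadAttention`. A quick shape check with hypothetical sizes shows the defining property of attention layers: the output has the same shape as the input.

```python
import torch
import torch.nn as nn

# batch_first=True gives (batch, seq, embed) ordering
mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 64)   # batch of 2 sequences, 10 tokens each
out, weights = mha(x, x, x)  # self-attention: query = key = value = x
print(out.shape)             # torch.Size([2, 10, 64]) — same shape in, same shape out
print(weights.shape)         # torch.Size([2, 10, 10]) — averaged over heads by default
```

Because the shape is preserved, attention blocks can be stacked arbitrarily deep, which is exactly what the full transformer does.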
## Why It Matters
The transformer's parallelism enables training on massive datasets. This directly led to:
- BERT (2018) — bidirectional pre-training
- GPT series (2018–present) — autoregressive language modeling
- Vision Transformers (2020) — applying attention to image patches
The key insight is that attention is a general-purpose computation primitive. It's not specific to language — it's a way to route information dynamically.
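To make that concrete, the same scaled dot-product attention runs unchanged over image patches instead of word embeddings (a toy sketch; a real Vision Transformer adds a learned patch projection and position embeddings):

```python
import torch

# A toy 32x32 grayscale "image" cut into a 4x4 grid of 8x8 patches
img = torch.randn(32, 32)
patches = img.unfold(0, 8, 8).unfold(1, 8, 8).reshape(16, 64)  # 16 tokens of 64 dims

# Plain scaled dot-product attention over the patch tokens
scores = patches @ patches.T / 64 ** 0.5
weights = torch.softmax(scores, dim=-1)
out = weights @ patches
print(out.shape)  # torch.Size([16, 64]) — one updated vector per patch
```

Nothing in the mechanism knows these tokens came from pixels; it just routes information between them based on similarity.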
The field hasn't looked back since.