Definition
The Transformer is a neural network architecture introduced in 2017 by Vaswani et al. (Google) in “Attention Is All You Need”. Unlike recurrent architectures (RNN, LSTM), it processes entire sequences in parallel using self-attention mechanisms.
It’s the foundation of virtually all modern LLMs (GPT, Claude, Llama), embedding models, and many computer vision models (Vision Transformers).
Key Components
Self-Attention: mechanism allowing each position in the sequence to “attend” to all other positions, weighting their relative importance. Captures long-range dependencies that RNNs struggled with.
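In its scaled dot-product form, self-attention can be sketched in a few lines of NumPy. The identity query/key/value projections below are a simplification for illustration; a real layer learns separate weight matrices W_Q, W_K, W_V.

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention over a sequence x of shape (n, d).

    For clarity, the query/key/value projections are identities here;
    a real layer learns separate weight matrices W_Q, W_K, W_V.
    """
    n, d = x.shape
    q, k, v = x, x, x                       # identity projections (illustration only)
    scores = q @ k.T / np.sqrt(d)           # (n, n): every position scores every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v                      # each output is a weighted mix of all values

x = np.random.default_rng(0).normal(size=(4, 8))    # 4 tokens, dim 8
out = self_attention(x)
print(out.shape)  # (4, 8)
```

Note that every output position mixes information from the whole sequence in one step, which is exactly how long-range dependencies are captured without recurrence.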
Multi-Head Attention: multiple attention “heads” in parallel, each learning different patterns. Typically 8-96 heads per layer.
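Splitting the model dimension across heads is essentially a reshape; each head then attends over the same sequence in its own lower-dimensional subspace. Names and shapes here are illustrative:

```python
import numpy as np

def split_heads(x, n_heads):
    """Reshape (n, d) activations into (n_heads, n, d // n_heads):
    each head sees the full sequence but only a slice of the model dim."""
    n, d = x.shape
    assert d % n_heads == 0, "model dim must divide evenly across heads"
    return x.reshape(n, n_heads, d // n_heads).transpose(1, 0, 2)

x = np.zeros((10, 64))              # 10 tokens, model dim 64
heads = split_heads(x, n_heads=8)
print(heads.shape)                  # (8, 10, 8): 8 heads, each of dim 64/8
```

After attention runs per head, the outputs are concatenated back to (n, d) and passed through a learned output projection.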
Feed-Forward Networks: fully-connected layers applied independently to each position after attention.
Positional Encoding: signal added to embeddings to encode position in sequence (attention is inherently position-invariant).
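The original paper's sinusoidal encoding can be sketched directly from its formula (learned positional embeddings are a common alternative):

```python
import numpy as np

def sinusoidal_positions(n, d):
    """Sinusoidal positional encodings from "Attention Is All You Need":
    PE[pos, 2i]   = sin(pos / 10000^(2i/d))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d))
    """
    pos = np.arange(n)[:, None]              # (n, 1) positions
    i = np.arange(0, d, 2)[None, :]          # (1, d/2) frequency indices
    angles = pos / np.power(10000.0, i / d)  # (n, d/2)
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positions(n=50, d=16)
# added to token embeddings before the first layer: x = embed + pe
```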
Layer Normalization and Residual Connections: stabilization techniques enabling training of very deep networks.
Architectural Variants
Encoder-only (BERT): processes entire sequence bidirectionally. Used for classification, NER, embeddings.
Decoder-only (GPT, Llama, Claude): autoregressive, generates tokens one at a time conditioned on previous ones. Dominates text generation.
Encoder-Decoder (T5, BART, original Transformer): encoder processes input, decoder generates output. Used for translation, summarization.
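What makes the decoder-only variant autoregressive is a causal mask added to the attention scores before the softmax, so each position cannot see positions after it. A minimal sketch in NumPy:

```python
import numpy as np

def causal_mask(n):
    """(n, n) additive mask: 0 on and below the diagonal, -inf above it.
    Added to attention scores before softmax, the -inf entries get zero
    weight, so each token attends only to itself and earlier tokens."""
    return np.triu(np.full((n, n), -np.inf), k=1)

mask = causal_mask(4)
# row i has -inf in every column j > i, and 0 elsewhere
```

Encoder-only models skip this mask (bidirectional attention); encoder-decoder models apply it in the decoder only.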
Computational Complexity
Self-attention has O(n²) time and memory complexity in the sequence length n: every token scores every other token. For context windows of 100K+ tokens, this becomes prohibitive.
Efficient variants attack this cost in different ways: Flash Attention computes exact attention with O(n) memory by never materializing the full score matrix (compute remains O(n²), but much faster in practice), while approximations like Sparse Attention and Linear Attention reduce compute itself to O(n log n) or O(n), enabling longer contexts without a cost explosion.
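A back-of-envelope calculation shows where the quadratic term bites: the full n × n score matrix, per head per layer, quickly outgrows GPU memory if materialized.

```python
# Memory for one fp32 attention score matrix (per head, per layer):
for n in (1_000, 10_000, 100_000):
    scores = n * n                      # one score per token pair
    print(f"{n:>7} tokens -> {scores:,} scores, "
          f"{scores * 4 / 1e9:.1f} GB at fp32")
```

At 100K tokens the single matrix is 40 GB, which is precisely what memory-efficient kernels like Flash Attention avoid materializing.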
Why It Dominated
Parallelization: unlike RNNs, all tokens are processed in parallel during training. Better utilizes modern GPUs.
Scaling: performance improves predictably with more parameters, data, and compute (scaling laws).
Transfer learning: pre-trained models on web-scale data transfer capabilities to downstream tasks with minimal fine-tuning.
Common Misconceptions
“Attention = understanding”
Attention is statistical weighting, not semantic comprehension. Attention patterns can be visualized but don’t always correspond to intuitive human interpretations.
“Transformers are only for NLP”
The architecture has extended to vision (ViT), audio (Whisper), proteins (AlphaFold 2), reinforcement learning, and multimodal domains.
“More layers = always better”
Beyond certain depths, gains are marginal and training/inference costs grow. Optimization balances width (hidden size), heads, and training data, not just depth.
Related Terms
- Attention Mechanism: core component of Transformer
- LLM: main application of Transformer architecture
- Embeddings: vector representations processed by Transformer
- Tokenization: input preprocessing for Transformer
Sources
- Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS
- Dosovitskiy, A. et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR
- The Illustrated Transformer - Jay Alammar