Definition
The Transformer is a neural network architecture introduced in 2017 by Vaswani et al. (Google) in “Attention Is All You Need”. Unlike recurrent architectures (RNN, LSTM), it processes entire sequences in parallel using self-attention mechanisms.
It’s the foundation of virtually all modern LLMs (GPT, Claude, Llama), embedding models, and many computer vision models (Vision Transformers).
Key Components
Self-Attention: mechanism allowing each position in the sequence to “attend” to all other positions, weighting their relative importance. Captures long-range dependencies that RNNs struggled with.
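In its scaled dot-product form, self-attention can be sketched in a few lines of NumPy. The identity query/key/value projections below are a simplification for illustration; a real layer learns separate weight matrices W_Q, W_K, W_V.

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention over a sequence x of shape (n, d).

    For clarity, the query/key/value projections are identities here;
    a real layer learns separate weight matrices W_Q, W_K, W_V.
    """
    n, d = x.shape
    q, k, v = x, x, x                       # identity projections (illustration only)
    scores = q @ k.T / np.sqrt(d)           # (n, n): every position scores every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v                      # each output is a weighted mix of all values

x = np.random.default_rng(0).normal(size=(4, 8))    # 4 tokens, dim 8
out = self_attention(x)
print(out.shape)  # (4, 8)
```

Note that every output position mixes information from the whole sequence in one step, which is exactly how long-range dependencies are captured without recurrence.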
Multi-Head Attention: multiple attention “heads” in parallel, each learning different patterns. Typically 8-96 heads per layer.
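Splitting the model dimension across heads is essentially a reshape; each head then attends over the same sequence in its own lower-dimensional subspace. Names and shapes here are illustrative:

```python
import numpy as np

def split_heads(x, n_heads):
    """Reshape (n, d) activations into (n_heads, n, d // n_heads):
    each head sees the full sequence but only a slice of the model dim."""
    n, d = x.shape
    assert d % n_heads == 0, "model dim must divide evenly across heads"
    return x.reshape(n, n_heads, d // n_heads).transpose(1, 0, 2)

x = np.zeros((10, 64))              # 10 tokens, model dim 64
heads = split_heads(x, n_heads=8)
print(heads.shape)                  # (8, 10, 8): 8 heads, each of dim 64/8
```

After attention runs per head, the outputs are concatenated back to (n, d) and passed through a learned output projection.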
Feed-Forward Networks: fully-connected layers applied independently to each position after attention.
Positional Encoding: signal added to embeddings to encode position in sequence (attention is inherently position-invariant).
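The original paper's sinusoidal encoding can be sketched directly from its formula (learned positional embeddings are a common alternative):

```python
import numpy as np

def sinusoidal_positions(n, d):
    """Sinusoidal positional encodings from "Attention Is All You Need":
    PE[pos, 2i]   = sin(pos / 10000^(2i/d))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d))
    """
    pos = np.arange(n)[:, None]              # (n, 1) positions
    i = np.arange(0, d, 2)[None, :]          # (1, d/2) frequency indices
    angles = pos / np.power(10000.0, i / d)  # (n, d/2)
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positions(n=50, d=16)
# added to token embeddings before the first layer: x = embed + pe
```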
Layer Normalization and Residual Connections: stabilization techniques enabling training of very deep networks.
Architectural Variants
Encoder-only (BERT): processes entire sequence bidirectionally. Used for classification, NER, embeddings.
Decoder-only (GPT, Llama, Claude): autoregressive, generates tokens one at a time conditioned on previous ones. Dominates text generation.
Encoder-Decoder (T5, BART, original Transformer): encoder processes input, decoder generates output. Used for translation, summarization.
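What makes the decoder-only variant autoregressive is a causal mask added to the attention scores before the softmax, so each position cannot see positions after it. A minimal sketch in NumPy:

```python
import numpy as np

def causal_mask(n):
    """(n, n) additive mask: 0 on and below the diagonal, -inf above it.
    Added to attention scores before softmax, the -inf entries get zero
    weight, so each token attends only to itself and earlier tokens."""
    return np.triu(np.full((n, n), -np.inf), k=1)

mask = causal_mask(4)
# row i has -inf in every column j > i, and 0 elsewhere
```

Encoder-only models skip this mask (bidirectional attention); encoder-decoder models apply it in the decoder only.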
Computational Complexity
Self-attention has O(n²) time and memory complexity in the sequence length n: every token scores every other token. For context windows of 100K+ tokens, this becomes prohibitive.
Efficient variants attack this cost in different ways: Flash Attention computes exact attention with O(n) memory by never materializing the full score matrix (compute remains O(n²), but much faster in practice), while approximations like Sparse Attention and Linear Attention reduce compute itself to O(n log n) or O(n), enabling longer contexts without a cost explosion.
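A back-of-envelope calculation shows where the quadratic term bites: the full n × n score matrix, per head per layer, quickly outgrows GPU memory if materialized.

```python
# Memory for one fp32 attention score matrix (per head, per layer):
for n in (1_000, 10_000, 100_000):
    scores = n * n                      # one score per token pair
    print(f"{n:>7} tokens -> {scores:,} scores, "
          f"{scores * 4 / 1e9:.1f} GB at fp32")
```

At 100K tokens the single matrix is 40 GB, which is precisely what memory-efficient kernels like Flash Attention avoid materializing.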
Why It Dominated
Parallelization: unlike RNNs, all tokens are processed in parallel during training. Better utilizes modern GPUs.
Scaling: performance improves predictably with more parameters, data, and compute (scaling laws).
Transfer learning: pre-trained models on web-scale data transfer capabilities to downstream tasks with minimal fine-tuning.
Common Misconceptions
“Attention = understanding”
Attention is statistical weighting, not semantic comprehension. Attention patterns can be visualized but don’t always correspond to intuitive human interpretations.
“Transformers are only for NLP”
The architecture has extended to vision (ViT), audio (Whisper), proteins (AlphaFold 2), reinforcement learning, and multimodal domains.
“More layers = always better”
Beyond certain depths, gains are marginal and training/inference costs grow. Optimization balances width (hidden size), heads, and training data, not just depth.
Related Terms
- Attention Mechanism: core component of Transformer
- LLM: main application of Transformer architecture
- Embeddings: vector representations processed by Transformer
- Tokenization: input preprocessing for Transformer
Sources
- Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS
- Dosovitskiy, A. et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR
- The Illustrated Transformer - Jay Alammar