AI Architecture

Transformer

Also known as: Transformer Architecture, Transformer Model

Deep learning architecture based on the attention mechanism; the foundation of virtually all modern LLMs and many generative models.

Updated: 2026-01-03

Definition

The Transformer is a neural network architecture introduced in 2017 by Vaswani et al. (Google) in “Attention Is All You Need”. Unlike recurrent architectures (RNN, LSTM), it processes entire sequences in parallel using self-attention mechanisms.

It’s the foundation of virtually all modern LLMs (GPT, Claude, Llama), embedding models, and many computer vision models (Vision Transformers).

Key Components

Self-Attention: mechanism allowing each position in the sequence to “attend” to every position, including itself, weighting their relative importance. Captures long-range dependencies that RNNs struggled with.

Multi-Head Attention: multiple attention “heads” in parallel, each learning different patterns. Typically 8-96 heads per layer.
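
A minimal NumPy sketch of both mechanisms together. The dimensions, head count, and random weights are illustrative assumptions, not values from any particular model; real implementations add masking, dropout, and batching.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); w_*: (d_model, d_model) projection matrices."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    # Project the inputs, then split the model dimension into heads.
    def split(w):  # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return (x @ w).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(w_q), split(w_k), split(w_v)

    # Every position scores every other position: shape (heads, seq, seq).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    context = weights @ v                # attention-weighted sum of values

    # Merge the heads back together and apply the output projection.
    merged = context.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ w_o

# Toy usage: 5 tokens, d_model = 8, 2 heads.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
w = [rng.normal(size=(8, 8)) * 0.1 for _ in range(4)]
out = multi_head_self_attention(x, *w, num_heads=2)
print(out.shape)  # (5, 8)
```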

Feed-Forward Networks: fully-connected layers applied independently to each position after attention.

Positional Encoding: signal added to embeddings to encode position in the sequence (attention by itself has no notion of token order).
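
The original paper’s fixed sinusoidal scheme fits in a few lines; this sketch assumes an even d_model. Many recent LLMs use learned or rotary position embeddings instead.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Even dimensions use sin, odd dimensions use cos, with
    geometrically spaced wavelengths from 2*pi up to 10000*2*pi."""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # the 2i values
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Added (not concatenated) to the token embeddings before the first layer.
pe = sinusoidal_positional_encoding(seq_len=128, d_model=64)
```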

Layer Normalization and Residual Connections: stabilization techniques enabling training of very deep networks.
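
How the pieces compose: a sketch of one layer in the pre-norm arrangement used by most modern LLMs (the original paper instead applied normalization after the residual addition). Here attention, ffn, norm1, and norm2 stand in for the components described above.

```python
def transformer_layer(x, attention, ffn, norm1, norm2):
    x = x + attention(norm1(x))  # residual connection around attention
    x = x + ffn(norm2(x))        # residual connection around feed-forward
    return x
```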

Architectural Variants

Encoder-only (BERT): processes entire sequence bidirectionally. Used for classification, NER, embeddings.

Decoder-only (GPT, Llama, Claude): autoregressive, generates tokens one at a time conditioned on previous ones. Dominates text generation.

Encoder-Decoder (T5, BART, original Transformer): encoder processes input, decoder generates output. Used for translation, summarization.
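
Mechanically, what makes the decoder-only variant autoregressive is a causal mask added to the attention scores before the softmax, so position i can attend only to positions j ≤ i. A minimal sketch:

```python
import numpy as np

def causal_mask(seq_len):
    """-inf above the diagonal zeroes out attention to future positions."""
    upper = np.triu(np.ones((seq_len, seq_len)), k=1)  # 1s strictly above diagonal
    return np.where(upper == 1, -np.inf, 0.0)

print(causal_mask(4))
# [[  0. -inf -inf -inf]
#  [  0.   0. -inf -inf]
#  [  0.   0.   0. -inf]
#  [  0.   0.   0.   0.]]
```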

Computational Complexity

Attention has O(n²) time and memory complexity in the sequence length n, since every token attends to every other. For context windows of 100K+ tokens, this becomes prohibitive.
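
Back-of-the-envelope arithmetic makes this concrete. The sketch assumes fp16 scores (2 bytes each) and counts only one attention matrix for a single head of a single layer:

```python
n = 100_000                    # sequence length
entries = n * n                # one score per (query, key) pair
print(f"{entries:,} entries")              # 10,000,000,000
print(f"{entries * 2 / 1e9:.0f} GB fp16")  # ~20 GB, per head, per layer
```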

Efficient variants attack this cost in different ways: FlashAttention computes exact attention with O(n) memory by tiling the computation to fit GPU on-chip memory (its time complexity remains O(n²), but it is much faster in practice), while sparse and linear attention approximations reduce time complexity to O(n log n) or O(n). Together they enable much longer contexts without a cost explosion.

Why It Dominated

Parallelization: unlike RNNs, all tokens are processed in parallel during training, which makes much better use of modern GPUs.

Scaling: performance improves predictably with more parameters, data, and compute (scaling laws).
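
As a sketch of what “predictably” means, Kaplan et al. (2020) fit loss as a power law in parameter count N; the constants below are their published empirical fits and hold only within the regime they measured:

```latex
% Kaplan et al. (2020), data- and compute-unconstrained regime:
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N},
\qquad \alpha_N \approx 0.076,\quad N_c \approx 8.8 \times 10^{13}
```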

Transfer learning: models pre-trained on web-scale data transfer their capabilities to downstream tasks with minimal fine-tuning.

Common Misconceptions

“Attention = understanding”

Attention is statistical weighting, not semantic comprehension. Attention patterns can be visualized but don’t always correspond to intuitive human interpretations.

“Transformers are only for NLP”

The architecture has extended to vision (ViT), audio (Whisper), proteins (AlphaFold 2), reinforcement learning, and multimodal domains.

“More layers = always better”

Beyond certain depths, gains are marginal and training/inference costs grow. Optimization balances width (hidden size), heads, and training data, not just depth.
