AI Architecture

Attention (Machine Learning)

Also known as: Attention Mechanism, Attention Layer, Self-Attention, Scaled Dot-Product Attention

Mechanism that allows models to selectively weight and focus on relevant parts of the input; fundamental to the Transformer architecture.

Updated: 2026-01-04

Definition

The attention mechanism is a neural-network component that lets the model selectively weight the relative importance of different parts of the input when processing each position. Instead of treating the entire sequence uniformly, attention lets the model “focus” on relevant tokens or features by computing, for each position, a probability distribution over the positions it draws information from.

Formally, attention computes a weighted average of values based on the similarity between a query and keys through a scoring function.
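
In the scaled dot-product form described below, this is written as:

  Attention(Q, K, V) = softmax(Q · K^T / sqrt(d_k)) · V

where d_k is the dimensionality of the keys and the softmax is applied row-wise.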

How It Works

The attention mechanism follows the Query-Key-Value paradigm introduced by Vaswani et al. (2017):

Scaled Dot-Product Attention:

  1. Linear projections: the input is projected into three spaces, Query (Q), Key (K), and Value (V), through the learnable matrices W^Q, W^K, W^V
  2. Scoring: the similarity between each query and each key is computed as a dot product: score = Q · K^T
  3. Scaling: scores are divided by sqrt(d_k) so that their magnitude does not grow with the key dimension, which would push the softmax into saturated regions: score = score / sqrt(d_k)
  4. Softmax: the scaled scores are converted, row by row, into probabilities: attention_weights = softmax(score)
  5. Weighted output: the result is a weighted average of the values: output = attention_weights · V (see the sketch below)
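
The five steps above can be sketched directly in NumPy; the dimensions and weight matrices below are illustrative choices for the example, not values prescribed by any particular model:

  import numpy as np

  def scaled_dot_product_attention(x, W_q, W_k, W_v):
      # 1. Linear projections into query, key, and value spaces
      Q, K, V = x @ W_q, x @ W_k, x @ W_v
      d_k = K.shape[-1]
      # 2. Dot-product similarity between every query and every key
      scores = Q @ K.T
      # 3. Scale by sqrt(d_k)
      scores = scores / np.sqrt(d_k)
      # 4. Row-wise softmax turns each row of scores into a probability distribution
      scores = scores - scores.max(axis=-1, keepdims=True)
      weights = np.exp(scores)
      weights = weights / weights.sum(axis=-1, keepdims=True)
      # 5. Output is the attention-weighted average of the values
      return weights @ V

  # Illustrative sizes: 6 tokens, model width 16, head width 8
  rng = np.random.default_rng(0)
  x = rng.normal(size=(6, 16))
  W_q, W_k, W_v = (rng.normal(size=(16, 8)) for _ in range(3))
  out = scaled_dot_product_attention(x, W_q, W_k, W_v)   # shape (6, 8)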

This computation happens in parallel across multiple attention “heads” (multi-head attention), allowing the model to attend to different patterns simultaneously (e.g., one head may attend to syntactic dependencies, another to semantic relations).
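
A compact sketch of the head-splitting, again in NumPy: d_model is divided into n_heads sub-spaces, attention runs in each, and the heads are concatenated and mixed by an output projection. The function signature is illustrative, not a specific library's API.

  import numpy as np

  def softmax(z):
      z = z - z.max(axis=-1, keepdims=True)
      e = np.exp(z)
      return e / e.sum(axis=-1, keepdims=True)

  def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads):
      seq_len, d_model = x.shape
      d_head = d_model // n_heads           # d_model must be divisible by n_heads
      # Project, then reshape so each head gets its own (seq_len, d_head) slice
      def split(W):
          return (x @ W).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
      Q, K, V = split(W_q), split(W_k), split(W_v)
      scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # per-head (seq_len, seq_len)
      heads = softmax(scores) @ V                           # per-head weighted values
      # Concatenate the heads and mix them with the output projection W_o
      concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
      return concat @ W_o                                   # e.g. (10, 64) in -> (10, 64) out with 8 heads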

Variants and Architectures

Self-Attention: each position attends to the other positions of the same sequence (restricted to earlier positions when a causal mask is applied, as in decoder-only models). Used in Transformers to capture intra-sequence dependencies.

Cross-Attention: the model attends to a different sequence (e.g., the encoder output in an encoder-decoder architecture). Used in seq2seq models such as the original attention-based neural machine translation systems.
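
A minimal sketch of the difference, with illustrative shapes: the queries come from one sequence (e.g., the decoder state) and the keys/values from another (e.g., the encoder output).

  import numpy as np

  def softmax(z):
      z = z - z.max(axis=-1, keepdims=True)
      e = np.exp(z)
      return e / e.sum(axis=-1, keepdims=True)

  def cross_attention(decoder_x, encoder_out, W_q, W_k, W_v):
      # Queries from the decoder; keys and values from the encoder output
      Q = decoder_x @ W_q
      K = encoder_out @ W_k
      V = encoder_out @ W_v
      scores = Q @ K.T / np.sqrt(K.shape[-1])   # (dec_len, enc_len)
      return softmax(scores) @ V                # each decoder position mixes encoder values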

Multi-Head Attention: several heads run in parallel, each with its own linear projections. This enables learning complementary patterns in different subspaces (e.g., positional vs. semantic).

Sparse Attention: attention over a selected subset of positions instead of all of them. Reduces the O(n²) cost of full attention to roughly linear in sequence length (for a fixed window size), making long sequences tractable. Implementations: Longformer, BigBird.
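
As an illustration, a banded (sliding-window) pattern in the spirit of Longformer's local attention can be expressed as a mask applied before the softmax; real implementations add global tokens and use custom kernels rather than dense masks.

  import numpy as np

  def sliding_window_mask(seq_len, window):
      # Position i may attend only to positions j with |i - j| <= window
      idx = np.arange(seq_len)
      return np.abs(idx[:, None] - idx[None, :]) <= window

  mask = sliding_window_mask(seq_len=8, window=2)
  scores = np.random.default_rng(0).normal(size=(8, 8))
  scores = np.where(mask, scores, -np.inf)   # masked positions get zero weight after softmax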

Flash Attention: an exact-attention GPU kernel that computes attention block by block in fast on-chip memory, drastically reducing reads and writes to main GPU memory (HBM). It improves speed and memory use without changing the result.
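
Flash Attention itself is a fused GPU kernel, so its memory-hierarchy behavior cannot be reproduced in NumPy; the block-wise recurrence it relies on (online softmax) can still be sketched to show that processing keys in blocks yields the same result without materializing the full score matrix:

  import numpy as np

  def blockwise_attention_row(q, K, V, block=4):
      # One query row; keys/values are consumed in blocks via the online-softmax recurrence
      d_k = K.shape[-1]
      m, l = -np.inf, 0.0               # running max and running normalizer
      o = np.zeros(V.shape[-1])         # running (unnormalized) output
      for start in range(0, K.shape[0], block):
          s = q @ K[start:start + block].T / np.sqrt(d_k)
          m_new = max(m, s.max())
          scale = np.exp(m - m_new)     # rescale the accumulators to the new running max
          p = np.exp(s - m_new)
          o = o * scale + p @ V[start:start + block]
          l = l * scale + p.sum()
          m = m_new
      return o / l                      # equals softmax(q @ K.T / sqrt(d_k)) @ V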

Use Cases and Applications

Generative models: in autoregressive Transformers (GPT, Claude, Gemini), attention enables coherent long-text generation while maintaining long-range dependencies.
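
Autoregressive generation relies on a causal mask, so each position can attend only to earlier positions; a minimal sketch of such a mask:

  import numpy as np

  seq_len = 6
  causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))   # position i sees only j <= i
  scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))
  scores = np.where(causal, scores, -np.inf)                  # future tokens get zero attention weight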

Machine translation: cross-attention allows the decoder to “translate” effectively by focusing on the relevant words of the source text while generating each target word.

Question answering: attention identifies which part of the text contains the answer to a question, facilitating extraction.

Vision: in Vision Transformers, attention operating on image patches captures complex spatial relationships, achieving performance comparable to or exceeding that of CNNs.
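
A sketch of the patch step, assuming a 224×224 RGB image and 16×16 patches as in ViT-Base (the function name is illustrative):

  import numpy as np

  def image_to_patches(img, patch=16):
      # Split an (H, W, C) image into flattened, non-overlapping patches: (num_patches, patch*patch*C)
      H, W, C = img.shape
      rows, cols = H // patch, W // patch
      patches = img[:rows * patch, :cols * patch]
      patches = patches.reshape(rows, patch, cols, patch, C).transpose(0, 2, 1, 3, 4)
      return patches.reshape(rows * cols, patch * patch * C)

  tokens = image_to_patches(np.zeros((224, 224, 3)), patch=16)   # (196, 768): one "token" per patch

Each flattened patch is then linearly projected to the model dimension and treated by the attention layers exactly like a word embedding.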

Practical Considerations

Computational complexity: attention has O(n²) complexity in sequence length. For 32K token sequences, it requires significant memory and inference time. Optimizations like Flash Attention have become critical for deployment.
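
The quadratic term is easy to quantify: for a 32K context, the score matrix for a single head in a single layer, if materialized in fp16, is already on the order of 2 GB.

  seq_len = 32_768
  num_scores = seq_len * seq_len        # ≈ 1.07e9 attention scores per head, per layer
  bytes_fp16 = num_scores * 2           # ≈ 2.1 GB if the full matrix were materialized in fp16
  print(num_scores, bytes_fp16 / 1e9)

Kernels like Flash Attention avoid materializing this matrix, which is why they matter at these context lengths.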

Context window: modern models have context windows up to 1M tokens, but full attention over all positions remains a constraint. Strategies like sliding window attention or sparse patterns are necessary for scaling.

Training vs. inference: during training, attention has access to the complete sequence, so the computation is parallelizable across positions. During inference, tokens are generated one at a time, and each new token must attend over the entire history, making token-by-token generation the bottleneck.

KV-cache: to speed up inference, models keep a cache of the keys and values already computed for previous tokens, so only the new token's projections are computed at each step. This increases memory usage but drastically reduces latency.
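
A minimal sketch of the idea for a single head in NumPy; the weight matrices and the per-step interface are illustrative, not a specific library's API:

  import numpy as np

  def softmax(z):
      z = z - z.max(axis=-1, keepdims=True)
      e = np.exp(z)
      return e / e.sum(axis=-1, keepdims=True)

  def decode_step(x_new, cache_K, cache_V, W_q, W_k, W_v):
      # Project only the newest token; reuse cached keys/values for the rest of the history
      # (cache_K and cache_V start as empty arrays of shape (0, d_head))
      q = x_new @ W_q
      cache_K = np.vstack([cache_K, x_new @ W_k])   # append instead of recomputing
      cache_V = np.vstack([cache_V, x_new @ W_v])
      scores = q @ cache_K.T / np.sqrt(cache_K.shape[-1])
      out = softmax(scores) @ cache_V               # attends over the whole past via the cache
      return out, cache_K, cache_V

The cache grows linearly with the generated length, which is the memory cost the paragraph above refers to.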

Common Misconceptions

“Attention allows the model to ‘understand’ relationships”

Attention is a statistical-algebraic mechanism, not cognitive. It computes weights based on geometric similarity in embedding space, not on “understanding”. The model may attend to spurious correlations.

“Attending to distant tokens is always better”

Not necessarily. Relative position still matters (Transformers have only a weak positional inductive bias), distant tokens often carry less relevant signal, and attending to more positions can introduce noise. The optimal context length depends on the task.

“Self-attention is always preferable”

Not always. For well-defined tasks with clear relational schemas (e.g., parsing), structured mechanisms (graph neural networks) can be more efficient and interpretable.

Related Concepts

  • Transformer: architecture that uses attention as its central component
  • LLM: models built on stacks of attention layers
  • Embeddings: queries, keys, and values are projections of embeddings
  • Fine-tuning: adaptation of attention parameters to new tasks

Sources