Definition
A foundation model is a large deep learning model pre-trained on vast amounts of unlabeled, web-scale data that can be adapted to a wide range of downstream tasks. The term, coined at the Stanford Institute for Human-Centered AI (HAI) in 2021, reflects the modern transfer learning paradigm, in which pre-training is the most computationally expensive phase.
Foundation models distinguish themselves from traditional models through:
- Massive scale: billions of parameters, trained on corpora of hundreds of billions of tokens
- Multimodal potential: a single model (e.g., GPT-4o) can process text, images, and audio
- Versatility: adaptable to very different tasks without architectural changes
- Emergent abilities: capabilities that appear only beyond certain scales (zero-shot learning, chain-of-thought reasoning)
Key Characteristics
Pre-training on generic data: the model is trained on broad datasets (Common Crawl, Wikipedia, books, code) without labels, with objectives like next-token prediction or masked language modeling.
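The next-token prediction objective can be sketched with a toy bigram "model" in plain Python; the corpus and count-based probabilities are purely illustrative stand-ins for a neural network:

```python
import math
from collections import Counter, defaultdict

# Toy illustration of next-token prediction. The "model" here is just
# bigram counts; a real foundation model learns these conditional
# probabilities with a neural network over web-scale text.
corpus = "the cat sat on the mat".split()

counts = defaultdict(Counter)                 # estimates p(next | current)
for cur, nxt in zip(corpus, corpus[1:]):
    counts[cur][nxt] += 1

def p_next(cur, nxt):
    total = sum(counts[cur].values())
    return counts[cur][nxt] / total

# Pre-training loss: average negative log-likelihood of each actual
# next token under the model's predicted distribution.
loss = sum(-math.log(p_next(c, n)) for c, n in zip(corpus, corpus[1:]))
loss /= len(corpus) - 1
print(f"mean next-token NLL: {loss:.4f}")
```

A real model minimizes the same loss by gradient descent, replacing the count table with billions of learned parameters.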
Transfer learning: the pre-trained model captures generic linguistic/visual patterns that transfer to downstream tasks.
Task flexibility: the same model can be fine-tuned, prompt-engineered, or used in-context for classification, generation, reasoning, and more without architectural modification.
Cost asymmetry: pre-training is expensive (millions of dollars in compute) but is amortized across all downstream uses; adapting to any single task is comparatively inexpensive.
Adaptation Paradigms
Fine-tuning: updating model parameters on task-specific data. Full fine-tuning modifies all parameters (expensive). Parameter-efficient fine-tuning (LoRA, QLoRA, prefix tuning) modifies less than 1% of parameters.
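The low-rank idea behind LoRA can be sketched with a quick parameter count; the layer dimensions and rank below are illustrative assumptions:

```python
# LoRA parameter-count sketch. For a frozen d x k weight matrix W, LoRA
# learns a low-rank update so the effective weight becomes
#     W_eff = W + (alpha / r) * B @ A,  with B: d x r, A: r x k, r << min(d, k).
# Only A and B are trained; W stays frozen. Sizes here are illustrative.
d, k, r = 4096, 4096, 8

frozen = d * k                # parameters in W (never updated)
trainable = d * r + r * k     # parameters in B and A (updated)
fraction = trainable / frozen
print(f"trainable: {trainable:,} of {frozen:,} ({fraction:.4%})")
```

With these sizes the adapter holds under 0.4% of the layer's parameters, which is how parameter-efficient methods stay well below 1% of the full model.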
Prompt Engineering: careful prompt formulation to extract model capabilities without updating parameters. Zero-shot, few-shot, and chain-of-thought are prompt engineering techniques.
In-Context Learning: the model learns from examples in the prompt (few-shot) without parameter updates. Emergent property at high scales.
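A minimal sketch of a few-shot prompt for a hypothetical sentiment-classification task; the model call itself is omitted, since the point is that adaptation happens entirely inside the prompt:

```python
# Few-shot in-context learning sketch: labeled examples go directly into
# the prompt; no parameters are updated. The task and reviews are
# hypothetical, and the actual model call is omitted.
examples = [
    ("I loved this movie!", "positive"),
    ("Terrible, a waste of time.", "negative"),
    ("Best purchase I've made all year.", "positive"),
]
query = "The plot dragged and the acting was flat."

prompt = "Classify the sentiment of each review.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"   # the model completes this line

print(prompt)
```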
Retrieval-Augmented Adaptation: the model accesses external knowledge bases through retrieval to augment its responses, a hybrid between static parameter updates (fine-tuning) and dynamic, query-time adaptation.
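The retrieval step can be sketched with bag-of-words cosine similarity; real systems use learned dense embeddings and a vector index, and all documents and names below are illustrative:

```python
import math
from collections import Counter

# Minimal retrieval-augmented sketch: documents and query are embedded
# as bag-of-words vectors, the closest document is retrieved, and it is
# prepended to the prompt as context. All text here is illustrative.
docs = [
    "LoRA adapts a model by training low-rank update matrices.",
    "RAG augments generation with documents fetched at query time.",
    "Next-token prediction is a common pre-training objective.",
]

def embed(text):
    cleaned = text.lower().replace("?", " ").replace(".", " ").replace(",", " ")
    return Counter(cleaned.split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b))

query = "How does RAG fetch documents?"
best = max(docs, key=lambda d: cosine(embed(query), embed(d)))

# The retrieved passage becomes context for the (omitted) model call.
prompt = f"Context: {best}\n\nQuestion: {query}\nAnswer:"
print(prompt)
```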
Major Foundation Models (2025)
Text (LLM):
- Closed-source: GPT-4/4o, Claude 3.5, Gemini 1.5
- Open-weights: Llama 3, Mistral, Qwen, Phi, DeepSeek
Vision:
- Closed-source: GPT-4o, Claude 3.5, Gemini 1.5
- Open-weights: Vision Transformer (ViT), LLaVA, Qwen-VL
Multimodal:
- Closed-source: GPT-4o, Claude 3.5, Gemini 1.5
- Open-weights: LLaVA, CogVLM
Code:
- Specialized models and products: CodeLlama, GitHub Copilot (a product built on foundation models), Claude 3.5
Practical Considerations
Model selection: depends on the task, latency requirements, cost budget, and privacy constraints. A frontier model can cost 10-100x more per token than a comparable open-weights model.
Licensing: open-weights models ship under varying licenses. Llama's community license, for example, restricts commercial use by entities above a monthly-active-user threshold (700 million MAU). Review licenses carefully before commercial deployment.
Continual learning: foundation models do not learn from post-deployment interactions without retraining. For evolving scenarios, RAG or periodic fine-tuning is necessary.
Bias mitigation: the model inherits biases from training data (e.g., gender bias in web-scale datasets). Mitigation through RLHF, fine-tuning on balanced data, or prompt engineering, but does not eliminate the problem completely.
Common Misconceptions
"A foundation model solves everything"
No. A generic model may perform poorly in highly specialized domains (medicine, law), where fine-tuning on in-domain data is critical.
"Once pre-trained, it costs nothing to adapt"
Fine-tuning has non-negligible computational costs (GPU time, storage), and every inference request carries a compute cost. At scale, the total cost of ownership (TCO) of adaptation becomes significant.
"The largest foundation model is always better"
It depends. For many tasks, fine-tuned 7B-13B models outperform 100B+ models on task-specific metrics, with significantly lower latency and cost.
Related Terms
- LLM: category of foundation models for natural language
- Fine-tuning: adaptation technique for foundation models
- Prompt Engineering: art of formulating prompts to extract foundation model capabilities
- Transformer: underlying architecture of modern foundation models
- Transfer Learning: general paradigm of which foundation models are an instance
Sources
- Bommasani, R. et al. (2021). On the Opportunities and Risks of Foundation Models. arXiv (comprehensive survey)
- Dosovitskiy, A. et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR
- Lester, B. et al. (2021). The Power of Scale for Parameter-Efficient Prompt Tuning. EMNLP
- Stanford Foundation Model Hub: tracking and evaluating foundation models