Definition
A foundation model is a large deep learning model pre-trained on vast amounts of unlabeled, web-scale data that can be adapted to a wide range of downstream tasks. The term, coined at the Stanford Institute for Human-Centered AI (HAI) in 2021, reflects the modern transfer learning paradigm, in which pre-training is the most computationally expensive phase.
Foundation models distinguish themselves from traditional models through:
- Massive scale: billions of parameters, trained on corpora of hundreds of billions of tokens
- Multimodal potential: a single model (e.g., GPT-4o) can process text, images, and audio
- Versatility: adaptable to very different tasks without architectural changes
- Emergent abilities: capabilities that appear only beyond certain scales (zero-shot learning, chain-of-thought reasoning)
Key Characteristics
Pre-training on generic data: the model is trained on broad datasets (Common Crawl, Wikipedia, books, code) without labels, with objectives like next-token prediction or masked language modeling.
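The next-token prediction objective can be sketched with a toy bigram "model" in plain Python; the corpus and count-based probabilities are purely illustrative stand-ins for a neural network:

```python
import math
from collections import Counter, defaultdict

# Toy illustration of next-token prediction. The "model" here is just
# bigram counts; a real foundation model learns these conditional
# probabilities with a neural network over web-scale text.
corpus = "the cat sat on the mat".split()

counts = defaultdict(Counter)                 # estimates p(next | current)
for cur, nxt in zip(corpus, corpus[1:]):
    counts[cur][nxt] += 1

def p_next(cur, nxt):
    total = sum(counts[cur].values())
    return counts[cur][nxt] / total

# Pre-training loss: average negative log-likelihood of each actual
# next token under the model's predicted distribution.
loss = sum(-math.log(p_next(c, n)) for c, n in zip(corpus, corpus[1:]))
loss /= len(corpus) - 1
print(f"mean next-token NLL: {loss:.4f}")
```

A real model minimizes the same loss by gradient descent, replacing the count table with billions of learned parameters.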
Transfer learning: the pre-trained model captures generic linguistic/visual patterns that transfer to downstream tasks.
Task flexibility: the same model can be fine-tuned, prompt-engineered, or used in-context for classification, generation, reasoning, and more without architectural modification.
Cost asymmetry: pre-training is expensive (millions of dollars in compute) but is amortized across all downstream uses; adapting to any single task is comparatively inexpensive.
Adaptation Paradigms
Fine-tuning: updating model parameters on task-specific data. Full fine-tuning modifies all parameters (expensive). Parameter-efficient fine-tuning (LoRA, QLoRA, prefix tuning) modifies less than 1% of parameters.
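The low-rank idea behind LoRA can be sketched with a quick parameter count; the layer dimensions and rank below are illustrative assumptions:

```python
# LoRA parameter-count sketch. For a frozen d x k weight matrix W, LoRA
# learns a low-rank update so the effective weight becomes
#     W_eff = W + (alpha / r) * B @ A,  with B: d x r, A: r x k, r << min(d, k).
# Only A and B are trained; W stays frozen. Sizes here are illustrative.
d, k, r = 4096, 4096, 8

frozen = d * k                # parameters in W (never updated)
trainable = d * r + r * k     # parameters in B and A (updated)
fraction = trainable / frozen
print(f"trainable: {trainable:,} of {frozen:,} ({fraction:.4%})")
```

With these sizes the adapter holds under 0.4% of the layer's parameters, which is how parameter-efficient methods stay well below 1% of the full model.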
Prompt Engineering: careful prompt formulation to extract model capabilities without updating parameters. Zero-shot, few-shot, and chain-of-thought are prompt engineering techniques.
In-Context Learning: the model learns from examples in the prompt (few-shot) without parameter updates. Emergent property at high scales.
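A minimal sketch of a few-shot prompt for a hypothetical sentiment-classification task; the model call itself is omitted, since the point is that adaptation happens entirely inside the prompt:

```python
# Few-shot in-context learning sketch: labeled examples go directly into
# the prompt; no parameters are updated. The task and reviews are
# hypothetical, and the actual model call is omitted.
examples = [
    ("I loved this movie!", "positive"),
    ("Terrible, a waste of time.", "negative"),
    ("Best purchase I've made all year.", "positive"),
]
query = "The plot dragged and the acting was flat."

prompt = "Classify the sentiment of each review.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"   # the model completes this line

print(prompt)
```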
Retrieval-Augmented Adaptation: the model accesses external knowledge bases through retrieval to augment its responses, a hybrid between static parameter updates (fine-tuning) and dynamic, query-time adaptation.
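The retrieval step can be sketched with bag-of-words cosine similarity; real systems use learned dense embeddings and a vector index, and all documents and names below are illustrative:

```python
import math
from collections import Counter

# Minimal retrieval-augmented sketch: documents and query are embedded
# as bag-of-words vectors, the closest document is retrieved, and it is
# prepended to the prompt as context. All text here is illustrative.
docs = [
    "LoRA adapts a model by training low-rank update matrices.",
    "RAG augments generation with documents fetched at query time.",
    "Next-token prediction is a common pre-training objective.",
]

def embed(text):
    cleaned = text.lower().replace("?", " ").replace(".", " ").replace(",", " ")
    return Counter(cleaned.split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b))

query = "How does RAG fetch documents?"
best = max(docs, key=lambda d: cosine(embed(query), embed(d)))

# The retrieved passage becomes context for the (omitted) model call.
prompt = f"Context: {best}\n\nQuestion: {query}\nAnswer:"
print(prompt)
```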
Major Foundation Models (2025)
Text (LLM):
- Closed-source: GPT-4/4o, Claude 3.5, Gemini 1.5
- Open-weights: Llama 3, Mistral, Qwen, Phi, DeepSeek
Vision:
- Closed-source: GPT-4o, Claude 3.5, Gemini 1.5
- Open-weights: Vision Transformer (ViT), LLaVA, Qwen-VL
Multimodal:
- Closed-source: GPT-4o, Claude 3.5, Gemini 1.5
- Open-weights: LLaVA, CogVLM
Code:
- Specialized models and products: CodeLlama, GitHub Copilot (a product built on foundation models), Claude 3.5
Practical Considerations
Model selection: depends on the task, latency requirements, cost budget, and privacy constraints. A frontier model can cost 10-100x more per token than a comparable open-weights model.
Licensing: open-weights models ship under varying licenses. Llama's community license, for example, restricts commercial use by entities above a monthly-active-user threshold (700 million MAU). Review licenses carefully before commercial deployment.
Continual learning: foundation models do not learn from post-deployment interactions without retraining. For evolving scenarios, RAG or periodic fine-tuning is necessary.
Bias mitigation: the model inherits biases from training data (e.g., gender bias in web-scale datasets). Mitigation through RLHF, fine-tuning on balanced data, or prompt engineering, but does not eliminate the problem completely.
Common Misconceptions
"A foundation model solves everything"
No. A generic model may perform poorly in highly specialized domains (medicine, law), where fine-tuning on in-domain data is critical.
"Once pre-trained, it costs nothing to adapt"
Fine-tuning has non-negligible computational costs (GPU time, storage), and every inference request carries a compute cost. At scale, the total cost of ownership (TCO) of adaptation becomes significant.
"The largest foundation model is always better"
It depends. For many tasks, fine-tuned 7B-13B models outperform 100B+ models on task-specific metrics, with significantly lower latency and cost.
Related Terms
- LLM: category of foundation models for natural language
- Fine-tuning: adaptation technique for foundation models
- Prompt Engineering: art of formulating prompts to extract foundation model capabilities
- Transformer: underlying architecture of modern foundation models
- Transfer Learning: general paradigm of which foundation models are an instance
Sources
- Bommasani, R. et al. (2021). On the Opportunities and Risks of Foundation Models. arXiv (comprehensive survey)
- Dosovitskiy, A. et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR
- Lester, B. et al. (2021). The Power of Scale for Parameter-Efficient Prompt Tuning. EMNLP
- Stanford Foundation Model Hub: tracking and evaluating foundation models