Definition
Fine-tuning is the process of adapting a pre-trained model (typically a foundation model) to a specific task, domain, or style through additional training on a targeted dataset.
The base model has already learned general representations during pre-training. Fine-tuning specializes these representations, yielding better performance on the target task with far less data and compute than training from scratch.
Types of Fine-tuning
Full fine-tuning: updates all model parameters. Maximum flexibility but requires more compute and risks overfitting/catastrophic forgetting.
Parameter-Efficient Fine-Tuning (PEFT): updates only a subset of parameters.
- LoRA (Low-Rank Adaptation): adds trainable low-rank matrices alongside existing weight matrices, reducing trainable parameters by 99%+ (see the sketch after this list).
- QLoRA: LoRA applied to a 4-bit quantized base model. Enables fine-tuning of a 65B-parameter model on a single 48 GB GPU (Dettmers et al., 2023).
- Prefix tuning: prepends trainable continuous vectors ("virtual tokens") to the sequence; only these vectors are updated.
- Adapter layers: inserts small trainable bottleneck modules between existing layers.
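To make the PEFT idea concrete, here is a minimal LoRA sketch using the Hugging Face transformers and peft libraries. The model name, target modules, and hyperparameters are illustrative assumptions, not recommendations.

```python
# Minimal LoRA sketch with the Hugging Face peft library.
# Model name, target modules and hyperparameters are illustrative only.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # any causal LM

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt (architecture-dependent)
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```

QLoRA follows the same pattern, except the base model is first loaded in 4-bit precision (e.g. via bitsandbytes) before the adapters are attached.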
Instruction tuning: fine-tuning on datasets of (instruction, response) pairs to improve instruction-following.
When to Fine-tune
Recommended for:
- Repetitive tasks with specific output format
- Domains with specialized terminology (legal, medical)
- Need for consistent style/tone
- Classification with many domain-specific classes
- When prompting fails to achieve desired quality
Alternatives to consider:
- Prompt engineering: often sufficient, zero training costs
- RAG: for knowledge retrieval without modifying the model
- Few-shot prompting: examples included in the prompt instead of training (a sketch follows this list)
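For comparison, here is a hedged sketch of the few-shot alternative using the OpenAI Python SDK; the model name and the classification task are assumptions made for illustration.

```python
# Few-shot prompting sketch (OpenAI Python SDK); model name and examples are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

few_shot_messages = [
    {"role": "system", "content": "Classify the support ticket as 'billing', 'bug', or 'other'."},
    {"role": "user", "content": "I was charged twice this month."},
    {"role": "assistant", "content": "billing"},
    {"role": "user", "content": "The export button crashes the app."},
    {"role": "assistant", "content": "bug"},
    {"role": "user", "content": "Can I change my invoice address?"},  # the actual query
]

response = client.chat.completions.create(model="gpt-4o-mini", messages=few_shot_messages)
print(response.choices[0].message.content)
```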
Practical Considerations
Dataset: quality beats quantity. 500-1000 high-quality examples often outperform 10K noisy examples. Typical format: (input, expected output) pairs.
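For illustration, a single training example in the chat-style JSONL format used by several fine-tuning APIs (field names follow OpenAI's format; the content is invented):

```python
# One training example in chat-style fine-tuning format (content invented for illustration).
import json

example = {
    "messages": [
        {"role": "system", "content": "You answer in the house style: concise, no emojis."},
        {"role": "user", "content": "Summarize the attached incident report."},
        {"role": "assistant", "content": "Root cause: expired TLS certificate. Impact: 40 minutes of checkout downtime."},
    ]
}

# A JSONL training file is simply one such JSON object per line.
with open("train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```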
Costs: vary enormously. Fine-tuning GPT-4o via the OpenAI API costs ~$25 per million training tokens; self-hosting with LoRA on open models requires GPU time (an A100 rents for ~$2/hour).
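As a rough order of magnitude (with invented numbers): 1,000 examples averaging 500 tokens each, trained for 3 epochs, amount to about 1.5 million training tokens, or roughly $37.50 at the $25/million rate above.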
Evaluation: define specific metrics before fine-tuning. Compare against baseline (base model + prompting) to verify added value.
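A minimal sketch of such a baseline comparison; the two predict functions are placeholders standing in for calls to the base and fine-tuned models, and exact match stands in for whatever task metric applies.

```python
# Baseline comparison sketch: base model + prompting vs. fine-tuned model on the same held-out set.
# The two predict functions are placeholders; in practice they would call the respective models.

def predict_with_base_model(text: str) -> str:
    return "other"  # placeholder: would send `text` to the base model with a task prompt

def predict_with_finetuned_model(text: str) -> str:
    return "billing"  # placeholder: would send `text` to the fine-tuned model

def exact_match_accuracy(predict, test_set):
    """Fraction of held-out examples whose prediction matches the reference exactly."""
    return sum(predict(x).strip() == y.strip() for x, y in test_set) / len(test_set)

test_set = [("I was charged twice this month.", "billing"),
            ("The export button crashes the app.", "bug")]

print(f"baseline:   {exact_match_accuracy(predict_with_base_model, test_set):.0%}")
print(f"fine-tuned: {exact_match_accuracy(predict_with_finetuned_model, test_set):.0%}")
```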
Risks: catastrophic forgetting (the model loses general capabilities), overfitting (it memorizes training examples instead of generalizing), and train/test leakage (test examples appearing in the training data).
Common Misconceptions
"Fine-tuning is always better than prompting"
No. For many tasks, few-shot prompting on frontier models performs comparably or better, without training costs and with greater flexibility.
"You need a huge dataset"
With PEFT and modern models, a few hundred quality examples can suffice. The focus should be on the quality and diversity of examples, not their quantity.
"Fine-tuning = the model learns new information"
Fine-tuning modifies behaviors and style, but is inefficient for injecting new factual knowledge. RAG is more appropriate for that.
Related Terms
- LLM: models typically subject to fine-tuning
- RLHF: alignment technique applied after fine-tuning
- Foundation Model: starting point for fine-tuning
- RAG: alternative/complement to fine-tuning
Sources
- Hu, E. et al. (2022). LoRA: Low-Rank Adaptation of Large Language Models. ICLR
- Dettmers, T. et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS
- OpenAI. Fine-tuning Documentation