Definition
RLHF (Reinforcement Learning from Human Feedback) is an alignment technique that uses human preferences to train LLMs to produce more useful, accurate, and safe outputs. It turns a base model, which merely continues text by predicting likely next tokens, into an assistant that responds helpfully.
RLHF was popularized by ChatGPT and is used, with variants, by most frontier models.
How It Works
The typical process involves three phases:
1. Supervised Fine-Tuning (SFT): the base model is fine-tuned on a dataset of demonstrations (prompt → ideal response) created by human annotators.
2. Reward Model Training: human annotators compare response pairs and indicate which they prefer. This preference data trains a “reward model” that learns to predict which output a human would prefer.
3. RL Optimization: the model is optimized with reinforcement learning (typically PPO) to maximize the reward model's score, with a regularization term (a KL penalty) that keeps it from diverging too far from the original model, as sketched below.
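The objectives behind phases 2 and 3 fit in a few lines. The sketch below assumes PyTorch; the argument names, tensor shapes, and β value are illustrative rather than taken from any particular implementation.

```python
import torch.nn.functional as F

def reward_model_loss(score_chosen, score_rejected):
    # Pairwise (Bradley-Terry) loss: push the scalar score of the
    # preferred response above the score of the rejected one.
    return -F.logsigmoid(score_chosen - score_rejected).mean()

def kl_penalized_reward(rm_score, logprobs_policy, logprobs_ref, beta=0.1):
    # Signal actually maximized in the RL phase: the reward model's score
    # minus a KL penalty that keeps the fine-tuned policy close to the
    # original (reference) model.
    per_token_kl = logprobs_policy - logprobs_ref   # log-ratio per generated token
    return rm_score - beta * per_token_kl.sum(dim=-1)
```

The KL term is the regularization mentioned in step 3: without it, the policy drifts toward whatever the reward model happens to score highly, including degenerate outputs.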
Why It’s Needed
A pre-trained LLM is trained to predict the next token, not to be helpful. It can generate toxic output, refuse innocuous questions, or ignore user instructions.
RLHF “aligns” the model to desired behaviors: following instructions, being helpful, refusing harmful requests, admitting uncertainty.
Alternatives and Developments
DPO (Direct Preference Optimization): optimizes the policy directly on preference pairs, with no separate reward model and no RL loop. Simpler and more stable to train, and increasingly popular.
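For comparison with the RLHF objectives above, a minimal sketch of the DPO loss from Rafailov et al. (2023), again assuming PyTorch; the inputs are summed per-response log-probabilities and the β value is illustrative. The frozen reference model plays the role that the KL penalty plays in RLHF.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Each argument is the summed log-probability of a full response under
    # either the policy being trained or the frozen reference model.
    # The implicit reward of a response is beta * (policy logp - reference logp);
    # the loss pushes the chosen response's implicit reward above the rejected one's.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```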
RLAIF (Reinforcement Learning from AI Feedback): uses another LLM instead of human annotators to generate the preference labels. Scales more cheaply but inherits the biases of the rater model.
Constitutional AI (Anthropic): the model critiques and revises its own outputs according to a written set of principles (a “constitution”), reducing dependence on direct human feedback.
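A toy sketch of that critique-and-revise loop; generate stands in for any model call (a hypothetical helper, not a real API), and the principle string is invented for illustration.

```python
PRINCIPLE = "Be helpful, and avoid harmful or deceptive content."

def critique_and_revise(generate, prompt, rounds=1):
    # Ask the model for an answer, then have it critique and rewrite
    # that answer against the stated principle.
    response = generate(prompt)
    for _ in range(rounds):
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique this response against the principle: {PRINCIPLE}"
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response so that it satisfies the principle."
        )
    # In Constitutional AI, the revised responses (not the critiques)
    # become training data for a subsequent fine-tuning step.
    return response
```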
Practical Considerations
Annotation costs: RLHF requires thousands of human preference comparisons, which are expensive and slow to collect at scale.
Reward hacking: the model can learn to “game” the reward model, producing output that scores high but isn’t actually better.
Distributional shift: annotators’ preferences may not represent those of all users, and biases in the preference data transfer to the model.
Common Misconceptions
“RLHF makes the model smarter”
No. RLHF modifies behavior, not knowledge. The model becomes more helpful and aligned, not more capable at reasoning or more accurate on facts.
“RLHF solves safety”
It reduces undesirable behaviors but doesn’t eliminate them. Jailbreaks and prompt injection can bypass protections.
“RLHF was invented for ChatGPT”
The technique predates ChatGPT: OpenAI and DeepMind used human preferences to train Atari agents in 2017, and OpenAI used it for summarization in 2020. ChatGPT made it famous by applying it at scale to a conversational model.
Related Terms
- LLM: models to which RLHF is applied
- Fine-tuning: supervised fine-tuning (SFT) is the first phase of the RLHF pipeline
Sources
- Ouyang, L. et al. (2022). Training language models to follow instructions with human feedback. NeurIPS
- Bai, Y. et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv
- Rafailov, R. et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS