Definition
RLHF (Reinforcement Learning from Human Feedback) is an alignment technique that uses human preferences to train LLMs to produce more useful, accurate, and safe outputs. It turns a base model, which merely continues text by predicting likely next tokens, into an assistant that responds helpfully.
RLHF was popularized by ChatGPT and is used, with variants, by most frontier models.
How It Works
The typical process involves three phases:
1. Supervised Fine-Tuning (SFT): the base model is fine-tuned on a dataset of demonstrations (prompt → ideal response) created by human annotators.
2. Reward Model Training: human annotators compare response pairs and indicate which they prefer. This preference data trains a “reward model” that learns to predict which output a human would prefer.
3. RL Optimization: the model is optimized with reinforcement learning (typically PPO) to maximize the reward model's score, with a regularization term (a KL penalty) that keeps it from diverging too far from the original model, as sketched below.
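The objectives behind phases 2 and 3 fit in a few lines. The sketch below assumes PyTorch; the argument names, tensor shapes, and β value are illustrative rather than taken from any particular implementation.

```python
import torch.nn.functional as F

def reward_model_loss(score_chosen, score_rejected):
    # Pairwise (Bradley-Terry) loss: push the scalar score of the
    # preferred response above the score of the rejected one.
    return -F.logsigmoid(score_chosen - score_rejected).mean()

def kl_penalized_reward(rm_score, logprobs_policy, logprobs_ref, beta=0.1):
    # Signal actually maximized in the RL phase: the reward model's score
    # minus a KL penalty that keeps the fine-tuned policy close to the
    # original (reference) model.
    per_token_kl = logprobs_policy - logprobs_ref   # log-ratio per generated token
    return rm_score - beta * per_token_kl.sum(dim=-1)
```

The KL term is the regularization mentioned in step 3: without it, the policy drifts toward whatever the reward model happens to score highly, including degenerate outputs.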
Why It’s Needed
A pre-trained LLM is trained to predict the next token, not to be helpful. It can generate toxic output, refuse innocuous questions, or ignore user instructions.
RLHF “aligns” the model to desired behaviors: following instructions, being helpful, refusing harmful requests, admitting uncertainty.
Alternatives and Developments
DPO (Direct Preference Optimization): optimizes the policy directly on preference pairs, with no separate reward model and no RL loop. Simpler and more stable to train, and increasingly popular.
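For comparison with the RLHF objectives above, a minimal sketch of the DPO loss from Rafailov et al. (2023), again assuming PyTorch; the inputs are summed per-response log-probabilities and the β value is illustrative. The frozen reference model plays the role that the KL penalty plays in RLHF.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Each argument is the summed log-probability of a full response under
    # either the policy being trained or the frozen reference model.
    # The implicit reward of a response is beta * (policy logp - reference logp);
    # the loss pushes the chosen response's implicit reward above the rejected one's.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```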
RLAIF (Reinforcement Learning from AI Feedback): uses another LLM instead of human annotators to generate the preference labels. Scales more cheaply but inherits the biases of the rater model.
Constitutional AI (Anthropic): the model critiques and revises its own outputs according to a written set of principles (a “constitution”), reducing dependence on direct human feedback.
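A toy sketch of that critique-and-revise loop; generate stands in for any model call (a hypothetical helper, not a real API), and the principle string is invented for illustration.

```python
PRINCIPLE = "Be helpful, and avoid harmful or deceptive content."

def critique_and_revise(generate, prompt, rounds=1):
    # Ask the model for an answer, then have it critique and rewrite
    # that answer against the stated principle.
    response = generate(prompt)
    for _ in range(rounds):
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique this response against the principle: {PRINCIPLE}"
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response so that it satisfies the principle."
        )
    # In Constitutional AI, the revised responses (not the critiques)
    # become training data for a subsequent fine-tuning step.
    return response
```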
Practical Considerations
Annotation costs: RLHF requires thousands of human preference comparisons, which are expensive and slow to collect at scale.
Reward hacking: the model can learn to “game” the reward model, producing output that scores high but isn’t actually better.
Distributional shift: annotators’ preferences may not represent those of all users, and biases in the preference data transfer to the model.
Common Misconceptions
“RLHF makes the model smarter”
No. RLHF modifies behavior, not knowledge. The model becomes more helpful and aligned, not more capable at reasoning or more accurate on facts.
“RLHF solves safety”
It reduces undesirable behaviors but doesn’t eliminate them. Jailbreaks and prompt injection can bypass protections.
“RLHF was invented for ChatGPT”
The technique predates ChatGPT: OpenAI and DeepMind used human preferences to train Atari agents in 2017, and OpenAI used it for summarization in 2020. ChatGPT made it famous by applying it at scale to a conversational model.
Related Terms
- LLM: models to which RLHF is applied
- Fine-tuning: supervised fine-tuning (SFT) is the first phase of the RLHF pipeline
Sources
- Ouyang, L. et al. (2022). Training language models to follow instructions with human feedback. NeurIPS
- Bai, Y. et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv
- Rafailov, R. et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS