What is Reinforcement Learning from Human Feedback (RLHF)?
The technique that turned GPT-3 into ChatGPT — teaching language models to be helpful.
RLHF is a training technique where human annotators rank or compare multiple LLM outputs, those rankings train a reward model that predicts human preferences, and the base LLM is then fine-tuned via reinforcement learning to maximize that reward model's score. Popularized by OpenAI in 2022 with InstructGPT and ChatGPT (the underlying technique builds on earlier work on reinforcement learning from human preferences), RLHF is the core method that made raw language models helpful and harmless rather than just statistical next-token predictors.
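The first half of that loop, turning human rankings into a reward model, can be sketched in a few lines. This is a toy illustration in plain NumPy, not any lab's actual implementation: the "responses" are random feature vectors, the hidden preference direction and all variable names are invented, and the reward model is linear, trained with the standard Bradley-Terry pairwise loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: each "response" is a 3-dim feature vector; a hidden preference
# direction decides which of two responses an annotator would prefer.
true_pref = np.array([1.0, -2.0, 0.5])
chosen = rng.normal(size=(256, 3)) + 0.5 * true_pref    # preferred responses
rejected = rng.normal(size=(256, 3)) - 0.5 * true_pref  # dispreferred ones

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Linear reward model r(x) = w @ x, trained with the Bradley-Terry pairwise
# loss: -log sigma(r(chosen) - r(rejected)).
w = np.zeros(3)
lr = 0.1
for _ in range(200):
    margin = chosen @ w - rejected @ w
    grad = -((1 - sigmoid(margin))[:, None] * (chosen - rejected)).mean(axis=0)
    w -= lr * grad

# A trained reward model should rank the preferred response higher
# for most pairs.
accuracy = np.mean(chosen @ w > rejected @ w)
print(f"pairwise accuracy: {accuracy:.2f}")
```

In full RLHF this reward model would then be frozen and its score used as the reward that an RL algorithm such as PPO maximizes during fine-tuning.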
In depth
Examples
- →OpenAI InstructGPT (2022) — the paper that popularized RLHF and its name; turned GPT-3 into a model that follows instructions
- →OpenAI ChatGPT — the consumer product that made RLHF famous; massive RLHF scale trained on rankings from contractor pools
- →Anthropic Claude — uses Constitutional AI (RLHF plus AI-judged critiques guided by a written constitution)
- →Google Gemini instruction-tuned versions — RLHF-trained on Google's preference data
- →Meta Llama 2/3 Instruct variants — RLHF-trained using open preference datasets and internal human annotation
- →DPO (Direct Preference Optimization, 2023) — simpler alternative to full RLHF, widely adopted for fine-tuning open models
- →OpenAI's o1 and o3 reasoning models — reported to extend RLHF with process reward models that score intermediate reasoning steps
Frequently asked questions
Why is RLHF needed if the base model already knows so much?
The base model knows how to continue text in the style of its training data — the open internet — but doesn't specifically know that when a user writes 'how do I file my taxes?' they want an answer rather than a continuation of a tax-related article. RLHF teaches the model the meta-skill of 'output the thing a human would consider a useful response to this input'. Without RLHF, you can get useful outputs with careful prompting, but the model isn't optimizing for helpfulness by default.
Do open-source models use RLHF too?
Yes. Llama 2 and 3 Instruct versions are RLHF-trained. Mistral Instruct, Qwen Instruct, and DeepSeek Chat all use RLHF or RLHF variants (DPO, RLAIF). The base/pretrained versions of these models do not — they're raw next-token predictors. When you hear 'Llama 3 Instruct is good at following instructions', that's RLHF doing its work. The technical quality of open-source RLHF has improved enormously from 2023 to 2026 and is now competitive with closed labs for many tasks.
What's the difference between RLHF and fine-tuning?
Fine-tuning (standard supervised fine-tuning, SFT) trains a model on (prompt, ideal response) pairs — teaching it the one right answer per prompt. RLHF trains on (prompt, response_a, response_b, preference) tuples — teaching it which of multiple responses is preferred. RLHF is strictly more expressive but also more complex to run. Most production fine-tuning of private models done by customers of OpenAI, Anthropic, or open-source Llama is SFT, not RLHF — RLHF is typically done by the model provider at the base-model level.
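The two data formats can be contrasted concretely. The following is a hypothetical sketch: the field names, example strings, and reward values are all invented for illustration, and the preference probability uses the standard Bradley-Terry formulation.

```python
import math

# Supervised fine-tuning (SFT): one gold response per prompt.
sft_example = {
    "prompt": "Summarize this article.",
    "response": "The article argues that remote work boosts productivity.",
}

# RLHF preference data: two candidate responses plus a human ranking.
preference_example = {
    "prompt": "Summarize this article.",
    "response_a": "The article argues that remote work boosts productivity.",
    "response_b": "remote work article summary productivity yes",
    "preferred": "a",  # annotator judged response_a better
}

# A reward model turns that ranking into scalar scores; under the
# Bradley-Terry model, the probability an annotator prefers a over b
# is sigma(r_a - r_b).
def preference_probability(reward_a: float, reward_b: float) -> float:
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

print(preference_probability(2.0, 0.5))  # well above 0.5: a clearly preferred
```

Note that the preference tuple never says what the one right answer is, only which of two candidates is better, which is why RLHF can capture judgments (tone, helpfulness, safety) that are hard to write down as a single gold response.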
Can I do RLHF on my own model?
Technically yes but usually not worth it. The raw RLHF loop requires generating samples, collecting rankings (thousands at minimum for meaningful signal), training a reward model, and running PPO fine-tuning — hundreds of GPU-hours and significant engineering. DPO is simpler and runs on consumer GPUs for small models. RLAIF using a stronger model as the judge is cheaper still. But unless you have a specific behavior issue not solvable by prompting + SFT, stick with prompting and SFT. The behavior baked in by the provider's RLHF is usually good enough.
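For a sense of why DPO is simpler, its per-pair objective fits in one function: it needs only log-probabilities from the policy and a frozen reference model, with no separate reward model or PPO loop. The numbers below are made up for illustration.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair:
    -log sigma(beta * [(logp_c - ref_logp_c) - (logp_r - ref_logp_r)])."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss falls as the policy raises the chosen response's likelihood
# relative to the reference and lowers the rejected one's.
before = dpo_loss(-12.0, -10.0, -12.0, -10.0)  # policy == reference
after = dpo_loss(-9.0, -13.0, -12.0, -10.0)    # policy moved toward chosen
print(before, after)
```

Minimizing this loss over a preference dataset is ordinary gradient descent, which is why DPO fits on consumer GPUs for small models where the full generate-rank-reward-PPO loop does not.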
What are the risks or controversies of RLHF?
Several. (1) Value lock-in — RLHF trains in the values and aesthetic preferences of whoever did the annotation, which may not match all users. (2) Sycophancy — models learn to agree with users because agreement was often rated higher than correction. (3) Hallucination reinforcement — if the reward model rewards confident tone, the LLM learns to sound confident even when wrong. (4) Labor questions — large-scale RLHF relies on contractor pools often in lower-wage regions; conditions and compensation have been controversial. Most labs now combine RLHF with other techniques (Constitutional AI, debate, process rewards) to mitigate these.