What is Reinforcement Learning from Human Feedback (RLHF)?

The technique that turned GPT-3 into ChatGPT — teaching language models to be helpful.

Updated Apr 2026
Short answer

RLHF is a training technique where human annotators rank or compare multiple LLM outputs, those rankings train a reward model that predicts human preferences, and the base LLM is then fine-tuned via reinforcement learning to maximize that reward model's score. Popularized by OpenAI's 2022 InstructGPT paper and by ChatGPT, RLHF is the core method that made raw language models helpful and harmless rather than just statistical next-token predictors.

In depth

A base LLM such as GPT-3 or a pretrained Llama 3.1 is a next-token predictor. It completes text in the style of its training data, which is the open internet — so asking 'how do I file my taxes?' might get you a plausible continuation of a tax article, not an actual answer to your question. It knows an enormous amount but doesn't know that 'answer the user's question helpfully' is what you want. RLHF is how you teach it.

The technique has three stages, formalized by OpenAI in the 2022 InstructGPT paper:

(1) Supervised fine-tuning (SFT): human demonstrators write ideal responses to sample prompts, and the base model is fine-tuned on those demonstrations. This alone gets you part of the way.

(2) Reward model training: for each prompt, multiple model outputs are generated and human annotators rank them from best to worst. A separate reward model — typically a smaller transformer — is trained to predict which output a human would prefer, so it can score any output numerically.

(3) Reinforcement learning (usually PPO, Proximal Policy Optimization): the SFT model is further fine-tuned to maximize the reward model's score, with a KL-divergence penalty to prevent it from drifting too far from the SFT baseline.

The magic is in stage 2 — the reward model. It turns millions of messy, qualitative human preferences into a numerical signal the RL process can optimize against. Without it, you'd need a human in the loop of every training step, which doesn't scale. With it, the reward model can score billions of candidate outputs, and the RL process iteratively pushes the LLM toward outputs the reward model (and by extension, humans) prefer.

RLHF is how every major instruction-tuned model was made helpful: OpenAI's GPT-3.5, 4, and 5; Anthropic's Claude family; Google's Gemini; Meta's Llama 2 and 3 instruction variants; Mistral Instruct; Grok. The specifics differ — Anthropic's Constitutional AI adds self-critique rounds where an AI judge applies rules from a written constitution; OpenAI has experimented with process reward models that score reasoning steps rather than final outputs — but the core RLHF loop is universal.

RLHF has known problems. (1) Reward hacking: the LLM learns to optimize the reward model in ways humans wouldn't actually prefer — e.g., producing confidently wrong answers because the reward model rated confident tones highly, or sycophancy because the reward model learned that annotators liked being agreed with. (2) Annotator bias: the preferences the reward model learns reflect the biases of the annotators (often contractor pools with specific demographic and cultural characteristics). (3) Cost: large-scale RLHF requires thousands of human annotation hours — out of reach for individuals but routine for labs with multimillion-dollar training budgets. (4) Capability regression: naive RLHF can make models worse at skills humans don't rate — harder-to-evaluate reasoning, rare coding skills — if the ranking tasks don't include them.

Recent advances include DPO (Direct Preference Optimization, 2023), which skips the separate reward model by training directly on preference pairs, making RLHF-style training cheaper and more stable; RLAIF (Reinforcement Learning from AI Feedback), which uses LLMs to do the ranking that humans used to do, removing the human annotation bottleneck; and Constitutional AI, which supplements human feedback with AI-generated critiques guided by a written constitution. All of these are variations on the core RLHF idea.

For builders: you almost never run RLHF yourself. You use an already-RLHF'd base model (Claude, GPT, Gemini, Llama Instruct) and shape its behavior through prompting and optional fine-tuning. RLHF is what the model provider did to produce a model that follows your instructions in the first place.
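Stages 2 and 3 boil down to two small formulas, sketched below in plain Python. This is a toy numeric sketch, not any library's API: the reward model is trained with a Bradley-Terry pairwise loss over ranked pairs, and the RL stage maximizes the reward model's score minus a KL-style penalty for drifting from the SFT model. Function names and the beta value are illustrative.

```python
import math

def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss for reward model training:
    -log sigmoid(r_chosen - r_rejected). The loss falls as the
    reward model scores the human-preferred output higher."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

def shaped_reward(r: float, logp_policy: float, logp_sft: float,
                  beta: float = 0.1) -> float:
    """RL-stage reward with the KL penalty: the reward model's score
    minus beta times the log-probability gap between the current
    policy and the frozen SFT baseline."""
    return r - beta * (logp_policy - logp_sft)

# The loss shrinks as the margin between chosen and rejected grows:
assert reward_model_loss(2.0, 0.0) < reward_model_loss(0.5, 0.0)
# Drifting further from the SFT model reduces the shaped reward:
assert shaped_reward(1.0, -1.0, -2.0) < shaped_reward(1.0, -2.0, -2.0)
```

The KL term is why RLHF'd models stay fluent: the policy is rewarded for pleasing the reward model only insofar as it stays close to the SFT distribution.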

Examples

  • OpenAI InstructGPT (2022) — the paper that named and popularized RLHF; turned GPT-3 into a model that follows instructions
  • OpenAI ChatGPT — the consumer product that made RLHF famous; trained with large-scale RLHF on rankings from contractor pools
  • Anthropic Claude — uses Constitutional AI (RLHF plus AI-judged critiques guided by a written constitution)
  • Google Gemini instruction-tuned versions — RLHF-trained on Google's preference data
  • Meta Llama 2/3 Instruct variants — RLHF-trained using open preference datasets and internal human annotation
  • DPO (Direct Preference Optimization, 2023) — simpler alternative to full RLHF, widely adopted for fine-tuning open models
  • OpenAI's o1 and o3 reasoning models — extend RLHF with process reward models that score intermediate reasoning steps

Frequently asked questions

Why is RLHF needed if the base model already knows so much?

The base model knows how to continue text in the style of its training data — the open internet — but doesn't specifically know that when a user writes 'how do I file my taxes?' they want an answer rather than a continuation of a tax-related article. RLHF teaches the model the meta-skill of 'output the thing a human would consider a useful response to this input'. Without RLHF, you can get useful outputs with careful prompting, but the model isn't optimizing for helpfulness by default.

Do open-source models use RLHF too?

Yes. Llama 2 and 3 Instruct versions are RLHF-trained. Mistral Instruct, Qwen Instruct, and DeepSeek Chat all use RLHF or RLHF variants (DPO, RLAIF). The base/pretrained versions of these models do not — they're raw next-token predictors. When you hear 'Llama 3 Instruct is good at following instructions', that's RLHF doing its work. The technical quality of open-source RLHF has improved enormously from 2023 to 2026 and is now competitive with closed labs for many tasks.

What's the difference between RLHF and fine-tuning?

Fine-tuning (standard supervised fine-tuning, SFT) trains a model on (prompt, ideal response) pairs — teaching it the one right answer per prompt. RLHF trains on (prompt, response_a, response_b, preference) tuples — teaching it which of multiple responses is preferred. RLHF is more expressive but also more complex to run. Most production fine-tuning, whether by customers of OpenAI and Anthropic or by users of open-source Llama, is SFT, not RLHF; RLHF is typically done by the model provider at the base-model level.
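The difference in data shapes can be made concrete. These dicts are illustrative, not any particular library's schema, though prompt/chosen/rejected is a common convention in preference datasets:

```python
# SFT: one prompt, one ideal response -- the model learns to imitate it.
sft_example = {
    "prompt": "How do I file my taxes?",
    "response": "Start by gathering your income documents for the year...",
}

# RLHF/DPO: one prompt, two candidate responses, and a human preference.
preference_example = {
    "prompt": "How do I file my taxes?",
    "chosen": "Start by gathering your income documents for the year...",
    "rejected": "Many people file taxes every year. In this article we...",
}
```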

Can I do RLHF on my own model?

Technically yes, but it is usually not worth it. The full RLHF loop requires generating samples, collecting rankings (thousands at minimum for a meaningful signal), training a reward model, and running PPO fine-tuning — hundreds of GPU-hours and significant engineering. DPO is simpler and runs on consumer GPUs for small models. RLAIF, using a stronger model as the judge, is cheaper still. But unless you have a specific behavior issue not solvable by prompting plus SFT, stick with prompting and SFT. The behavior baked in by the provider's RLHF is usually good enough.
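To see why DPO is the cheaper route, here is a toy sketch of its per-pair loss in plain Python (names and the beta value are illustrative, not a library API). Instead of training a reward model, DPO treats the policy's log-probability advantage over a frozen reference model as an implicit reward:

```python
import math

def dpo_loss(lp_pol_chosen: float, lp_ref_chosen: float,
             lp_pol_rejected: float, lp_ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair: -log sigmoid(beta * margin),
    where the margin is how much more the policy (relative to the
    frozen reference model) favors the chosen response over the
    rejected one."""
    margin = beta * ((lp_pol_chosen - lp_ref_chosen)
                     - (lp_pol_rejected - lp_ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy favors the chosen response more than the reference
# does, the loss drops below the neutral value log(2):
assert dpo_loss(-1.0, -1.5, -2.0, -1.5) < math.log(2.0)
```

One gradient step on this loss nudges the policy toward the chosen response and away from the rejected one, with no reward model and no PPO rollout loop in between.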

What are the risks or controversies of RLHF?

Several. (1) Value lock-in — RLHF trains in the values and aesthetic preferences of whoever did the annotation, which may not match all users. (2) Sycophancy — models learn to agree with users because agreement was often rated higher than correction. (3) Hallucination reinforcement — if the reward model rewards confident tone, the LLM learns to sound confident even when wrong. (4) Labor questions — large-scale RLHF relies on contractor pools often in lower-wage regions; conditions and compensation have been controversial. Most labs now combine RLHF with other techniques (Constitutional AI, debate, process rewards) to mitigate these.
