What is Reinforcement Learning from Human Feedback (RLHF)?
The technique that turned GPT-3 into ChatGPT — teaching language models to be helpful.
RLHF is a training technique in which human annotators rank or compare multiple LLM outputs, those rankings train a reward model that predicts human preferences, and the LLM is then fine-tuned via reinforcement learning to maximize that reward model's score. The idea predates large language models (it appeared in preference-learning research as early as 2017), but OpenAI's InstructGPT and ChatGPT popularized it in 2022 as the core method for making raw language models helpful and harmless rather than just statistical next-token predictors.
Updated Apr 2026
In depth
A pretrained base LLM, such as GPT-3 or the Llama 3.1 base model, is a next-token predictor. It completes text in the style of its training data, which is largely text from the open internet, so asking 'how do I file my taxes?' might get you a plausible continuation of an article rather than an actual answer to your question. It knows an enormous amount but doesn't know that 'answer the user's question helpfully' is what you want. RLHF is how you teach it.
The technique has three stages, formalized by OpenAI in the 2022 InstructGPT paper. (1) Supervised fine-tuning (SFT): human demonstrators write ideal responses to sample prompts, and the base model is fine-tuned on those demonstrations. This alone teaches the model the question-and-answer format, but not which of several plausible answers people prefer. (2) Reward model training: for each prompt, multiple model outputs are generated and human annotators rank them from best to worst. A separate reward model (often a smaller transformer that outputs a single score) is trained to predict which output a human would prefer, so it can score any output numerically. (3) Reinforcement learning (usually PPO, Proximal Policy Optimization): the SFT model is further fine-tuned to maximize the reward model's score, with a KL-divergence penalty that keeps it from drifting too far from the SFT baseline.
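The stage 3 objective can be sketched in a few lines. This is a minimal illustration, not the InstructGPT implementation: the function name `kl_penalized_reward`, the coefficient `beta`, and the example numbers are all assumptions, and the penalty is applied once per response here, whereas real PPO setups commonly apply it per token.

```python
def kl_penalized_reward(reward_score, policy_logprobs, sft_logprobs, beta=0.1):
    """Reward-model score minus a penalty for drifting away from the SFT model.

    policy_logprobs and sft_logprobs are per-token log-probabilities of the
    same sampled response under the current policy and the frozen SFT model.
    All names and the default beta are hypothetical, chosen for illustration.
    """
    if len(policy_logprobs) != len(sft_logprobs):
        raise ValueError("log-probability sequences must have the same length")
    # Summed log-ratio: a single-sample estimate of KL(policy || SFT)
    # for this response.
    log_ratio = sum(p - s for p, s in zip(policy_logprobs, sft_logprobs))
    return reward_score - beta * log_ratio


# A response the policy now finds much more likely than the SFT model did
# is penalized, even though the reward model scored it the same.
unchanged = kl_penalized_reward(2.0, [-0.5, -0.2], [-0.5, -0.2])
drifted = kl_penalized_reward(2.0, [-0.5, -0.2], [-1.0, -0.7])
print(unchanged, drifted)
```

A larger `beta` keeps the fine-tuned model closer to the SFT baseline; a smaller one lets it chase the reward model's score more aggressively.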
The magic is in stage 2: the reward model. It turns a large collection of messy, qualitative human preferences into a numerical signal the RL process can optimize against. Without it, you'd need a human in the loop of every training step, which doesn't scale. With it, the reward model can score as many candidate outputs as training requires, and the RL process iteratively pushes the LLM toward outputs the reward model (and, to the extent the reward model is accurate, humans) prefers.
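One common way to turn rankings into a trainable signal is a pairwise (Bradley-Terry style) loss: expand each ranking into preferred/rejected pairs and penalize the reward model whenever it scores the rejected output higher. The sketch below assumes that formulation; the function names are hypothetical, and a real reward model would produce the scores from a neural network rather than take them as plain numbers.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_outputs):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations() preserves input order, so the first element of each
    # pair is always the one the annotator ranked higher.
    return list(combinations(ranked_outputs, 2))


def pairwise_preference_loss(score_preferred, score_rejected):
    """-log(sigmoid(score_preferred - score_rejected)), computed stably."""
    margin = score_preferred - score_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical ranking of three outputs for one prompt, best first.
pairs = ranking_to_pairs(["answer_a", "answer_b", "answer_c"])
print(pairs)

# The loss is small when the reward model agrees with the annotator
# and large when it disagrees.
print(pairwise_preference_loss(2.0, 0.0), pairwise_preference_loss(0.0, 2.0))
```

Minimizing this loss over many pairs pushes the reward model to assign higher scores to the outputs annotators preferred, which is what lets it stand in for a human during stage 3.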
RLHF, or a close preference-based variant of it, is how most major instruction-tuned models were made helpful: OpenAI's GPT-3.5, GPT-4, and GPT-5; Anthropic's Claude family; Google's Gemini; Meta's Llama 2 and 3 instruction variants; Mistral Instruct; Grok. The specifics differ: Anthropic's Constitutional AI adds self-critique rounds in which an AI judge applies rules from a written constitution, and OpenAI has experimented with Process Reward Models that score reasoning steps rather than final outputs. The core loop of collecting preferences and optimizing the model against them is shared.
RLHF has known problems. (1) Reward hacking: the LLM learns to optimize the reward model in ways humans wouldn't actually prefer — e.g., producing confidently wrong answers because the reward model rated confident tones highly, or sycophancy because the reward model learned human annotators liked being agreed with. (2) Annotator bias: the preferences the reward model learns reflect the biases of the annotators (often contractor pools with specific demographic and cultural characteristics). (3) Cost: large-scale RLHF requires thousands of human annotation hours — out of reach for individuals but routine for labs with millions of dollars of training budget. (4) Capability regression: naive RLHF can make models worse at skills humans don't rate — harder-to-evaluate reasoning, rare coding skills — if the ranking tasks don't include them.
Recent advances include DPO (Direct Preference Optimization, 2023), which skips the separate reward model by training directly on preference pairs, often making RLHF-style training cheaper and more stable; RLAIF (Reinforcement Learning from AI Feedback), which uses LLMs to do the ranking that humans used to do, scaling annotation far beyond what human labeling budgets allow; and Constitutional AI, which supplements human feedback with AI-generated critiques guided by a written constitution. All of these are variations on the core RLHF idea.
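To make the DPO idea concrete, here is a minimal sketch of its per-pair loss, assuming the standard formulation in which the policy is compared against a frozen reference model (typically the SFT model). The function and parameter names and the default `beta` are illustrative assumptions; the inputs are summed log-probabilities of each full response.

```python
import math


def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one preference pair (hypothetical helper).

    Each *_lp argument is the total log-probability of a response under
    either the policy being trained or the frozen reference model.
    """
    # How much more (or less) likely each response has become
    # relative to the reference model.
    chosen_shift = policy_chosen_lp - ref_chosen_lp
    rejected_shift = policy_rejected_lp - ref_rejected_lp
    logits = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(logits)), computed in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Before any training the policy equals the reference, so the loss
# starts at log(2) for every pair.
start = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Raising the chosen response's likelihood and lowering the rejected
# one's reduces the loss.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
print(start, improved)
```

No reward model appears anywhere in the computation: the preference pair and the two sets of log-probabilities are enough, which is why DPO removes stage 2 and the RL loop from the pipeline.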
For builders: you almost never run RLHF yourself. You use a model that has already been through RLHF (Claude, GPT, Gemini, Llama Instruct) and shape its behavior through prompting and optional fine-tuning. RLHF is what the model provider did to produce a model that follows your instructions in the first place.