Learn

What is Chain-of-Thought Prompting?

The prompting technique that made LLMs do math.

Free to start · No credit card required · Updated Apr 2026
Short answer

Chain-of-thought (CoT) prompting is a technique where an LLM is instructed to write out its intermediate reasoning steps before producing a final answer. Introduced by Wei et al. at Google Research in 2022, CoT dramatically improves accuracy on arithmetic, commonsense, and symbolic reasoning tasks by forcing the model to decompose problems instead of jumping to an answer.

In depth

Chain-of-thought prompting was the first widely adopted technique to show that you could improve LLM reasoning without retraining the model. The original Wei et al. paper in 2022 demonstrated that asking a large model to 'think step by step' before answering raised GSM8K math-word-problem accuracy from around 18% to over 56% on PaLM 540B. The finding generalized across providers and model families and became the foundation of modern prompt engineering.

There are three common flavors, all still in production use in 2026 alongside newer techniques:

  • Zero-shot CoT appends the literal phrase 'Let's think step by step' to a prompt and relies on the model's pretraining to produce a rationale.
  • Few-shot CoT includes worked examples in the prompt showing the reasoning trace, so the model mimics the pattern.
  • Self-consistency CoT samples multiple reasoning chains with temperature > 0 and takes a majority vote, trading inference cost for accuracy.

CoT works because language models are trained to predict the next token, and more tokens of intermediate work make the final answer more likely to be correct. Each reasoning step constrains the probability distribution of what comes next, which is especially valuable for problems that require multi-step calculation or case analysis. Critically, CoT is emergent at scale: small models (under ~10B parameters) often do worse with CoT than without it, while frontier models like Anthropic Claude 4.5 and OpenAI GPT-5 show strong gains. This is why CoT exploded in usage around 2022: it required models large enough to actually benefit.

The modern successor to raw CoT is the 'reasoning model' pattern introduced by OpenAI o1 in 2024 and extended by Claude 4.5 thinking, DeepSeek R1, and Gemini 2.5 Flash Thinking. These models are trained with reinforcement learning on reasoning traces, so they produce long internal chains of thought automatically and hide the raw trace from the user. Under the hood the mechanism is the same (generate more reasoning tokens, get better answers), but the training is built in rather than prompted.

For AI employees, CoT matters because it turns an LLM from an autocomplete engine into a problem-solver. When an AI CFO needs to reconcile an invoice discrepancy, CoT lets it walk through each line item rather than guessing. When an AI developer debugs a stack trace, CoT lets it enumerate hypotheses in order. Tycoon's AI CEO Astra uses CoT implicitly through Claude 4.5's thinking mode: you give her a strategic question, she works through options before recommending one, and you can optionally see the trace.
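The three flavors can be sketched as plain prompt construction plus a majority vote. This is a minimal illustration in Python with no real model call: the list of answers passed to `self_consistency` stands in for final answers extracted from sampled completions, and the prompt formats are illustrative, not a required template.

```python
from collections import Counter

def zero_shot_cot(question: str) -> str:
    # Zero-shot CoT: append the trigger phrase and rely on the
    # model's pretraining to produce a rationale.
    return f"{question}\n\nLet's think step by step."

def few_shot_cot(question: str, worked_examples: list[tuple[str, str]]) -> str:
    # Few-shot CoT: prepend worked (question, reasoning-plus-answer)
    # examples so the model mimics the reasoning format.
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in worked_examples)
    return f"{shots}\n\nQ: {question}\nA:"

def self_consistency(final_answers: list[str]) -> str:
    # Self-consistency: sample several CoT traces at temperature > 0,
    # extract each final answer, and take the majority vote.
    return Counter(final_answers).most_common(1)[0][0]

# Five sampled traces ended in these answers; the majority wins.
print(self_consistency(["42", "42", "17", "42", "24"]))  # -> 42
```

In practice `zero_shot_cot` or `few_shot_cot` would build the prompt sent to the model, and `self_consistency` would aggregate the answers parsed from several sampled responses.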

Examples

  • Zero-shot: appending 'Let's think step by step' to a math word problem and getting the full derivation
  • Few-shot: showing 3 solved algebra problems with worked steps, then the model solves a 4th in the same format
  • Self-consistency: sampling 10 independent CoT traces for a tricky logic puzzle and taking the majority answer
  • OpenAI o1 and GPT-5 with internal reasoning tokens, where the user never sees the chain but benefits from it
  • Anthropic Claude 4.5 thinking mode that produces an extended-thought block before the visible reply
  • DeepSeek R1 showing its full reasoning trace in the output — cheap, transparent, and popular with developers
  • AI support agents using CoT to diagnose a bug report: 'First check the error, then compare to known issues, then propose a fix'

Related terms

Frequently asked questions

Does chain-of-thought actually make answers more accurate or just longer?

Both, but the accuracy gain is real and measurable. On GSM8K (grade-school math), PaLM 540B went from 18% to 57% accuracy with CoT. On BIG-Bench hard reasoning tasks, gains of 10-30 absolute points are typical. The mechanism is not just 'more text equals better' — randomly padding output does nothing. What works is generating genuine intermediate steps that decompose the problem. On tasks that don't require multi-step reasoning (simple factual recall, translation), CoT doesn't help and can slightly hurt by introducing extraneous detail.

Is chain-of-thought still necessary if I'm using a reasoning model like o1 or Claude 4.5 thinking?

Largely no, for the tasks those models were trained on. Reasoning models produce internal CoT automatically and their training teaches them when longer thinking helps. You usually don't need to prompt 'think step by step' — the model decides. But for tasks outside their training distribution (domain-specific workflows, unusual formats), explicit CoT prompting still helps because it shapes how the model structures its reasoning. The rule of thumb: if you're getting wrong answers from a reasoning model, add explicit CoT instructions describing the reasoning structure you want.

What's the cost difference between CoT and direct answering?

CoT typically produces 5-20x more output tokens than a direct answer, and on most providers output tokens are priced 3-5x higher than input tokens, so for requests whose cost was previously dominated by input, a CoT response can run roughly 15-100x the price of a direct one. For batch scoring or classification at scale this matters a lot. Workarounds: reserve self-consistency for ambiguous cases, cache the reasoning trace when the same question repeats, or switch to a smaller reasoning-trained model such as DeepSeek R1 Distill, which delivers much of the benefit at lower cost. For one-off strategic questions the cost is trivial and CoT is always worth it.
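The back-of-envelope arithmetic above can be made explicit. The ratios here are the illustrative ranges from the answer, not any provider's actual pricing:

```python
def cot_cost_multiplier(token_ratio: float, output_price_ratio: float) -> float:
    # Rough cost multiple of a CoT response versus a direct answer for an
    # input-dominated request: more output tokens, each priced above input.
    return token_ratio * output_price_ratio

# Using the ranges above: 5-20x more tokens, output priced 3-5x input.
print(cot_cost_multiplier(5, 3))   # -> 15.0 (low end)
print(cot_cost_multiplier(20, 5))  # -> 100.0 (high end)
```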

When does CoT hurt performance?

Three scenarios. (1) Small models under ~10B parameters — they don't have enough capability to reason and the extra tokens just add noise. Skip CoT and use direct prompting. (2) Tasks that require no reasoning — sentiment classification, translation, grammar correction — where CoT introduces overconfident rationales. (3) Latency-sensitive interfaces — a 400ms direct answer beats a 4-second CoT answer for autocomplete or voice. The mitigation is to reserve CoT for hard multi-step tasks and use direct prompting elsewhere.
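The reserve-CoT-for-hard-tasks rule amounts to a trivial router. A hypothetical sketch follows; the task categories and set names are illustrative, not part of any real API:

```python
# Illustrative task categories; a real system would classify the request first.
NEEDS_REASONING = {"math", "debugging", "multi_step_planning", "case_analysis"}

def build_prompt(task_type: str, question: str) -> str:
    # Reserve CoT for multi-step tasks; answer simple or latency-sensitive
    # tasks directly, per the three failure scenarios above.
    if task_type in NEEDS_REASONING:
        return f"{question}\n\nLet's think step by step."
    return f"{question}\n\nAnswer directly with no explanation."

print(build_prompt("sentiment", "Is this review positive?"))
```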

Who invented chain-of-thought prompting?

Jason Wei and colleagues at Google Research published the seminal paper 'Chain-of-Thought Prompting Elicits Reasoning in Large Language Models' at NeurIPS 2022. The zero-shot variant 'Let's think step by step' was popularized by Kojima et al. in a concurrent 2022 paper. The technique built on earlier work on scratchpads and rationale generation, but the Wei et al. paper is the one that made it standard practice. Both papers are still among the most-cited in LLM research.

Run your one-person company.

Hire your AI team in 30 seconds. Start for free.

Free to start · No credit card required · Set up in 30 seconds