What is LLM Fine-Tuning?

When prompting isn't enough — teach the model directly.

Updated Apr 2026
Short answer

Fine-tuning is the process of continuing to train a pretrained LLM on a smaller, task-specific dataset so the model internalizes a style, domain, or output format. It updates model weights, unlike prompting which leaves the model unchanged. Modern fine-tuning uses parameter-efficient methods like LoRA to update a small fraction of weights, making it affordable at $10-$500 per run rather than the millions required for pretraining.

In depth

Pretrained LLMs like Claude 4.5 or GPT-5 come out of the factory knowing general language and broad world knowledge. Fine-tuning teaches them specific behaviors that prompting can only approximate — your brand voice, your API response format, a rare domain vocabulary, or the nuance of your company's tone. The model goes from 'smart generalist' to 'smart specialist for this task' while keeping most of its general capability.

There are four main flavors:

  • Supervised fine-tuning (SFT): train on input-output pairs that show the model what you want; 500-5000 examples is typical. This is the bread-and-butter approach — most 'fine-tunes' in production are SFT.
  • Parameter-efficient fine-tuning (PEFT), usually LoRA or QLoRA: only a tiny fraction of weights get updated, making training 10-100x cheaper and letting you store dozens of LoRA adapters per base model. Open-source PEFT is standard practice in 2026.
  • Reinforcement learning from human feedback (RLHF): train a reward model on pairs of better/worse outputs, then optimize the LLM against that reward with RL. Expensive, but it produces the best quality for subjective tasks; this is how ChatGPT became ChatGPT.
  • Direct preference optimization (DPO), introduced by Rafailov et al. (2023): a simpler alternative to RLHF that optimizes directly on preference pairs without a separate reward model. DPO has largely replaced classical RLHF for alignment fine-tuning because it's simpler and nearly as effective.

Fine-tuning is not always the right answer. The two most common mistakes are fine-tuning when you should have used RAG (you need the model to know current facts, not new behaviors) and fine-tuning when you should have prompted harder (you have 10 examples, which is not enough). A rough decision rule: if the task is 'produce outputs in this specific style/format I can show you' and you have 200+ examples, fine-tune. If the task is 'answer questions using my current knowledge base,' use RAG. If you're unsure, start with prompting and few-shot examples; escalate to fine-tuning only when prompting plateaus below your quality bar.

The economics changed dramatically with LoRA and QLoRA. Fully fine-tuning a 70B model requires multiple A100/H100 GPUs and thousands of dollars; LoRA fine-tuning the same model on a single consumer GPU runs $10-$100. Managed fine-tuning services — OpenAI's fine-tuning API, Anthropic's fine-tuning for Claude, Fireworks, Together, Modal — make it a zero-ops operation for open and closed models. For most startups the practical cost is dominated by data preparation (curating and cleaning 1K-5K examples), not compute.

For AI agents, fine-tuning matters for a narrow but important slice of use cases: enforcing an output schema reliably, internalizing a brand voice, or learning a rare API grammar. Tycoon generally doesn't fine-tune the base models that power Astra — the frontier models from Anthropic and OpenAI are strong enough for most tasks and getting stronger. We reach for fine-tuning when a specific skill (programmatic SEO page generation, code refactoring to an unusual house style) needs consistent, format-locked outputs that prompting struggles to enforce.
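Concretely, an SFT dataset is nothing more than serialized input-output pairs, most often JSONL in a messages-style chat schema. A minimal sketch (the company name and replies are invented; check your provider's docs for the exact schema it expects):

```python
import json

# Two hypothetical training examples in a messages-style JSONL schema:
# each line is one conversation the model should learn to reproduce.
examples = [
    {"messages": [
        {"role": "system", "content": "You are a support agent for Acme."},
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "Go to Settings > Security and click 'Reset password'."},
    ]},
    {"messages": [
        {"role": "system", "content": "You are a support agent for Acme."},
        {"role": "user", "content": "Can I export my data?"},
        {"role": "assistant", "content": "Yes: Settings > Data > Export sends you a CSV."},
    ]},
]

def to_jsonl(rows):
    """Serialize examples to JSONL, one JSON object per line."""
    return "\n".join(json.dumps(r) for r in rows)

jsonl = to_jsonl(examples)
```

The system prompt repeats in every row on purpose: the model learns to associate that exact prefix with the target behavior.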

Examples

  • GPT-3.5-turbo fine-tuned on 500 customer-support conversation examples to match a company's tone
  • Llama 3.1 70B + LoRA trained on 2000 legal briefs to produce style-matched first drafts
  • Code model fine-tuned on a proprietary codebase to enforce internal naming and architecture conventions
  • Classification fine-tune: 5000 labeled examples of spam vs real, trained on a small base model for cheap inference
  • DPO fine-tune on 3000 preference pairs ('reply A is better than reply B') to improve helpfulness without RLHF infrastructure
  • Medical fine-tune: training on PubMedQA + MedQA gives a base model clinical reasoning chops it didn't have out of the box
  • Anthropic's Constitutional AI — a specific form of RLHF/RLAIF fine-tuning used to align Claude's behavior
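The DPO loss from Rafailov et al. (2023) is simple enough to compute directly from log-probabilities, with no reward model involved. A toy sketch with made-up log-prob values (real training sums token log-probs from the policy and a frozen reference model):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * (policy margin - reference margin))."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# If the policy already prefers the chosen reply more strongly than the
# reference model does, the margin is positive and the loss drops below log(2).
loss = dpo_loss(logp_chosen=-5.0, logp_rejected=-9.0,
                ref_logp_chosen=-6.0, ref_logp_rejected=-7.0)
```

When policy and reference agree exactly, the margin is zero and the loss sits at log(2); training pushes the policy to widen the gap between chosen and rejected replies.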

Frequently asked questions

Should I fine-tune or use RAG?

RAG for facts, fine-tuning for behavior. If the problem is 'the model doesn't know my company's pricing page' — use RAG. If the problem is 'the model writes in the wrong tone, produces the wrong format, or can't follow my API grammar' — fine-tune. Most production systems use both. Fine-tune the model to speak in your style and produce your output format, then use RAG to inject current facts. Don't fine-tune on facts that change weekly; the model will drift and you'll have to retrain constantly. Don't use RAG for stylistic constraints; it's unreliable and expensive.
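The rule of thumb can be written down as a crude decision helper. The 200-example threshold follows the heuristic above; everything here is illustrative, not canonical:

```python
def choose_approach(need_current_facts: bool,
                    need_style_or_format: bool,
                    n_examples: int) -> str:
    """Illustrative heuristic: RAG for facts, fine-tuning for behavior,
    plain prompting when you lack the data to fine-tune."""
    if need_current_facts and not need_style_or_format:
        return "rag"
    if need_style_or_format and n_examples >= 200:
        # Most production systems combine both when facts also change.
        return "fine-tune + rag" if need_current_facts else "fine-tune"
    return "prompting"  # escalate later if quality plateaus
```

A support bot that must cite this week's pricing *and* match your tone lands in the "fine-tune + rag" branch, which matches how most production systems end up.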

How much data do I need to fine-tune?

Depends on the task. For output-format or style fine-tuning, 50-500 high-quality examples is often enough — quality beats quantity. For teaching new capabilities or domains, 1000-10000 examples is typical. For instruction-tuning a base model from scratch, 50K-1M. The biggest mistake is fine-tuning on too few low-quality examples and concluding fine-tuning 'doesn't work' — in reality the data was the problem. Invest heavily in data cleaning: consistent formatting, accurate labels, representative diversity. A well-curated 200-example dataset beats a messy 2000-example one nearly every time.
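A few mechanical hygiene checks catch most "the data was the problem" failures before any GPU time is spent: missing targets, empty replies, exact duplicates. A sketch over the messages-style schema (checks are illustrative, not exhaustive):

```python
def audit_examples(examples):
    """Return a list of (index, issue) problems found in a
    messages-style training set."""
    problems, seen = [], set()
    for i, ex in enumerate(examples):
        msgs = ex.get("messages", [])
        if not msgs or msgs[-1].get("role") != "assistant":
            problems.append((i, "no assistant target"))
            continue
        if not msgs[-1].get("content", "").strip():
            problems.append((i, "empty assistant reply"))
        key = tuple((m["role"], m["content"]) for m in msgs)
        if key in seen:
            problems.append((i, "exact duplicate"))
        seen.add(key)
    return problems
```

Run it on every export before training; a clean pass is cheap insurance against retraining on a silently broken dataset.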

How much does fine-tuning cost?

Three cost lines. Compute: $10-$500 for LoRA on open-source models, $5-$100 for managed services like OpenAI fine-tuning (pricing varies by base model and tokens). Data preparation: often the biggest cost in staff time — curating 500 examples is days of work. Inference: fine-tuned models are typically charged at a premium vs the base model (OpenAI charges 2x output on fine-tuned models; self-hosted LoRA adds near-zero cost). For most startups the total budget to go from zero to a production fine-tune is under $1K on compute but 1-4 weeks of staff time. Plan accordingly.
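Compute cost scales linearly with trained tokens, so a back-of-envelope estimate is easy. The per-token rate below is a placeholder, not any provider's actual price; substitute your own:

```python
def training_cost_usd(n_examples, avg_tokens_per_example, epochs,
                      price_per_million_tokens):
    """Back-of-envelope fine-tuning compute cost:
    total trained tokens x per-token rate."""
    total_tokens = n_examples * avg_tokens_per_example * epochs
    return total_tokens / 1_000_000 * price_per_million_tokens

# 2,000 examples x 800 tokens x 3 epochs at a hypothetical $8/M trained tokens:
cost = training_cost_usd(2_000, 800, 3, price_per_million_tokens=8.0)
```

At these assumed numbers the run lands around $38 — which is the point: for typical dataset sizes, compute is noise next to the staff time spent curating the examples.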

What's the difference between LoRA and full fine-tuning?

Full fine-tuning updates all weights in the model — every one of billions of parameters. LoRA (Low-Rank Adaptation, Hu et al. 2021) adds small trainable matrices to a few layers and freezes the rest, typically updating under 1% of parameters. LoRA runs 10-100x cheaper, needs far less memory, and produces tiny adapter files (tens of MB) that can be swapped in/out. Quality is 95-99% of full fine-tuning on most tasks. For almost every startup use case, LoRA or QLoRA (quantized LoRA, even cheaper) is the right choice. Full fine-tuning is reserved for teams training foundation models from scratch or doing substantial base-model capability work.
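The parameter savings fall straight out of the low-rank factorization: instead of updating a d x d weight matrix, LoRA trains two thin matrices B (d x r) and A (r x d) and applies W + (alpha/r)·BA. A numpy sketch with toy dimensions (real LoRA attaches these to specific layers, e.g. attention projections):

```python
import numpy as np

d, r, alpha = 1024, 4, 8            # hidden size, LoRA rank, scaling factor
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))     # frozen pretrained weight
B = np.zeros((d, r))                # standard LoRA init: B starts at zero...
A = rng.standard_normal((r, d))     # ...so W_adapted == W at step 0

W_adapted = W + (alpha / r) * (B @ A)

full_params = d * d                 # what full fine-tuning would train
lora_params = d * r + r * d         # what LoRA actually trains
fraction = lora_params / full_params  # = 2r/d, under 1% at these sizes
```

The adapter is just B and A, which is why a fine-tune ships as a file of tens of megabytes that can be hot-swapped onto a frozen base model.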

Do fine-tuned models still need prompts?

Yes, and getting the prompt right matters more than people expect. Fine-tuning teaches the model to respond to a particular input format; you need to keep using that format at inference. A model fine-tuned with a 'system prompt: you are a customer support agent' prefix will underperform when called without that prefix. Keep your fine-tuning input format and your production input format identical. Also: fine-tuning doesn't eliminate hallucination — it makes the model more confident within your training distribution, which can actually increase hallucination on out-of-distribution queries. Retain evals that test edge cases.
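A cheap guard against train/serve drift is to build both prompts through one shared function, so the prefix cannot diverge. A sketch (the system prompt and company name are invented):

```python
SYSTEM_PROMPT = "You are a customer support agent for Acme."

def build_messages(user_text: str) -> list[dict]:
    """Single source of truth for the prompt format, called both when
    generating training data and when serving production requests."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_text},
    ]

# Training rows and live requests share the same prefix by construction.
train_prefix = build_messages("example question")[0]
serve_prefix = build_messages("live question")[0]
assert train_prefix == serve_prefix
```

If the prompt ever needs to change, regenerate the training set from the same function and retrain, rather than letting the two formats quietly fork.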

Run your one-person company.

Hire your AI team in 30 seconds. Start for free.

Free to start · No credit card required · Set up in 30 seconds