What is LLM Fine-Tuning?
When prompting isn't enough — teach the model directly.
Fine-tuning is the process of continuing to train a pretrained LLM on a smaller, task-specific dataset so the model internalizes a style, domain, or output format. It updates model weights, unlike prompting which leaves the model unchanged. Modern fine-tuning uses parameter-efficient methods like LoRA to update a small fraction of weights, making it affordable at $10-$500 per run rather than the millions required for pretraining.
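The definition above can be made concrete with a toy model. This is a conceptual sketch only (a one-parameter model, not an LLM, and all data here is made up): fine-tuning is just more gradient descent, starting from pretrained weights instead of random ones, with a small learning rate and a small dataset.

```python
def sgd_step(w, data, lr):
    """One gradient step for a toy model y = w * x with squared-error loss."""
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad

# "Pretrained" weight, imagined as learned earlier on a large generic corpus.
w_pretrained = 1.0

# Small task-specific dataset: this task wants y ≈ 2 * x.
task_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

# Fine-tuning: a few low-learning-rate steps on the task data,
# starting from the pretrained weight rather than from scratch.
w = w_pretrained
for _ in range(50):
    w = sgd_step(w, task_data, lr=0.01)

print(w)  # moved from 1.0 toward the task optimum of 2.0
```

The same picture scales up: an LLM fine-tune runs this loop over billions of parameters, which is why parameter-efficient methods like LoRA matter.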
Examples
- GPT-3.5-turbo fine-tuned on 500 customer-support conversations to match a company's tone
- Llama 3.1 70B + LoRA trained on 2,000 legal briefs to produce style-matched first drafts
- A code model fine-tuned on a proprietary codebase to enforce internal naming and architecture conventions
- Classification fine-tune: 5,000 labeled examples of spam vs. legitimate mail, trained on a small base model for cheap inference
- DPO fine-tune on 3,000 preference pairs ('reply A is better than reply B') to improve helpfulness without RLHF infrastructure
- Medical fine-tune: training on PubMedQA and MedQA gives a base model clinical reasoning it didn't have generically
- Anthropic's Constitutional AI: a specific form of RLHF/RLAIF fine-tuning used to align Claude's behavior
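The DPO example in the list above is worth unpacking, since the whole method reduces to a single per-pair loss (Rafailov et al. 2023). A minimal sketch, with illustrative log-probability values; in practice the inputs come from the policy model and a frozen reference model scoring the chosen and rejected replies:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log(sigmoid(beta * margin)),
    where the margin is how much more the policy prefers the chosen
    reply than the frozen reference model does. beta limits drift
    from the reference."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When policy and reference agree exactly, the margin is 0 and the
# loss sits at log(2); a positive margin pushes the loss below that.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

Minimizing this across pairs is what "improve helpfulness without RLHF infrastructure" means: no reward model, no PPO loop, just a supervised-style objective.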
Frequently asked questions
Should I fine-tune or use RAG?
RAG for facts, fine-tuning for behavior. If the problem is 'the model doesn't know my company's pricing page' — use RAG. If the problem is 'the model writes in the wrong tone, produces the wrong format, or can't follow my API grammar' — fine-tune. Most production systems use both. Fine-tune the model to speak in your style and produce your output format, then use RAG to inject current facts. Don't fine-tune on facts that change weekly; the model will drift and you'll have to retrain constantly. Don't use RAG for stylistic constraints; it's unreliable and expensive.
How much data do I need to fine-tune?
It depends on the task. For output-format or style fine-tuning, 50-500 high-quality examples are often enough; quality beats quantity. For teaching new capabilities or domains, 1,000-10,000 examples is typical. For instruction-tuning a base model from scratch, expect 50K-1M. The biggest mistake is fine-tuning on a small set of low-quality examples and concluding that fine-tuning 'doesn't work' when the data was the real problem. Invest heavily in data cleaning: consistent formatting, accurate labels, representative diversity. A well-curated 200-example dataset beats a messy 2,000-example one nearly every time.
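What "consistent formatting" means in practice: every training example should follow the same schema. A minimal sketch using the chat-format JSONL layout that OpenAI's fine-tuning API accepts (other stacks use similar schemas); the AcmeCo persona and message contents are illustrative:

```python
import json

# One training example in chat JSONL format. A real dataset would
# have hundreds of these, one JSON object per line.
examples = [
    {"messages": [
        {"role": "system", "content": "You are AcmeCo's support agent."},
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "Go to Settings, then Security, then Reset Password."},
    ]},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Consistency check: every example has the same roles in the same
# order. Inconsistent formatting is the most common data bug.
with open("train.jsonl") as f:
    for line in f:
        roles = [m["role"] for m in json.loads(line)["messages"]]
        assert roles == ["system", "user", "assistant"], roles
```

A check like this is cheap to run before every training job and catches most of the "fine-tuning doesn't work" cases early.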
How much does fine-tuning cost?
Three cost lines. Compute: $10-$500 for LoRA on open-source models, $5-$100 for managed services like OpenAI fine-tuning (pricing varies by base model and tokens). Data preparation: often the biggest cost in staff time — curating 500 examples is days of work. Inference: fine-tuned models are typically charged at a premium vs the base model (OpenAI charges 2x output on fine-tuned models; self-hosted LoRA adds near-zero cost). For most startups the total budget to go from zero to a production fine-tune is under $1K on compute but 1-4 weeks of staff time. Plan accordingly.
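The compute line is easy to estimate up front, because managed services bill per trained token. A back-of-the-envelope sketch; the $0.008-per-1K-tokens price below is an assumed placeholder, not a quoted rate, so check your provider's current pricing for the base model you pick:

```python
def finetune_compute_cost(n_examples, avg_tokens_per_example, n_epochs,
                          price_per_1k_tokens):
    """Estimated compute cost in dollars for a managed fine-tuning run.
    Trained tokens = examples x tokens-per-example x epochs."""
    trained_tokens = n_examples * avg_tokens_per_example * n_epochs
    return trained_tokens / 1000 * price_per_1k_tokens

# 500 support conversations, ~800 tokens each, 3 epochs,
# at an assumed $0.008 per 1K training tokens:
print(round(finetune_compute_cost(500, 800, 3, 0.008), 2))  # → 9.6
```

Note how small that number is next to the staff time for curating those 500 examples, which is the point of the paragraph above.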
What's the difference between LoRA and full fine-tuning?
Full fine-tuning updates all weights in the model — every one of billions of parameters. LoRA (Low-Rank Adaptation, Hu et al. 2021) adds small trainable matrices to a few layers and freezes the rest, typically updating under 1% of parameters. LoRA runs 10-100x cheaper, needs far less memory, and produces tiny adapter files (tens of MB) that can be swapped in/out. Quality is 95-99% of full fine-tuning on most tasks. For almost every startup use case, LoRA or QLoRA (quantized LoRA, even cheaper) is the right choice. Full fine-tuning is reserved for teams training foundation models from scratch or doing substantial base-model capability work.
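The "under 1% of parameters" claim falls straight out of the arithmetic. LoRA freezes a weight matrix W of shape (d_out, d_in) and trains the update as a product B @ A with A of shape (rank, d_in) and B of shape (d_out, rank). A small sketch of the parameter counts; the 4096x4096 projection size is illustrative of a 7B-class model:

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters for one weight matrix: full fine-tuning
    trains all of W; LoRA trains only the low-rank factors A and B."""
    full = d_out * d_in
    lora = rank * d_in + d_out * rank
    return full, lora

# One 4096x4096 attention projection at rank 8:
full, lora = lora_param_counts(4096, 4096, 8)
print(full, lora, f"{lora / full:.2%}")  # 16777216 65536 0.39%
```

Summed over all adapted layers, those small A/B factors are the tens-of-MB adapter file mentioned above, which is why adapters can be swapped in and out per customer or per task.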
Do fine-tuned models still need prompts?
Yes, and getting the prompt right matters more than people expect. Fine-tuning teaches the model to respond to a particular input format; you need to keep using that format at inference. A model fine-tuned with a 'system prompt: you are a customer support agent' prefix will underperform when called without that prefix. Keep your fine-tuning input format and your production input format identical. Also: fine-tuning doesn't eliminate hallucination — it makes the model more confident within your training distribution, which can actually increase hallucination on out-of-distribution queries. Retain evals that test edge cases.
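The train/inference format-drift problem above has a simple structural fix: define the prompt template once and route both the training-data builder and the production call through it. A minimal sketch; the AcmeCo system prompt and message contents are illustrative:

```python
SYSTEM = "You are a customer support agent for AcmeCo."

def build_messages(user_query: str) -> list[dict]:
    """Single source of truth for the input format, used at both
    training time and inference time so the two can never drift."""
    return [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": user_query},
    ]

# Training time: every example is built through build_messages(...).
train_example = {
    "messages": build_messages("Where is my order?")
    + [{"role": "assistant", "content": "Let me look that up for you."}]
}

# Inference time: the SAME function, so the fine-tuned format is honored.
prod_messages = build_messages("Where is my order?")
assert prod_messages == train_example["messages"][:2]
```

If the template ever changes, both paths change together, and the mismatch the answer above warns about cannot occur.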