What is Inference Cost?
Why your AI bill adds up — the economics of running LLMs at scale.
Inference cost is the per-token price of running a trained LLM to generate outputs, typically billed separately for input (prompt) tokens and output (response) tokens. In 2026 it ranges from ~$0.10 per million tokens for small open-source models up to $75 per million output tokens for frontier proprietary models. Inference cost — not training cost — dominates the economics of production AI applications at scale.
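The arithmetic behind the definition is simple: multiply each token count by its per-million price. A minimal sketch, using the Claude 4.5 Sonnet rates quoted in this article ($3/M input, $15/M output); the helper name is our own, not a provider API:

```python
# Hypothetical helper: estimate the dollar cost of one LLM request
# from per-million-token prices (here, $3/M input and $15/M output).

def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Return the cost in dollars of a single request."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# A 2,000-token prompt with a 500-token response:
cost = request_cost(2_000, 500, input_price_per_m=3.0, output_price_per_m=15.0)
print(f"${cost:.4f}")  # $0.0135 per request
```

At a million such requests per month, that $0.0135 becomes a $13,500 bill, which is why per-token pricing dominates production economics.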
In depth
Examples
- Anthropic Claude 4.5 Sonnet — $3 per million input tokens, $15 per million output tokens as of early 2026; the most common workhorse model
- Anthropic Claude Opus 4.5 — $15 per million input, $75 per million output; used for the hardest tasks
- OpenAI GPT-5 family — tiered pricing across GPT-5 Pro, GPT-5, and GPT-5 Mini, similar in structure to Anthropic's
- Google Gemini 2.5 Pro — ~$1.25 per million input, $10 per million output; price-competitive in the workhorse tier
- DeepSeek V3.2 via API providers — ~$0.27/$1.10 per million; a strong open-source alternative at roughly 10x lower cost
- Prompt caching discounts — Anthropic and OpenAI both offer a 90% discount on cache hits for repeated prefixes
- Batch APIs — OpenAI Batch and Anthropic Message Batches cut costs 50% for non-latency-sensitive workloads
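The last two examples compound. A sketch of the combined effect, assuming the 90% cache-hit discount and the 50% batch discount multiply (check your provider's docs for exact stacking rules):

```python
# Sketch: effective input cost for a workload with a large cached prefix,
# combining a 90% cache-hit discount with a 50% batch discount.
# Assumption: the two discounts stack multiplicatively.

def effective_input_cost(tokens: int, price_per_m: float,
                         cached_fraction: float = 0.0,
                         batched: bool = False) -> float:
    cached = tokens * cached_fraction * price_per_m * 0.10    # 90% off cache hits
    uncached = tokens * (1 - cached_fraction) * price_per_m   # full price
    cost = (cached + uncached) / 1_000_000
    return cost * 0.5 if batched else cost                    # 50% batch discount

# 100K-token prompt, 80% of it a cached system prefix, run via batch API:
print(effective_input_cost(100_000, 3.0, cached_fraction=0.8, batched=True))
# 0.042 — versus 0.30 at list price, a ~7x reduction
```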
Related terms
Frequently asked questions
Why are output tokens more expensive than input tokens?
Output generation is serial — the model predicts one token at a time, and each token requires a full forward pass through the network. Input tokens, by contrast, can be processed in parallel in a single forward pass. So a 10K-token input is cheap to process, but a 10K-token output takes 10K sequential steps and proportionally more compute and latency. Providers price accordingly, typically charging 3-5x more per output token than per input token.
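A toy latency model makes the asymmetry concrete. The timings below are illustrative assumptions, not measurements of any real model:

```python
# Toy model of the prefill/decode asymmetry: input tokens are processed
# in parallel prefill passes, while each output token needs its own
# sequential decode pass. Constants are assumed for illustration only.

PREFILL_MS_PER_1K = 50     # assumed: 1K input tokens per 50 ms parallel pass
DECODE_MS_PER_TOKEN = 20   # assumed: one sequential pass per output token

def latency_ms(input_tokens: int, output_tokens: int) -> float:
    prefill = (input_tokens / 1_000) * PREFILL_MS_PER_1K
    decode = output_tokens * DECODE_MS_PER_TOKEN
    return prefill + decode

print(latency_ms(10_000, 0))   # 500.0 ms — 10K tokens in, processed in parallel
print(latency_ms(0, 10_000))   # 200000.0 ms — 10K tokens out, one step each
```

Under these assumptions the same token count costs 400x more wall-clock time on the output side, which is the shape (if not the exact ratio) of why output tokens carry the premium.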
How do I reduce inference costs in production?
Six main levers, in order of typical impact. (1) Prompt caching — for static prefixes, saves 5-10x on input tokens. (2) RAG over long context — retrieve only what's needed, not everything. (3) Model routing — cheap model for easy queries, expensive for hard. (4) Batch APIs — 50% off for async work. (5) Output length control — ask for concise responses; a prompt saying 'respond in under 100 words' cuts output tokens dramatically. (6) Consider smaller models — Haiku 4.5, GPT-5 Mini, Gemini 2.5 Flash handle many tasks at 5-20x cheaper than frontier models.
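Lever (3), model routing, can be sketched in a few lines. The heuristic classifier and model identifiers below are placeholders for illustration; production routers typically use a small classifier model rather than keyword matching:

```python
# Minimal model-routing sketch: send easy queries to a cheap model and
# hard ones to a frontier model. The length/keyword heuristic and the
# model names are illustrative assumptions, not a real routing policy.

CHEAP_MODEL = "claude-haiku-4-5"    # assumed identifier for the cheap tier
FRONTIER_MODEL = "claude-opus-4-5"  # assumed identifier for the frontier tier

HARD_KEYWORDS = {"prove", "refactor", "architecture", "debug"}

def route(query: str) -> str:
    words = query.lower().split()
    is_hard = len(words) > 200 or any(w in HARD_KEYWORDS for w in words)
    return FRONTIER_MODEL if is_hard else CHEAP_MODEL

print(route("What's the capital of France?"))             # claude-haiku-4-5
print(route("refactor this module to remove the cycle"))  # claude-opus-4-5
```

Even a crude router pays off when, as in the pricing above, the gap between tiers is 5-20x per token.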
What's the cost difference between ChatGPT Plus and raw API inference?
ChatGPT Plus ($20/month) gives you ~150 messages every few hours on GPT-5 — very roughly $0.04-0.15 per message depending on length. Raw API at GPT-5 pricing would cost a similar amount or less per message, but the Plus plan bundles capped usage with no surprise bills. For high-volume users, API access plus bring-your-own-client tools like Claude Code or Cursor often ends up cheaper per message than the consumer subscription, and is more flexible. Tycoon bills at a flat subscription ($Y/month) that bundles inference cost for normal usage.
Are open-source models cheaper than API models?
On paper, yes; in practice, it depends. Llama 3.3 70B or DeepSeek V3.2 served via Together, Fireworks, or Groq runs at roughly 10-20% of the cost of Claude 4.5 Sonnet. Self-hosting can be cheaper still at high volume (you pay for GPUs, not per token), but it adds ops complexity, latency variability, and capacity management that most orgs underestimate. Rule of thumb: below ~1B tokens/month, use hosted APIs (proprietary or open-source via vendors); above that, model the math on self-hosting against your specific latency and reliability needs.
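"Model the math" can start as a one-line break-even calculation. Both numbers below are assumptions to be replaced with your own quotes; the real decision also has to price in the ops overhead noted above:

```python
# Back-of-envelope break-even: monthly token volume above which a fixed
# GPU bill beats hosted per-token pricing on raw cost. Both constants
# are illustrative assumptions, not quotes.

HOSTED_PRICE_PER_M = 1.10    # assumed: blended $/M tokens from a hosted vendor
GPU_MONTHLY_COST = 20_000.0  # assumed: monthly cost of a self-hosted GPU cluster

def breakeven_tokens_per_month() -> float:
    """Token volume at which self-hosting matches hosted cost."""
    return GPU_MONTHLY_COST / HOSTED_PRICE_PER_M * 1_000_000

print(f"{breakeven_tokens_per_month():,.0f} tokens/month")
```

Under these particular assumptions the raw-compute break-even lands around 18B tokens/month — well above the ~1B rule of thumb, which is exactly why the rule says to model your own numbers rather than assume self-hosting wins.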
How is inference cost trending?
Declining ~40-60% per year at each quality tier for the last three years, and the trend is continuing. Claude 3 Opus in 2024 cost $15/$75 per million; the equivalent quality today is Claude 4.5 Sonnet at $3/$15 — 5x cheaper in two years. The causes: architectural improvements (smaller effective models matching larger old ones), hardware improvements (H100, B200 GPUs, custom accelerators like Trainium and TPU v5), and intense competitive pricing. This makes token-hungry architectures (multi-agent systems, long-context RAG, aggressive tool use) increasingly affordable year over year.
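The decline compounds, which is what makes planning around it worthwhile. The projection below is just the arithmetic of the ~40-60% annual drop applied to the article's $15/M output price, not a forecast of any specific provider's pricing:

```python
# Compounding the article's ~40-60% annual price decline. Purely
# illustrative arithmetic; 50%/year is an assumed midpoint.

def projected_price(price: float, annual_decline: float, years: int) -> float:
    return price * (1 - annual_decline) ** years

# Claude 4.5 Sonnet output tokens at $15/M, declining 50%/year:
for y in range(4):
    print(y, round(projected_price(15.0, 0.50, y), 2))
# 0 15.0
# 1 7.5
# 2 3.75
# 3 1.88
```

A token budget that looks extravagant today is, on this trend, a rounding error in two or three years — the economic case behind token-hungry architectures.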