What is Inference Cost?

Why your AI bill adds up — the economics of running LLMs at scale.

Updated Apr 2026
Short answer

Inference cost is the per-token price of running a trained LLM to generate outputs, typically billed separately for input (prompt) tokens and output (response) tokens. In 2026 it ranges from ~$0.10 per million tokens for small open-source models up to $75 per million output tokens for frontier proprietary models. Inference cost — not training cost — dominates the economics of production AI applications at scale.

In depth

Training a frontier LLM costs hundreds of millions to billions of dollars, but it happens once. Inference happens trillions of times: every ChatGPT query, every Copilot suggestion, every agent turn. At scale, inference is the dominant line item. This is why model providers compete aggressively on inference pricing, and why every production AI architecture eventually becomes an exercise in token optimization.

Typical 2026 pricing tiers:

  • Frontier tier: Claude Opus 4.5 ($15/$75 per million input/output tokens), GPT-5 Pro (similar range). Used for the hardest reasoning, research tasks, and critical decisions.
  • Workhorse tier: Claude 4.5 Sonnet ($3/$15), GPT-5 ($3-5/$15-20), Gemini 2.5 Pro ($1.25/$10). The default for production applications: strong quality at reasonable cost.
  • Cheap tier: Claude Haiku 4.5, GPT-5 Mini, Gemini 2.5 Flash ($0.15-0.50/$0.60-2.50). Used for high-volume simple tasks.
  • Open-source inference: DeepSeek V3.2, Llama 3.3 70B, Qwen 2.5 ($0.10-0.40/$0.30-1.50) through API providers like Together, Groq, and Fireworks.

These prices are declining roughly 50% per year; what's expensive today will be routine next year.

Input vs output pricing matters. Output tokens typically cost 3-5x more than input tokens because generation is serial: the model predicts one token at a time, so latency and compute scale with output length. Input tokens can be processed in parallel and are cheaper. This has architectural consequences: if you are choosing between sending 100K tokens of context to get a short answer and sending a short prompt to get a long answer, the first is usually cheaper and faster.

Prompt caching changes the equation. Anthropic, OpenAI, and Google offer caching for static prefixes: repeated content (system prompts, document corpora, tool definitions) cached on the provider's side costs roughly 10-25% of the normal input rate and has much lower latency. For agents that hit the same large system prompt on every turn, this is a 5-10x cost reduction. Any production agent architecture should structure its prompts aggressively to maximize cache hits.

Batch inference is the other big saver. Most providers offer batch APIs (OpenAI Batch, Anthropic Message Batches) that run asynchronously and cost 50% of the normal rate. If your work isn't latency-sensitive (overnight processing, offline analysis, bulk content generation), batch can cut bills in half.

The token math of a production agent. A typical Tycoon AI employee turn involves a 5-15K-token cached system prompt (billed at the cache-hit rate), 2-10K tokens of retrieved project memory, 500-3,000 tokens of conversation history, 500-3,000 tokens of output, and sometimes 1-5 tool calls, each adding its own tokens. Using Claude 4.5 Sonnet with caching, a typical turn costs $0.02-0.15. A founder having 30 AI-employee interactions per day costs $0.60-4.50/day, well under the $20-100/month pricing tier. Scale to a thousand founders and thousands of interactions per day, and the arithmetic quickly explains why production AI economics demand continuous token optimization.

Multi-model routing, sending easy queries to cheap models and hard ones to expensive ones, can cut bills 3-10x for suitable workloads. Tools like Portkey, OpenRouter, and LiteLLM facilitate routing. Some applications use a "small model first, escalate if uncertain" pattern to minimize frontier-model calls. For long-context workloads, caching plus batching plus routing can collectively cut costs 5-20x versus a naive implementation.

Tycoon's architecture assumes inference cost is the main unit-economics driver. Each AI employee uses a tiered model strategy: Haiku-class for quick classifications, Sonnet-class for most work, and Opus-class for the hardest planning and review. System prompts and project memory are aggressively cached. Most work happens at Sonnet pricing, with Opus calls reserved for true judgment moments, keeping per-employee inference costs within the subscription margin.
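The per-turn arithmetic above can be sketched as a small calculation. A minimal sketch, assuming the Claude 4.5 Sonnet rates quoted in this article and a cache-read price of ~10% of the normal input rate; the token counts are illustrative mid-range assumptions, not measurements:

```python
# Sketch of the per-turn token math described above.
# Rates are the Sonnet-class prices quoted in the text ($ per million tokens);
# the ~10% cache-read rate and all token counts are illustrative assumptions.

INPUT_RATE = 3.00 / 1_000_000       # $ per fresh input token
OUTPUT_RATE = 15.00 / 1_000_000     # $ per output token
CACHE_READ_RATE = 0.10 * INPUT_RATE # cached prefix billed at ~10% of input rate

def turn_cost(cached_prompt: int, fresh_input: int, output: int) -> float:
    """Dollar cost of one agent turn."""
    return (cached_prompt * CACHE_READ_RATE
            + fresh_input * INPUT_RATE
            + output * OUTPUT_RATE)

# A mid-range turn: 10K cached system prompt, 8K of retrieved memory
# plus conversation history, 1.5K tokens of output.
cost = turn_cost(cached_prompt=10_000, fresh_input=8_000, output=1_500)
print(f"${cost:.4f} per turn")            # lands inside the $0.02-0.15 band
print(f"${cost * 30:.2f} per day at 30 turns")
```

Note how caching changes the balance: the 10K-token system prompt contributes only $0.003 here, while the 1.5K output tokens contribute $0.0225.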

Examples

  • Anthropic Claude 4.5 Sonnet — $3 per million input tokens, $15 per million output tokens as of early 2026; most common workhorse model
  • Anthropic Claude Opus 4.5 — $15 per million input, $75 per million output; used for hardest tasks
  • OpenAI GPT-5 family — tiered pricing across GPT-5 Pro, GPT-5, and GPT-5 Mini similar to Anthropic structure
  • Google Gemini 2.5 Pro — ~$1.25 per million input, $10 per million output; competitive on price for workhorse tier
  • DeepSeek V3.2 via API providers — ~$0.27/$1.10 per million; a strong open-source alternative at roughly one-tenth the workhorse-tier price
  • Prompt caching discounts — Anthropic, OpenAI, and Google all discount repeated prefixes; Anthropic's cache reads cost roughly 10% of the normal input rate
  • Batch APIs — OpenAI Batch and Anthropic Message Batches cut costs 50% for non-latency-sensitive workloads
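The price gaps in the list above are easiest to see against a concrete workload. A minimal sketch using the per-million-token figures quoted in this article; the 500M-input / 50M-output monthly workload is an illustrative assumption:

```python
# Compare one monthly workload across the pricing tiers listed above.
# Prices ($ per million input / output tokens) are the early-2026 figures
# quoted in this article; the workload size is an illustrative assumption.

PRICES = {
    "Claude Opus 4.5":     (15.00, 75.00),
    "Claude 4.5 Sonnet":   (3.00, 15.00),
    "Gemini 2.5 Pro":      (1.25, 10.00),
    "DeepSeek V3.2 (API)": (0.27, 1.10),
}

def monthly_cost(input_mtok: float, output_mtok: float,
                 in_price: float, out_price: float) -> float:
    """Dollar cost for a workload given in millions of tokens."""
    return input_mtok * in_price + output_mtok * out_price

# Assumed workload: 500M input tokens, 50M output tokens per month.
for model, (p_in, p_out) in PRICES.items():
    print(f"{model:22s} ${monthly_cost(500, 50, p_in, p_out):>10,.2f}")
```

At this workload, the same month of traffic runs $11,250 on Opus, $2,250 on Sonnet, and $190 on DeepSeek, which is where the "roughly 10x cheaper" open-source figure comes from.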

Frequently asked questions

Why are output tokens more expensive than input tokens?

Output generation is serial — the model predicts one token at a time, with each token requiring a full forward pass through the network. Input tokens can be processed in parallel in a single forward pass. So a 10K-token input is cheap to process, but generating a 10K-token output takes 10K sequential steps and proportionally more compute and latency. Providers price accordingly, typically charging 3-5x more per output token than input token.

How do I reduce inference costs in production?

Six main levers, in order of typical impact. (1) Prompt caching — for static prefixes, saves 5-10x on input tokens. (2) RAG over long context — retrieve only what's needed, not everything. (3) Model routing — cheap model for easy queries, expensive for hard. (4) Batch APIs — 50% off for async work. (5) Output length control — ask for concise responses; a prompt saying 'respond in under 100 words' cuts output tokens dramatically. (6) Consider smaller models — Haiku 4.5, GPT-5 Mini, Gemini 2.5 Flash handle many tasks at 5-20x cheaper than frontier models.
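Lever (3) can be sketched in a few lines. A minimal illustration of the "small model first, escalate if uncertain" pattern; the model functions here are hypothetical stand-ins for real API calls, and the confidence heuristic and threshold are assumptions, not a recommended policy:

```python
# Sketch of "small model first, escalate if uncertain" routing.
# cheap_model / frontier_model are hypothetical stand-ins for real API calls;
# the toy confidence heuristic and 0.7 threshold are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Answer:
    text: str
    confidence: float  # self-reported or heuristic confidence, 0-1

def cheap_model(query: str) -> Answer:
    # Stand-in for a Haiku/Mini/Flash-class call: confident on short questions.
    easy = len(query.split()) < 12 and "?" in query
    return Answer("short answer", 0.9 if easy else 0.4)

def frontier_model(query: str) -> Answer:
    # Stand-in for an Opus/Pro-class call.
    return Answer("careful answer", 0.99)

def route(query: str, threshold: float = 0.7) -> tuple[str, Answer]:
    """Try the cheap model first; escalate only when it is uncertain."""
    first = cheap_model(query)
    if first.confidence >= threshold:
        return ("cheap", first)
    return ("frontier", frontier_model(query))

tier, ans = route("What is 2 + 2?")
print(tier)  # short factual queries stay on the cheap tier
```

In production the confidence signal usually comes from a classifier, token logprobs, or an explicit self-check prompt rather than query length, but the control flow is the same.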

What's the cost difference between ChatGPT Plus and raw API inference?

ChatGPT Plus ($20/month) gives you ~150 messages every few hours on GPT-5, which works out to very roughly $0.04-0.15 per message depending on length. Raw API at GPT-5 pricing would cost similar or less per message, but the Plus plan bundles capped usage at a flat price with no surprise bills. For high-volume users, API access plus bring-your-own-client tools like Claude Code or Cursor often ends up cheaper per message than the consumer subscription, while being more flexible. Tycoon bills a flat subscription ($Y/month) that bundles inference cost for normal usage.

Are open-source models cheaper than API models?

On paper yes, in practice it depends. Llama 3.3 70B or DeepSeek V3.2 via Together/Fireworks/Groq run at roughly 10-20% the cost of Claude 4.5 Sonnet. Self-hosting can be even cheaper at high volume (you pay for GPUs, not per token), but self-hosting adds ops complexity, latency variability, and capacity management that most orgs underestimate. Rule of thumb: below ~1B tokens/month, use hosted APIs (proprietary or open-source via vendors); above that, model the math on self-hosting against your specific latency and reliability needs.
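The "model the math" step above can be sketched as a back-of-envelope break-even comparison. All figures here (blended API price, GPU rental rate, flat ops overhead) are illustrative assumptions, not benchmarks; substitute your own numbers:

```python
# Back-of-envelope hosted-vs-self-hosted break-even sketch.
# The $0.60/M blended API price, $2.50/hr GPU rate, and $5,000/month ops
# overhead are illustrative assumptions, not benchmarks.

def api_monthly_cost(mtok_per_month: float, blended_price: float) -> float:
    """Hosted API cost: millions of tokens x blended $ per million tokens."""
    return mtok_per_month * blended_price

def selfhost_monthly_cost(gpus: int, gpu_hourly: float,
                          ops_overhead: float = 5_000.0) -> float:
    """GPU rental for a 30-day month plus a flat assumed ops overhead."""
    return gpus * gpu_hourly * 24 * 30 + ops_overhead

# Assumptions: 4 GPUs at $2.50/hr; $0.60/M blended open-source API price.
for mtok in (500, 5_000, 30_000):
    api = api_monthly_cost(mtok, 0.60)
    own = selfhost_monthly_cost(4, 2.50)
    winner = "API" if api < own else "self-host"
    print(f"{mtok:>6,}M tok/mo  API ${api:>9,.0f}  self-host ${own:>9,.0f}  -> {winner}")
```

Under these assumed numbers the crossover sits well above 1B tokens/month, which is why the rule of thumb says to run this math only once you are past that scale, and to include latency and reliability requirements, not just dollars.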

How is inference cost trending?

Declining ~40-60% per year at each quality tier for the last three years, and the trend is continuing. Claude 3 Opus in 2024 cost $15/$75 per million; the equivalent quality today is Claude 4.5 Sonnet at $3/$15 — 5x cheaper in two years. The causes: architectural improvements (smaller effective models matching larger old ones), hardware improvements (H100, B200 GPUs, custom accelerators like Trainium and TPU v5), and intense competitive pricing. This makes token-hungry architectures (multi-agent systems, long-context RAG, aggressive tool use) increasingly affordable year over year.
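The two-year Opus-to-Sonnet comparison implies a compound annual decline rate that can be checked directly. A quick sketch using only the figures quoted in the answer above:

```python
# Check the implied annual decline from the 2024 -> 2026 comparison above:
# $75 -> $15 per million output tokens at equivalent quality, over two years.
old_price, new_price, years = 75.0, 15.0, 2

# Compound annual decline rate r such that new = old * (1 - r)**years.
r = 1 - (new_price / old_price) ** (1 / years)
print(f"Implied annual decline: {r:.0%}")  # ~55%, inside the 40-60% band
```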

Run your one-person company.

Hire your AI team in 30 seconds. Start for free.

Free to start · No credit card required · Set up in 30 seconds