Training a frontier LLM costs hundreds of millions to billions of dollars but happens once. Inference happens trillions of times — every ChatGPT query, every Copilot suggestion, every agent turn. At scale, inference is the dominant line item. This is why model providers compete aggressively on inference pricing and why every production AI architecture eventually becomes an exercise in token optimization.
Typical 2026 pricing tiers, quoted as input/output price per million tokens. Frontier tier: Claude Opus 4.5 ($15/$75), GPT-5 Pro (similar range). Used for the hardest reasoning, research tasks, and critical decisions. Workhorse tier: Claude Sonnet 4.5 ($3/$15), GPT-5 ($3-5/$15-20), Gemini 2.5 Pro ($1.25/$10). This is the default for production applications: strong quality at reasonable cost. Cheap tier: Claude Haiku 4.5, GPT-5 Mini, Gemini 2.5 Flash, at $0.15-0.50/$0.60-2.50. Used for high-volume simple tasks. Open-source inference: DeepSeek V3.2, Llama 3.3 70B, Qwen 2.5, at $0.10-0.40/$0.30-1.50 through API providers like Together, Groq, and Fireworks. These prices are declining ~50% per year, so what's expensive today will be routine next year.
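Input vs output pricing matters. Output tokens typically cost 3-5x more than input tokens because generating each one is serial: the model predicts token by token, so latency and compute are tied to output length. Input tokens can be processed in parallel and are cheaper per token. This has architectural consequences, but the comparison has to be made on total tokens rather than per-token price. At $3/$15 per million, 100K tokens of context costs about $0.30, the same as roughly 20K output tokens. So 'send 100K tokens of context to get a short answer' beats 'send a short prompt to get a long answer' on cost only when the long answer would run past that break-even. On latency the long-context request often still wins, because the context is processed in parallel while a long answer is generated one token at a time.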
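The trade-off can be checked for a specific workload by pricing both options. The sketch below is illustrative only: the function names are hypothetical, and the $3/$15 defaults are simply the workhorse-tier figures quoted earlier, not a statement about any provider's current price list.

```python
# Prices are USD per million tokens. The 3.0 / 15.0 defaults are the
# workhorse-tier numbers quoted in the text, used here as assumptions.
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float = 3.0, output_price: float = 15.0) -> float:
    """Cost in USD of a single request."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def breakeven_output_tokens(input_tokens: int,
                            input_price: float = 3.0,
                            output_price: float = 15.0) -> float:
    """Number of output tokens that cost as much as `input_tokens` of input."""
    return input_tokens * input_price / output_price


# Option A: large context, short answer. Option B: short prompt, long answer.
long_context = request_cost(input_tokens=100_000, output_tokens=300)
long_answer = request_cost(input_tokens=500, output_tokens=4_000)

print(f"long context, short answer: ${long_context:.4f}")
print(f"short prompt, long answer:  ${long_answer:.4f}")
print(f"break-even for 100K input:  {breakeven_output_tokens(100_000):,.0f} output tokens")
```

With these assumed token counts the long-context request is the more expensive one. The per-token discount on input only pays off when the alternative answer is very long.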
Prompt caching changes the equation. Anthropic, OpenAI, and Google offer caching for static prefixes: repeated content (system prompts, document corpora, tool definitions) cached on the provider's side costs ~10-25% of normal input tokens and has much lower latency. For agents that hit the same large system prompt on every turn, this is roughly a 4-10x reduction in the cost of those prefix tokens. The saving on the whole bill is smaller, since uncached input and output are still billed at full price. Any production agent architecture should aggressively structure prompts to maximize cache hits, which in practice means putting stable content first and per-turn content last.
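A rough way to estimate what caching is worth is to price the static prefix separately from the per-turn content. This is a minimal sketch under stated assumptions: the function name is hypothetical, cache reads are billed at 10% of the input price, the first turn pays a 1.25x surcharge to write the cache, and the cache never expires mid-conversation. Actual ratios and expiry rules vary by provider.

```python
def conversation_input_cost(static_tokens: int, dynamic_tokens: int, turns: int, *,
                            cached: bool,
                            input_price: float = 3.0,        # USD per million tokens (assumed)
                            cache_read_ratio: float = 0.10,  # assumed cache-hit price ratio
                            cache_write_ratio: float = 1.25  # assumed cache-write surcharge
                            ) -> float:
    """Input-side cost in USD of a conversation with `turns` turns (turns >= 1)."""
    if not cached:
        # Every turn resends the full prefix at the normal input price.
        return (static_tokens + dynamic_tokens) * turns * input_price / 1_000_000
    write = static_tokens * input_price * cache_write_ratio             # first turn fills the cache
    reads = static_tokens * input_price * cache_read_ratio * (turns - 1)  # later turns hit it
    dynamic = dynamic_tokens * input_price * turns                      # never cached
    return (write + reads + dynamic) / 1_000_000


uncached = conversation_input_cost(10_000, 1_000, 20, cached=False)
cached = conversation_input_cost(10_000, 1_000, 20, cached=True)
print(f"uncached: ${uncached:.4f}  cached: ${cached:.4f}  ratio: {uncached / cached:.1f}x")
```

Under these assumptions the input bill for a 20-turn conversation falls from about $0.66 to about $0.15. The overall ratio is lower than the per-token cache discount because the dynamic tokens are still billed in full.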
Batch inference is the other big saver. Most providers offer batch APIs (OpenAI Batch, Anthropic Message Batches) that run asynchronously and cost 50% of normal. If your work isn't latency-sensitive — overnight processing, offline analysis, bulk content generation — batch can cut bills in half.
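How much of the bill batch can actually remove depends on what share of the work tolerates delay. The sketch below uses made-up job names and dollar amounts purely for illustration, and applies the 50% batch price mentioned above to the jobs marked as not latency-sensitive.

```python
from dataclasses import dataclass
from typing import List

BATCH_PRICE_RATIO = 0.50  # batch billed at 50% of normal, per the text above


@dataclass
class Job:
    name: str
    cost_usd: float          # monthly cost at standard synchronous pricing
    latency_sensitive: bool  # True if a user is waiting on the result


def monthly_bill(jobs: List[Job], use_batch: bool) -> float:
    """Total bill, sending only non-latency-sensitive jobs through batch."""
    total = 0.0
    for job in jobs:
        if use_batch and not job.latency_sensitive:
            total += job.cost_usd * BATCH_PRICE_RATIO
        else:
            total += job.cost_usd
    return total


# Hypothetical workload mix.
jobs = [
    Job("chat replies", 400.0, latency_sensitive=True),
    Job("nightly summaries", 300.0, latency_sensitive=False),
    Job("bulk tagging", 300.0, latency_sensitive=False),
]

print(f"all synchronous: ${monthly_bill(jobs, use_batch=False):.2f}")
print(f"with batch:      ${monthly_bill(jobs, use_batch=True):.2f}")
```

In this example only 60% of the spend is batchable, so the total drops by 30% rather than 50%.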
The 'token math' of a production agent. A typical Tycoon AI employee turn: ~5-15K cached system prompt (billed at cache-hit rate), ~2-10K retrieved project memory, ~500-3000 tokens of conversation history, ~500-3000 tokens of output, sometimes 1-5 tool calls each adding their own tokens. Using Claude Sonnet 4.5 with caching, a typical turn costs $0.02-0.15. A founder having 30 AI-employee interactions per day costs $0.60-4.50/day, or roughly $18-135 over a 30-day month. The low end fits inside the $20-100/month pricing tier, but the high end does not. Scale to a thousand founders and tens of thousands of interactions per day, and the arithmetic quickly explains why production AI economics demand continuous token optimization.
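The per-turn arithmetic can be reproduced from the token ranges above. This is a sketch under assumptions, not the actual billing model: it uses $3/$15 per million tokens, bills cached tokens at 10% of the input price, and leaves out tool-call tokens entirely, so its results land at or below the low end of the per-turn range quoted in the text.

```python
INPUT_PRICE = 3.00       # USD per million input tokens (assumed)
OUTPUT_PRICE = 15.00     # USD per million output tokens (assumed)
CACHE_READ_RATIO = 0.10  # cached tokens billed at 10% of input price (assumed)


def turn_cost(cached_prompt: int, memory: int, history: int, output: int) -> float:
    """Cost in USD of one agent turn, excluding tool-call tokens."""
    cached = cached_prompt * INPUT_PRICE * CACHE_READ_RATIO
    fresh = (memory + history) * INPUT_PRICE
    generated = output * OUTPUT_PRICE
    return (cached + fresh + generated) / 1_000_000


# Low and high ends of the token ranges given in the text.
low = turn_cost(cached_prompt=5_000, memory=2_000, history=500, output=500)
high = turn_cost(cached_prompt=15_000, memory=10_000, history=3_000, output=3_000)

TURNS_PER_DAY = 30
DAYS_PER_MONTH = 30
print(f"per turn:  ${low:.4f} to ${high:.4f}")
print(f"per day:   ${low * TURNS_PER_DAY:.2f} to ${high * TURNS_PER_DAY:.2f}")
print(f"per month: ${low * TURNS_PER_DAY * DAYS_PER_MONTH:.2f} "
      f"to ${high * TURNS_PER_DAY * DAYS_PER_MONTH:.2f}")
```

Adding tool calls means adding their input and output tokens to the same formula. Each call typically re-sends context, which is why the real per-turn range sits higher.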
Multi-model routing — sending easy queries to cheap models and hard ones to expensive ones — can cut bills 3-10x for suitable workloads. Tools like Portkey, OpenRouter, and LiteLLM facilitate routing. Some applications use a 'small model first, escalate if uncertain' pattern to minimize frontier-model calls. For long-context workloads, caching plus batch plus routing can collectively cut costs 5-20x versus naive implementations.
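The 'small model first, escalate if uncertain' pattern can be sketched as a thin wrapper around two model callables. Everything here is hypothetical: the `Answer` type, the confidence score (which in practice might come from the model itself or from a separate verifier), the 0.8 threshold, and the stub models that stand in for real API calls. It does not reflect the interface of Portkey, OpenRouter, or LiteLLM.

```python
from typing import Callable, NamedTuple, Tuple


class Answer(NamedTuple):
    text: str
    confidence: float  # 0.0 to 1.0; how this is obtained is an assumption


Model = Callable[[str], Answer]


def route(query: str, cheap: Model, strong: Model,
          threshold: float = 0.8) -> Tuple[str, str]:
    """Try the cheap model first; escalate only when it is not confident."""
    first = cheap(query)
    if first.confidence >= threshold:
        return "cheap", first.text
    return "strong", strong(query).text


# Stub models for illustration only. A real system would call provider APIs.
def cheap_stub(query: str) -> Answer:
    # Pretend the small model is confident only on short queries.
    confident = len(query.split()) <= 6
    return Answer(f"cheap: {query}", 0.9 if confident else 0.4)


def strong_stub(query: str) -> Answer:
    return Answer(f"strong: {query}", 0.95)


print(route("Is this spam?", cheap_stub, strong_stub))
print(route("Draft a migration plan for moving our billing system to a new provider",
            cheap_stub, strong_stub))
```

An escalated query pays for both calls, so the pattern only saves money when the cheap model handles most traffic on its own.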
Tycoon's architecture assumes inference cost is the main unit-economics driver. Each AI employee uses a tiered model strategy: Haiku-class for quick classifications, Sonnet-class for most work, Opus-class for the hardest planning and review. System prompts and project memory are aggressively cached. Most work happens at Sonnet pricing, with Opus calls reserved for true judgment moments, to keep per-employee inference costs within the subscription margin.