Learn

What is an LLM Token?

The atomic unit of LLM cost, context, and speed.


Updated Apr 2026
Short answer

A token is the unit of text an LLM processes — typically a subword produced by the model's tokenizer. For English, one token averages about 0.75 words or 4 characters. A 1000-token prompt is roughly 750 words. Everything about LLMs is measured in tokens: context window size, API pricing, throughput, latency. Understanding tokens is the difference between reasoning about LLMs correctly and burning money unnecessarily.

In depth

An LLM doesn't read text directly. A tokenizer first chops the input into a sequence of tokens drawn from a vocabulary of 50,000 to 200,000 entries, and each token is an integer ID the model processes. OpenAI's cl100k (GPT-4-era) and o200k (GPT-4o/5) tokenizers use byte pair encoding (BPE), which learns common subwords from training data. 'The cat sat' becomes [464, 3797, 3332], three tokens; 'unbelievable' becomes [403, 6667, 11203], also three. Rare or non-English text uses more tokens per character.

Tokens matter for three reasons:

  • Context window: every model has a maximum number of tokens it can consider in one call. Claude 4.5 supports 200K tokens; Gemini 2.5 up to 2M; GPT-5 up to 400K depending on tier. Going over the limit truncates or fails. For scale, 200K tokens is about 500 pages of text.
  • Pricing: every commercial API bills input and output tokens separately. As of 2026, Claude 4.5 Sonnet runs around $3/$15 per million tokens (input/output), GPT-5 around $5/$15, DeepSeek V3 around $0.27/$1.10. Output tokens are 3-5x more expensive than input tokens on most providers because they are more compute-intensive to generate.
  • Latency: time-to-first-token depends on prompt length, since the model must process all input before starting output; throughput after that depends on output token count. A 10,000-input/200-output call is fast; a 200-input/2,000-output call is slow.

Tokenization has subtle traps. Different providers use different tokenizers, so the same text is a different number of tokens on Claude vs GPT vs Gemini. Non-English text is dramatically more expensive token-wise; Chinese can take 2-3x more tokens than the equivalent English. Code and JSON are token-heavy because symbols and punctuation each take a token. When budgeting for an agent app, count tokens with the actual tokenizer you'll use (tiktoken for OpenAI, Anthropic's count_tokens endpoint for Claude), not rough word counts.

Common token math: 1K tokens ≈ 750 English words ≈ 1 page of prose.
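The pricing arithmetic is worth making concrete. A minimal sketch, using the illustrative per-million prices quoted in this section (real prices change; check each provider's pricing page before relying on these numbers):

```python
# Illustrative prices from the text above: (input $/M tokens, output $/M tokens).
PRICES = {
    "claude-4.5-sonnet": (3.00, 15.00),
    "gpt-5": (5.00, 15.00),
    "deepseek-v3": (0.27, 1.10),
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one API call: tokens / 1M * price per million."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# A RAG-style call: 10K tokens of context in, a 300-token answer out.
cost = call_cost("claude-4.5-sonnet", 10_000, 300)
```

The same call routed to DeepSeek V3 is roughly an order of magnitude cheaper, which is the arithmetic behind the model-routing strategies discussed below.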
A typical 5-minute conversation with Claude is 2-5K tokens. A RAG query with 10 retrieved chunks is 5-15K tokens of input. Summarizing a 50-page PDF is 40-60K input plus 2-3K output. Daily usage for an active AI-employee user typically runs 100K-500K tokens per day; at Claude 4.5 Sonnet prices that's roughly $1-$5 per day per active user, which is why managing token budgets is central to unit economics.

Four optimizations matter most:

  • Prompt caching: Anthropic and OpenAI both support caching large static prefixes so you pay full price only the first time. For RAG systems with big system prompts, this can cut costs 80%+.
  • Context compression: summarize old conversation turns into short paragraphs rather than sending them verbatim.
  • Model routing: use cheap models (Haiku, GPT-5 mini, DeepSeek) for routine tasks and reserve expensive ones for hard reasoning.
  • Output length caps: many prompts produce answers 5x longer than needed; 'answer in 50 words' saves half the output cost.

For Tycoon, tokens are the unit of business. Every call to Astra costs tokens; every RAG retrieval costs tokens; every tool call costs tokens. The product economics depend on token efficiency: we aggressively cache system prompts, route low-complexity tasks to cheaper models, and compress old conversation context. Users never see 'tokens' in the UI, but the unit is doing the work underneath.
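The model-routing optimization can be sketched in a few lines. The model names and thresholds here are illustrative assumptions, and the 0-1 complexity score is assumed to come from an upstream classifier:

```python
def route_model(task_complexity: float) -> str:
    """Pick the cheapest model likely to handle a task acceptably.
    Thresholds and model names are placeholder assumptions, not a spec."""
    if task_complexity < 0.3:
        return "claude-haiku"    # routine: extraction, formatting, short replies
    if task_complexity < 0.7:
        return "claude-sonnet"   # standard chat and RAG answers
    return "claude-opus"         # hard multi-step reasoning and planning
```

The design point is that routing happens before the expensive call, so the cheap path is the default and the expensive model is the exception.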

Examples

  • 'The quick brown fox' = 4 tokens in cl100k, often 4-5 in other tokenizers
  • 'unbelievable' = 3 tokens (un-believ-able) — compound words split into subwords
  • Chinese '你好世界' (hello world) = 6-8 tokens depending on tokenizer — 2x English overhead
  • A 100-line Python function = 400-800 tokens; symbols and whitespace each consume tokens
  • OpenAI's gpt-4o o200k tokenizer is more efficient than cl100k for non-English — up to 4x fewer tokens for Chinese
  • Prompt caching on Anthropic: 10K-token system prompt cached once, reused across thousands of calls at 10% cost
  • A full Claude 4.5 context window (200K tokens) = roughly 150K words = 500 pages of text
  • Tycoon Astra's average chat turn: 3K input tokens (system + memory + current message) + 500 output tokens


Frequently asked questions

How do I count tokens before making an API call?

Use the provider's tokenizer. For OpenAI, the tiktoken Python library is the reference. For Anthropic, the `/v1/messages/count_tokens` endpoint counts exactly. For open-source models, Hugging Face's transformers library has the tokenizer bundled. Word-count-based estimates (like 'tokens ≈ words × 1.33') are rough — off by 10-30% for anything non-English or symbol-heavy. For production, always count with the real tokenizer. A common bug is estimating based on English prose, shipping the feature, and blowing the budget when users send Chinese or code.
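For quick planning, the words × 1.33 rule of thumb mentioned above can be written down directly. This is a heuristic only; production code should call the real tokenizer as described:

```python
def rough_token_estimate(text: str) -> int:
    """English-prose heuristic: tokens ~ words * 1.33.
    Off by 10-30% for non-English, code, or symbol-heavy text."""
    return round(len(text.split()) * 1.33)
```

Fine for sizing a prompt budget on a napkin; never for billing estimates or context-window-limit checks.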

Why are output tokens more expensive than input tokens?

Because generating output tokens requires running the model autoregressively: one forward pass per output token, each dependent on the previous. Input tokens are processed in parallel in a single pass. Output throughput is the bottleneck; GPU time per output token is roughly 3-5x that of an input token, and providers pass this through in pricing. The implication: long prompts are usually cheap, long outputs are expensive. Optimize by constraining output length ('answer in 3 bullets'), not by shortening prompts. A RAG system that stuffs 10K tokens of context in exchange for a 300-token answer is spending well; generating 3K tokens of answer when 300 would do is wasted money.
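A back-of-envelope latency model makes the asymmetry concrete. The prefill and decode rates below are illustrative assumptions, not measured provider numbers:

```python
# Assumed throughput rates; real values vary by provider, model, and load.
PREFILL_TOK_PER_S = 5_000  # input tokens processed per second (parallel prefill)
DECODE_TOK_PER_S = 50      # output tokens per second (one forward pass each)

def call_seconds(input_tokens: int, output_tokens: int) -> float:
    """Rough wall-clock time: parallel prefill plus autoregressive decode."""
    time_to_first_token = input_tokens / PREFILL_TOK_PER_S
    decode_time = output_tokens / DECODE_TOK_PER_S
    return time_to_first_token + decode_time
```

Under these assumed rates, a 10K-input/200-output call takes about 6 seconds while a 200-input/2K-output call takes over 40, even though the second call moves far fewer total tokens.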

How big is a typical context window now?

As of early 2026: Claude 4.5 Sonnet and Opus both support 200K, with a 1M-token tier available; GPT-5 supports 400K; Gemini 2.5 up to 2M; DeepSeek V3 128K; open-source Llama 3.1 405B 128K. These are practical maximums: actual performance degrades as context fills due to 'lost in the middle' effects. Most production systems use 5K-50K of actual context per call even when the limit is higher. Large context is useful for document analysis and long-running conversations, but it is more expensive and slower than small context. RAG with short contexts usually beats naive large-context use.
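A pre-flight check against these limits is a common guard before sending a request. A minimal sketch, with the limits hard-coded from the figures above (they will drift; real code should read them from provider docs or API metadata):

```python
CONTEXT_LIMITS = {  # max tokens per call, early-2026 figures from the text
    "claude-4.5-sonnet": 200_000,
    "gpt-5": 400_000,
    "gemini-2.5": 2_000_000,
    "deepseek-v3": 128_000,
}

def fits(model: str, prompt_tokens: int, max_output_tokens: int) -> bool:
    """True if the prompt plus the reserved output budget fits the window."""
    return prompt_tokens + max_output_tokens <= CONTEXT_LIMITS[model]
```

Reserving the output budget up front matters: a prompt that exactly fills the window leaves no room for the model to answer.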

What's prompt caching and how much does it save?

Prompt caching lets you mark a prefix of your prompt (system message, RAG context, few-shot examples) as cacheable. On the first call the provider processes it normally; on subsequent calls within a cache TTL (5 minutes for Anthropic by default) the cached prefix is reused at a fraction of the price — 10% for Anthropic, 50% for OpenAI, both usually with a small first-time write cost. For systems with a large static prefix that gets reused across many user queries (customer support agents, RAG chatbots), caching cuts input costs 70-90%. It's the single biggest cost optimization in production LLM apps. Tycoon aggressively caches Astra's system prompt and loaded skills.
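The savings arithmetic can be sketched as follows. The 10% cached-read rate comes from the text above; the 25% one-time write premium is an assumption for illustration, so check current provider docs before relying on it:

```python
def input_token_cost(prefix_tokens: int, per_call_tokens: int, calls: int,
                     price_per_m: float, cached: bool = False) -> float:
    """Total input cost in dollars over a run of calls sharing one static prefix.
    Cached reads priced at 10% (per the text); 25% write premium is assumed."""
    if not cached:
        return (prefix_tokens + per_call_tokens) * calls * price_per_m / 1_000_000
    write = prefix_tokens * 1.25                 # first call writes the cache
    reads = prefix_tokens * 0.10 * (calls - 1)   # later calls read at 10%
    fresh = per_call_tokens * calls              # per-user text is never cached
    return (write + reads + fresh) * price_per_m / 1_000_000
```

With a 10K-token prefix, 500 fresh tokens per call, and 1,000 calls at $3/M, this comes to about $31.50 uncached versus about $4.53 cached, roughly an 86% reduction, consistent with the 70-90% range above.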

Why does the same text have different token counts on different models?

Because each model family uses its own tokenizer with its own vocabulary. OpenAI's o200k differs from cl100k; Anthropic has its own; Google's Gemini uses SentencePiece; Meta's Llama uses yet another. The same 100-word English paragraph might be 130 tokens on GPT-5 but 150 on Claude and 120 on Gemini. Token counts are roughly in the same ballpark for English but diverge significantly for non-English, code, or unusual formatting. For cost comparisons across providers, always convert to dollars using each provider's own tokenizer — comparing 'tokens' across providers is meaningless without that conversion.
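A minimal sketch of why dollar conversion matters, using the illustrative token counts from the paragraph above and assumed input prices; fewer tokens does not automatically mean a cheaper call:

```python
def dollars(tokens: int, price_per_million: float) -> float:
    """Convert a provider-specific token count into dollars."""
    return tokens * price_per_million / 1_000_000

# The same 100-word paragraph, counted by each provider's own tokenizer
# (token counts from the example above; prices are assumed input rates).
gpt5_cost = dollars(130, 5.00)    # fewer tokens, higher price per token
claude_cost = dollars(150, 3.00)  # more tokens, lower price per token
```

Here the GPT-5 count is smaller but the call costs more, which is exactly why comparing raw token counts across providers says nothing about cost.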

Run your one-person company.

Hire your AI team in 30 seconds. Start for free.

Free to start · No credit card required · Set up in 30 seconds