What is an LLM Token?
The atomic unit of LLM cost, context, and speed.
A token is the unit of text an LLM processes — typically a subword produced by the model's tokenizer. For English, one token averages about 0.75 words or 4 characters. A 1000-token prompt is roughly 750 words. Everything about LLMs is measured in tokens: context window size, API pricing, throughput, latency. Understanding tokens is the difference between reasoning about LLMs correctly and burning money unnecessarily.
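As a quick back-of-envelope sketch of that rule of thumb (the 0.75 ratio is an English-prose average; code, non-English text, and heavy markup drift well away from it):

```python
# Rough word <-> token conversion for English prose. Estimates only:
# for anything that touches billing, count with the model's real tokenizer.
WORDS_PER_TOKEN = 0.75  # English-prose average, not a guarantee

def estimate_tokens(word_count: int) -> int:
    return round(word_count / WORDS_PER_TOKEN)

def estimate_words(token_count: int) -> int:
    return round(token_count * WORDS_PER_TOKEN)

print(estimate_tokens(750))     # ~1000 tokens
print(estimate_words(200_000))  # ~150,000 words
```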
In depth
Examples
- 'The quick brown fox' = 4 tokens in cl100k, often 4-5 in other tokenizers
- 'unbelievable' = 3 tokens (un-believ-able) — compound words split into subwords
- Chinese '你好世界' (hello world) = 6-8 tokens depending on tokenizer — 2x English overhead
- A 100-line Python function = 400-800 tokens; symbols and whitespace each consume tokens
- OpenAI's gpt-4o o200k tokenizer is more efficient than cl100k for non-English — up to 4x fewer tokens for Chinese
- Prompt caching on Anthropic: 10K-token system prompt cached once, reused across thousands of calls at 10% cost
- A full Claude 4.5 context window (200K tokens) = roughly 150K words = 500 pages of text
- Tycoon Astra's average chat turn: 3K input tokens (system + memory + current message) + 500 output tokens
Frequently asked questions
How do I count tokens before making an API call?
Use the provider's tokenizer. For OpenAI, the tiktoken Python library is the reference. For Anthropic, the `/v1/messages/count_tokens` endpoint counts exactly. For open-source models, Hugging Face's transformers library has the tokenizer bundled. Word-count-based estimates (like 'tokens ≈ words × 1.33') are rough — off by 10-30% for anything non-English or symbol-heavy. For production, always count with the real tokenizer. A common bug is estimating based on English prose, shipping the feature, and blowing the budget when users send Chinese or code.
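A minimal counting sketch with tiktoken; the encoding names are the published ones for OpenAI's tokenizers, but which encoding a specific model uses should be checked against current docs:

```python
import tiktoken

def count_tokens(text: str, encoding_name: str = "o200k_base") -> int:
    """Exact token count under a given OpenAI encoding."""
    enc = tiktoken.get_encoding(encoding_name)
    return len(enc.encode(text))

prompt = "The quick brown fox jumps over the lazy dog."
print(count_tokens(prompt))                 # o200k count
print(count_tokens(prompt, "cl100k_base"))  # older GPT-4 / GPT-3.5 tokenizer

# For open-weight models, the tokenizer ships with the checkpoint, e.g.:
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
#   print(len(tok.encode(prompt)))
```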
Why are output tokens more expensive than input tokens?
Because generating output tokens requires running the model autoregressively: one forward pass per output token, each dependent on the previous. Input tokens are processed in parallel in a single pass, so output throughput is the bottleneck; GPU time per output token is roughly 3-5x that of an input token, and providers pass this through in pricing. The implication: long prompts are usually cheap, long outputs are expensive. Optimize by constraining output length ('answer in 3 bullets'), not by shortening prompts. A RAG system that stuffs 10K tokens of context to get a 300-token answer is well designed; generating 3K output tokens when 300 would do is wasted money.
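To make the asymmetry concrete, a worked estimate with placeholder prices (the per-token rates below are illustrative assumptions, not any provider's actual rate card):

```python
# Hypothetical prices for illustration only.
PRICE_PER_INPUT_TOKEN = 3.00 / 1_000_000    # assumed $3 per million input tokens
PRICE_PER_OUTPUT_TOKEN = 15.00 / 1_000_000  # assumed $15 per million output tokens

def call_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * PRICE_PER_INPUT_TOKEN + output_tokens * PRICE_PER_OUTPUT_TOKEN

# RAG-style call: big prompt, short answer. Cheap input tokens dominate.
print(f"${call_cost(10_000, 300):.4f}")    # $0.0345
# Same prompt, unconstrained 3,000-token answer. Output now dominates the bill.
print(f"${call_cost(10_000, 3_000):.4f}")  # $0.0750
```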
How big is a typical context window now?
As of early 2026: Claude 4.5 Sonnet and Opus at 200K, with a 1M-token tier available; GPT-5 at 400K; Gemini 2.5 up to 2M; DeepSeek V3 at 128K; open-source Llama 3.1 405B at 128K. These are practical maximums: actual performance degrades as the context fills due to 'lost in the middle' effects. Most production systems use 5K-50K of actual context per call even when the limit is higher. Large context is useful for document analysis and long-running conversations, but it is more expensive and slower than a small context. RAG with short contexts usually beats naive large-context use.
What's prompt caching and how much does it save?
Prompt caching lets you mark a prefix of your prompt (system message, RAG context, few-shot examples) as cacheable. On the first call the provider processes it normally; on subsequent calls within a cache TTL (5 minutes for Anthropic by default) the cached prefix is reused at a fraction of the price — 10% for Anthropic, 50% for OpenAI, both usually with a small first-time write cost. For systems with a large static prefix that gets reused across many user queries (customer support agents, RAG chatbots), caching cuts input costs 70-90%. It's the single biggest cost optimization in production LLM apps. Tycoon aggressively caches Astra's system prompt and loaded skills.
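A minimal sketch of marking a cacheable prefix with Anthropic's Messages API, assuming the cache_control syntax from their prompt-caching docs; the model ID is a placeholder, and exact parameters, TTLs, and pricing multipliers should be verified against current documentation:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "..."  # the large, static prefix worth caching

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder; use the current model ID
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Marks everything up to this point as a cacheable prefix.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)

# usage reports cache writes vs. cache reads, so savings can be monitored.
print(response.usage)
```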
Why does the same text have different token counts on different models?
Because each model family uses its own tokenizer with its own vocabulary. OpenAI's o200k differs from cl100k; Anthropic has its own; Google's Gemini uses SentencePiece; Meta's Llama uses yet another. The same 100-word English paragraph might be 130 tokens on GPT-5 but 150 on Claude and 120 on Gemini. Token counts are roughly in the same ballpark for English but diverge significantly for non-English, code, or unusual formatting. For cost comparisons across providers, always convert to dollars using each provider's own tokenizer — comparing 'tokens' across providers is meaningless without that conversion.
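One way to see the divergence locally is to run the same string through two OpenAI encodings; other providers' tokenizers need their own libraries or count endpoints, so the counts below are simply whatever these two vocabularies happen to produce:

```python
import tiktoken

text = "def greet(name): return f'你好, {name}!'"  # code plus non-English text

for name in ("cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    print(name, len(enc.encode(text)))
# The counts differ because each vocabulary merges text into different subwords;
# the gap is widest for non-English text, code, and unusual formatting.
```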