Cost & Latency
LLM economics decide whether a feature ships. A great feature that costs $5 per use is broken; a useful feature that responds in 30 seconds is broken. This article is about getting cost and latency under control without giving up quality.
Why cost is hard
LLM cost is a function of:
- Tokens in (input).
- Tokens out (output).
- Model choice.
- Frequency of calls.
- Hidden tokens (reasoning, tool definitions, system prompts).
Innocent-looking choices can 100× the bill:
- A 10k-token system prompt repeated on every call.
- A debug flag accidentally left on, doubling tokens.
- A retry loop hitting 5× cost per failed call.
- A reasoning model on a task that didn’t need one.
Pricing intuition (early 2026)
Rough order-of-magnitude:
| Tier | Input $/1M tok | Output $/1M tok | Examples |
|---|---|---|---|
| Frontier | $5–$30 | $15–$120 | Opus, GPT-5, Gemini Ultra |
| Mid | $1–$5 | $3–$15 | Sonnet, GPT-4o, Gemini Pro |
| Cheap | $0.10–$1 | $0.30–$3 | Haiku, GPT-4o-mini, Gemini Flash |
| Local self-hosted | depends on hardware | depends on hardware | LLaMA, Mistral, Qwen |
Output tokens are typically 3–4× the price of input. Long completions are expensive.
For reasoning models, internal reasoning tokens count as output — a query “thinking” for 10k tokens is paying output rate on those.
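To make the table concrete, per-request cost is just token counts times the per-token rates. A quick sanity check, using illustrative mid-tier prices from the table (not live pricing):

```python
# Rough per-request and monthly cost, using illustrative mid-tier prices
# from the table above (not live pricing).
INPUT_PRICE_PER_MTOK = 3.00    # $ per 1M input tokens
OUTPUT_PRICE_PER_MTOK = 15.00  # $ per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * INPUT_PRICE_PER_MTOK
            + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000

# 2,000 tokens in, 400 out: ~$0.012 per request; ~$12,000/month at 1M requests.
per_request = request_cost(2_000, 400)
print(f"${per_request:.4f} per request, ${per_request * 1_000_000:,.0f}/month at 1M requests")
```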
Latency intuition
For a typical chat-completion call:
- Time to first token (TTFT): 100–500ms (frontier APIs), faster with prompt caching.
- Tokens per second (TPS): 30–100 for frontier, 100–500 for smaller models or specialty hardware (Groq, Cerebras).
- Total latency for 500 output tokens: 1–10 seconds.
For reasoning models with long thinking traces: 10–60+ seconds is common.
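These numbers give a useful back-of-envelope model: total latency is roughly TTFT plus output tokens divided by TPS. A minimal sketch, with TTFT and TPS assumed from the ranges above:

```python
# Rough end-to-end latency: time to first token plus generation time.
# The default TTFT and TPS values are assumptions within the ranges above.
def estimated_latency_s(output_tokens: int, ttft_s: float = 0.4, tps: float = 60) -> float:
    return ttft_s + output_tokens / tps

# 500 output tokens at 60 tok/s: ~8.7 seconds end to end.
print(estimated_latency_s(500))
```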
Reducing tokens
The single biggest cost lever.
Trim system prompts
Common waste:
- Long, unfocused instructions (“be helpful, be polite, be concise, be honest, …”).
- Verbose few-shot examples that could be shorter.
- Tool definitions for tools never used.
- Repeated explanations.
A focused 500-token system prompt often outperforms a 5000-token sprawling one.
Use prompt caching
Anthropic, OpenAI, and Gemini all support caching of static prompt prefixes; cache hits are billed at roughly 10% of the normal input price (exact discounts vary by provider).
Structure prompts so static content is at the front.
[STATIC: system prompt, tool defs, RAG context that's reused] <- cached
[DYNAMIC: current user message] <- not cached
For long-context apps (load a doc, ask many questions), caching is transformative.
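A minimal sketch of the static/dynamic split using Anthropic's cache_control marker (the model name is a placeholder and the static prompt is a stand-in; OpenAI and Gemini expose equivalent mechanisms with different syntax):

```python
import anthropic

# Placeholders standing in for your real static prefix and dynamic input.
LONG_STATIC_SYSTEM_PROMPT = "...system prompt, tool docs, reused RAG context..."
user_message = "What changed in the Q3 report?"

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-latest",  # placeholder; use a current model ID
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_STATIC_SYSTEM_PROMPT,
            # Marks everything up to this block as cacheable across calls.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": user_message}],  # dynamic, not cached
)
```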
Trim retrieval context
If you retrieve 10 chunks but only 2 matter, you’re paying for 8. Tighten:
- Better retrieval / reranking.
- Dynamic k (fewer for clear queries, more for ambiguous).
- Compact context: extract relevant sentences, not full chunks.
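A sketch of that tightening, using hypothetical retrieve() and rerank() helpers; the threshold and k values are arbitrary:

```python
def build_context(query: str, retrieve, rerank,
                  max_k: int = 10, min_score: float = 0.6) -> str:
    # Hypothetical helpers: retrieve() returns candidate chunks,
    # rerank() scores a (query, chunk) pair between 0 and 1.
    candidates = retrieve(query, k=max_k)
    scored = sorted(((rerank(query, c), c) for c in candidates),
                    key=lambda pair: pair[0], reverse=True)
    # Keep only chunks the reranker rates as relevant; dynamic k falls out
    # naturally: clear queries keep 1-2 chunks, ambiguous ones keep more.
    return "\n\n".join(chunk for score, chunk in scored if score >= min_score)
```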
Shorter outputs
- Tell the model to be concise.
- Use schemas that limit output.
- Hard-cap with `max_tokens`.
Right-sizing the model
Don’t use Opus when Haiku works.
Tiered routing
Cheap-then-expensive:
```python
def respond(query):
    # Try the cheap model first; escalate only when confidence is low.
    answer = haiku.complete(query)
    if classify_confidence(answer) < threshold:
        answer = sonnet.complete(query)
    return answer
```
Or topic-based routing:
```python
def respond(query):
    # Send known-hard topics straight to the stronger model.
    if classify_topic(query) in HARD_TOPICS:
        return sonnet.complete(query)
    return haiku.complete(query)
```
Often saves 60–80% of cost without quality loss.
Distillation
Train a small fine-tuned model to match a frontier model on your task. 10–100× cheaper inference; comparable quality on the trained distribution.
Self-hosted
At high volume, self-hosting beats APIs:
- $5K/month rented GPU vs. $50K/month API for the same throughput.
- Break-even is roughly hundreds of millions of tokens/month.
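The break-even math is one line; the GPU figure is from above and the blended API price is an assumption:

```python
# Break-even volume for self-hosting, using the $5K/month GPU figure above and
# an assumed blended API price of $10 per 1M tokens (mix of input and output).
gpu_cost_per_month = 5_000      # $/month, rented GPU (from above)
api_price_per_mtok = 10.00      # $/1M tokens, blended (assumption)

break_even_mtok = gpu_cost_per_month / api_price_per_mtok
print(f"Break-even at ~{break_even_mtok:,.0f}M tokens/month")  # ~500M tokens/month
```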
Reducing latency
Streaming
Show tokens as they arrive. For chat, this is the difference between “broken” and “great” UX.
```python
for chunk in client.stream(...):
    yield chunk.delta.text
```
Reduces perceived latency dramatically even when total latency is the same.
Smaller models / specialty hardware
For latency-critical paths:
- Smaller models = faster.
- Groq, Cerebras, SambaNova offer LLM inference at hundreds of TPS — sometimes 5–10× faster than mainstream providers.
Parallel requests
If you need 5 independent generations, make them concurrent:
```python
results = await asyncio.gather(*[llm.complete(prompt) for prompt in prompts])
```
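A slightly fuller sketch with bounded concurrency, so a burst of parallel calls does not trip provider rate limits (llm.complete stands in for your async client call):

```python
import asyncio

async def complete_all(llm, prompts, max_concurrency: int = 5):
    # Bound concurrency so a burst of parallel calls stays under rate limits.
    sem = asyncio.Semaphore(max_concurrency)

    async def one(prompt):
        async with sem:
            return await llm.complete(prompt)  # stand-in for your async client call

    return await asyncio.gather(*(one(p) for p in prompts))

# results = asyncio.run(complete_all(llm, prompts))
```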
Speculative decoding
A small “draft” model proposes tokens; the big model verifies in parallel. 2–3× speedup with no quality change.
vLLM and other inference engines support this with minimal config.
Caching at the result level
For exact-match queries, cache full responses. Common in B2B FAQ-style apps.
For semantic-match: embed the query; if a similar past query has a cached answer, reuse.
```python
similar = cache.search(embed(query), threshold=0.95)
if similar:
    return cached_response[similar.id]
```
Watch for: stale answers, leakage across users, false-positive reuse.
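A self-contained version of the idea, using cosine similarity over stored query embeddings; embed() stands in for your embedding call and 0.95 is an arbitrary threshold:

```python
import numpy as np

class SemanticCache:
    # Minimal in-memory semantic cache; embed() returns a vector for a string.
    def __init__(self, embed, threshold: float = 0.95):
        self.embed, self.threshold = embed, threshold
        self.vectors, self.responses = [], []

    def get(self, query: str):
        if not self.vectors:
            return None
        q = self.embed(query)
        sims = [np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))
                for v in self.vectors]
        best = int(np.argmax(sims))
        return self.responses[best] if sims[best] >= self.threshold else None

    def put(self, query: str, response: str):
        self.vectors.append(self.embed(query))
        self.responses.append(response)
```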
Prefix caching (vLLM)
Self-hosted inference engines cache the KV computation of prompt prefixes. Same idea as API prompt caching, server-side.
Reducing tool / RAG latency
For agents and RAG:
- Parallel tool calls: when independent, call simultaneously.
- Cached retrieval: if many queries use the same retrievals, cache.
- Async pipelines: pre-fetch likely-needed context.
- Pre-compute embeddings once at index time.
Batching
For batch use cases (offline eval, bulk processing), use batch APIs:
- OpenAI Batch API: 50% off with a 24-hour SLA (flow sketched after this list).
- Anthropic Message Batches: 50% off with a 24-hour SLA.
- Self-hosted: process queues with continuous batching (vLLM does this natively).
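A sketch of the OpenAI batch flow (endpoint and file format as documented at the time of writing; check the current docs before relying on the details):

```python
from openai import OpenAI

client = OpenAI()

# requests.jsonl: one JSON object per line, each with custom_id, method, url, body.
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",   # the 50%-off, 24-hour SLA tier
)
print(batch.id, batch.status)
```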
Cost of reasoning models
Reasoning is expensive — internal thinking tokens are paid. A query that “thinks” for 10k tokens before producing 200 visible tokens charges for 10,200 tokens.
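At an assumed $15 per 1M output tokens, the difference is stark:

```python
# Assumed output rate of $15 per 1M tokens, within the frontier range above.
print(10_200 * 15 / 1_000_000)  # ~$0.153 per query with reasoning tokens
print(200 * 15 / 1_000_000)     # ~$0.003 for the visible answer alone
```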
When NOT to use reasoning:
- Simple lookups.
- Tasks where regular models do fine.
- Latency-critical paths.
When to use reasoning:
- Multi-step math, code, logic.
- Tasks that benefit from search-and-verify.
- High-stakes single queries where cost is acceptable.
Cost monitoring
We covered this in observability-and-tracing.md. The practical essentials:
- Per-feature attribution: tag requests; aggregate cost.
- Anomaly alerts: cost > 2× last week → investigate.
- Per-customer cost: critical for B2B with usage tiers.
- Pre-launch projections: estimate cost at expected scale before launching.
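A minimal per-feature attribution pattern, assuming you already record token counts per request; the feature names and prices are placeholders:

```python
from collections import defaultdict

# Placeholder prices; swap in your model's real rates.
INPUT_PRICE, OUTPUT_PRICE = 3.00, 15.00   # $ per 1M tokens

cost_by_feature = defaultdict(float)

def record(feature: str, input_tokens: int, output_tokens: int) -> None:
    # Tag every request with the feature that triggered it, then aggregate.
    cost_by_feature[feature] += (input_tokens * INPUT_PRICE
                                 + output_tokens * OUTPUT_PRICE) / 1_000_000

record("support_bot", 2_000, 400)
record("summarizer", 8_000, 1_000)
print(dict(cost_by_feature))
```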
Practical playbook
For a new LLM feature, in order:
- Build with the right model. Don’t over- or under-buy.
- Cache static prefixes (prompt caching).
- Trim system prompts.
- Stream output.
- Cache exact-match responses for repeat queries.
- Set cost and latency alerts.
- Measure before optimizing.
- At scale, consider self-hosted + distillation.
Pitfalls
- Over-engineering before measurement: optimizing latency that’s already fine.
- Forgetting reasoning token cost.
- Cache-key bugs: cache hits across users when they shouldn’t.
- Stale caches: old prompts cached after a system prompt change.
- Hidden retries: timeout handling that retries 3× silently.
Watch it interactively
- Cost & Latency Calculator — model picker + token sliders + optimization toggle (prompt cache, batching, speculative decoding, quantization). Predict before clicking: at 2000 input + 400 output tokens, switching from Opus to Haiku drops $/req ~10×. Adding prompt cache to Sonnet drops it another ~2× when 50% of the prefix is shared.
- Quantization Lab — slide bits from 16 → 8 → 4 → 2; watch RMSE rise and memory drop. The math (q = round(v/scale + zero)) is on the page.
- KV Cache — toggle caching off; watch per-step compute go quadratic.
- Observability Trace — perturbation toggle showing what “slow LLM” or “extra retry” looks like in the trace.
Build it in code
- /ship/14 — cost and latency tuning — five levers ranked by ROI (prompt cache, prefix cache, continuous batching, AWQ quantization, speculative decoding). Real benchmark table: 4× cheaper, 4× faster, quality flat.
- /case-studies/04 — customer-support bot — a real product running with all these levers applied; the ROI math (~$23K/month net for 5K conversations) is the payoff.