Cost & Latency
LLM economics decide whether a feature ships. A great feature that costs $5 per use is broken; a useful feature that responds in 30 seconds is broken. This article is about getting cost and latency under control without giving up quality.
Why cost is hard
LLM cost is a function of:
- Tokens in (input).
- Tokens out (output).
- Model choice.
- Frequency of calls.
- Hidden tokens (reasoning, tool definitions, system prompts).
Innocent-looking choices can 100× the bill:
- A 10k-token system prompt repeated on every call.
- A debug flag accidentally left on, doubling tokens.
- A retry loop hitting 5× cost per failed call.
- A reasoning model on a task that didn’t need one.
Pricing intuition (early 2026)
Rough order-of-magnitude:
| Tier | Input $/1M tok | Output $/1M tok | Examples |
|---|---|---|---|
| Frontier | $5–$30 | $15–$120 | Opus, GPT-5, Gemini Ultra |
| Mid | $1–$5 | $3–$15 | Sonnet, GPT-4o, Gemini Pro |
| Cheap | $0.10–$1 | $0.30–$3 | Haiku, GPT-4o-mini, Gemini Flash |
| Local self-hosted | depends on hardware | depends on hardware | LLaMA, Mistral, Qwen |
Output tokens are typically 3–4× the price of input. Long completions are expensive.
For reasoning models, internal reasoning tokens count as output — a query “thinking” for 10k tokens is paying output rate on those.
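To make the table concrete, per-request cost is just token counts times the per-token rates. A quick sanity check, using illustrative mid-tier prices from the table (not live pricing):

```python
# Rough per-request and monthly cost, using illustrative mid-tier prices
# from the table above (not live pricing).
INPUT_PRICE_PER_MTOK = 3.00    # $ per 1M input tokens
OUTPUT_PRICE_PER_MTOK = 15.00  # $ per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * INPUT_PRICE_PER_MTOK
            + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000

# 2,000 tokens in, 400 out: ~$0.012 per request; ~$12,000/month at 1M requests.
per_request = request_cost(2_000, 400)
print(f"${per_request:.4f} per request, ${per_request * 1_000_000:,.0f}/month at 1M requests")
```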
Latency intuition
For a typical chat-completion call:
- Time to first token (TTFT): 100–500ms (frontier APIs), faster with prompt caching.
- Tokens per second (TPS): 30–100 for frontier, 100–500 for smaller models or specialty hardware (Groq, Cerebras).
- Total latency for 500 output tokens: 1–10 seconds.
For reasoning models with long thinking traces: 10–60+ seconds is common.
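These numbers give a useful back-of-envelope model: total latency is roughly TTFT plus output tokens divided by TPS. A minimal sketch, with TTFT and TPS assumed from the ranges above:

```python
# Rough end-to-end latency: time to first token plus generation time.
# The default TTFT and TPS values are assumptions within the ranges above.
def estimated_latency_s(output_tokens: int, ttft_s: float = 0.4, tps: float = 60) -> float:
    return ttft_s + output_tokens / tps

# 500 output tokens at 60 tok/s: ~8.7 seconds end to end.
print(estimated_latency_s(500))
```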
Reducing tokens
The single biggest cost lever.
Trim system prompts
Common waste:
- Long, unfocused instructions (“be helpful, be polite, be concise, be honest, …”).
- Verbose few-shot examples that could be shorter.
- Tool definitions for tools never used.
- Repeated explanations.
A focused 500-token system prompt often outperforms a 5000-token sprawling one.
Use prompt caching
Anthropic, OpenAI, and Gemini all support caching of static prompt prefixes; cache hits are billed at roughly 10% of the normal input price (exact discounts vary by provider).
Structure prompts so static content is at the front.
[STATIC: system prompt, tool defs, RAG context that's reused] <- cached
[DYNAMIC: current user message] <- not cached
For long-context apps (load a doc, ask many questions), caching is transformative.
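A minimal sketch of the static/dynamic split using Anthropic's cache_control marker (the model name is a placeholder and the static prompt is a stand-in; OpenAI and Gemini expose equivalent mechanisms with different syntax):

```python
import anthropic

# Placeholders standing in for your real static prefix and dynamic input.
LONG_STATIC_SYSTEM_PROMPT = "...system prompt, tool docs, reused RAG context..."
user_message = "What changed in the Q3 report?"

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-latest",  # placeholder; use a current model ID
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_STATIC_SYSTEM_PROMPT,
            # Marks everything up to this block as cacheable across calls.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": user_message}],  # dynamic, not cached
)
```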
Trim retrieval context
If you retrieve 10 chunks but only 2 matter, you’re paying for 8. Tighten:
- Better retrieval / reranking.
- Dynamic k (fewer for clear queries, more for ambiguous).
- Compact context: extract relevant sentences, not full chunks.
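A sketch of that tightening, using hypothetical retrieve() and rerank() helpers; the threshold and k values are arbitrary:

```python
def build_context(query: str, retrieve, rerank,
                  max_k: int = 10, min_score: float = 0.6) -> str:
    # Hypothetical helpers: retrieve() returns candidate chunks,
    # rerank() scores a (query, chunk) pair between 0 and 1.
    candidates = retrieve(query, k=max_k)
    scored = sorted(((rerank(query, c), c) for c in candidates),
                    key=lambda pair: pair[0], reverse=True)
    # Keep only chunks the reranker rates as relevant; dynamic k falls out
    # naturally: clear queries keep 1-2 chunks, ambiguous ones keep more.
    return "\n\n".join(chunk for score, chunk in scored if score >= min_score)
```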
Shorter outputs
- Tell the model to be concise.
- Use schemas that limit output.
- Hard-cap with `max_tokens`.
Right-sizing the model
Don’t use Opus when Haiku works.
Tiered routing
Cheap-then-expensive:
```python
def respond(query):
    # Try the cheap model first; escalate only when confidence is low.
    answer = haiku.complete(query)
    if classify_confidence(answer) < threshold:
        answer = sonnet.complete(query)
    return answer
```
Or topic-based routing:
```python
def respond(query):
    # Send known-hard topics straight to the stronger model.
    if classify_topic(query) in HARD_TOPICS:
        return sonnet.complete(query)
    return haiku.complete(query)
```
Often saves 60–80% of cost without quality loss.
Distillation
Train a small fine-tuned model to match a frontier model on your task. 10–100× cheaper inference; comparable quality on the trained distribution.
Self-hosted
At high volume, self-hosting beats APIs:
- $5K/month rented GPU vs. $50K/month API for the same throughput.
- Break-even is roughly hundreds of millions of tokens/month.
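The break-even math is one line; the GPU figure is from above and the blended API price is an assumption:

```python
# Break-even volume for self-hosting, using the $5K/month GPU figure above and
# an assumed blended API price of $10 per 1M tokens (mix of input and output).
gpu_cost_per_month = 5_000      # $/month, rented GPU (from above)
api_price_per_mtok = 10.00      # $/1M tokens, blended (assumption)

break_even_mtok = gpu_cost_per_month / api_price_per_mtok
print(f"Break-even at ~{break_even_mtok:,.0f}M tokens/month")  # ~500M tokens/month
```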
Reducing latency
Streaming
Show tokens as they arrive. For chat, this is the difference between “broken” and “great” UX.
```python
for chunk in client.stream(...):
    yield chunk.delta.text
```
Reduces perceived latency dramatically even when total latency is the same.
Smaller models / specialty hardware
For latency-critical paths:
- Smaller models = faster.
- Groq, Cerebras, SambaNova offer LLM inference at hundreds of TPS — sometimes 5–10× faster than mainstream providers.
Parallel requests
If you need 5 independent generations, make them concurrent:
```python
results = await asyncio.gather(*[llm.complete(prompt) for prompt in prompts])
```
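A slightly fuller sketch with bounded concurrency, so a burst of parallel calls does not trip provider rate limits (llm.complete stands in for your async client call):

```python
import asyncio

async def complete_all(llm, prompts, max_concurrency: int = 5):
    # Bound concurrency so a burst of parallel calls stays under rate limits.
    sem = asyncio.Semaphore(max_concurrency)

    async def one(prompt):
        async with sem:
            return await llm.complete(prompt)  # stand-in for your async client call

    return await asyncio.gather(*(one(p) for p in prompts))

# results = asyncio.run(complete_all(llm, prompts))
```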
Speculative decoding
A small “draft” model proposes tokens; the big model verifies in parallel. 2–3× speedup with no quality change.
vLLM and other inference engines support this with minimal config.
Caching at the result level
For exact-match queries, cache full responses. Common in B2B FAQ-style apps.
For semantic-match: embed the query; if a similar past query has a cached answer, reuse.
```python
similar = cache.search(embed(query), threshold=0.95)
if similar:
    return cached_response[similar.id]
```
Watch for: stale answers, leakage across users, false-positive reuse.
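A self-contained version of the idea, using cosine similarity over stored query embeddings; embed() stands in for your embedding call and 0.95 is an arbitrary threshold:

```python
import numpy as np

class SemanticCache:
    # Minimal in-memory semantic cache; embed() returns a vector for a string.
    def __init__(self, embed, threshold: float = 0.95):
        self.embed, self.threshold = embed, threshold
        self.vectors, self.responses = [], []

    def get(self, query: str):
        if not self.vectors:
            return None
        q = self.embed(query)
        sims = [np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))
                for v in self.vectors]
        best = int(np.argmax(sims))
        return self.responses[best] if sims[best] >= self.threshold else None

    def put(self, query: str, response: str):
        self.vectors.append(self.embed(query))
        self.responses.append(response)
```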
Prefix caching (vLLM)
Self-hosted inference engines cache the KV computation of prompt prefixes. Same idea as API prompt caching, server-side.
Reducing tool / RAG latency
For agents and RAG:
- Parallel tool calls: when independent, call simultaneously.
- Cached retrieval: if many queries use the same retrievals, cache.
- Async pipelines: pre-fetch likely-needed context.
- Pre-compute embeddings once at index time.
Batching
For batch use cases (offline eval, bulk processing), use batch APIs:
- OpenAI Batch API: 50% off with a 24-hour SLA (flow sketched after this list).
- Anthropic Message Batches: 50% off with a 24-hour SLA.
- Self-hosted: process queues with continuous batching (vLLM does this natively).
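A sketch of the OpenAI batch flow (endpoint and file format as documented at the time of writing; check the current docs before relying on the details):

```python
from openai import OpenAI

client = OpenAI()

# requests.jsonl: one JSON object per line, each with custom_id, method, url, body.
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",   # the 50%-off, 24-hour SLA tier
)
print(batch.id, batch.status)
```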
Cost of reasoning models
Reasoning is expensive — internal thinking tokens are paid. A query that “thinks” for 10k tokens before producing 200 visible tokens charges for 10,200 tokens.
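At an assumed $15 per 1M output tokens, the difference is stark:

```python
# Assumed output rate of $15 per 1M tokens, within the frontier range above.
print(10_200 * 15 / 1_000_000)  # ~$0.153 per query with reasoning tokens
print(200 * 15 / 1_000_000)     # ~$0.003 for the visible answer alone
```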
When NOT to use reasoning:
- Simple lookups.
- Tasks where regular models do fine.
- Latency-critical paths.
When to use reasoning:
- Multi-step math, code, logic.
- Tasks that benefit from search-and-verify.
- High-stakes single queries where cost is acceptable.
Cost monitoring
We covered this in observability-and-tracing.md. The practical essentials:
- Per-feature attribution: tag requests; aggregate cost.
- Anomaly alerts: cost > 2× last week → investigate.
- Per-customer cost: critical for B2B with usage tiers.
- Pre-launch projections: estimate cost at expected scale before launching.
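A minimal per-feature attribution pattern, assuming you already record token counts per request; the feature names and prices are placeholders:

```python
from collections import defaultdict

# Placeholder prices; swap in your model's real rates.
INPUT_PRICE, OUTPUT_PRICE = 3.00, 15.00   # $ per 1M tokens

cost_by_feature = defaultdict(float)

def record(feature: str, input_tokens: int, output_tokens: int) -> None:
    # Tag every request with the feature that triggered it, then aggregate.
    cost_by_feature[feature] += (input_tokens * INPUT_PRICE
                                 + output_tokens * OUTPUT_PRICE) / 1_000_000

record("support_bot", 2_000, 400)
record("summarizer", 8_000, 1_000)
print(dict(cost_by_feature))
```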
Practical playbook
For a new LLM feature, in order:
- Build with the right model. Don’t over- or under-buy.
- Cache static prefixes (prompt caching).
- Trim system prompts.
- Stream output.
- Cache exact-match responses for repeat queries.
- Set cost and latency alerts.
- Measure before optimizing.
- At scale, consider self-hosted + distillation.
Pitfalls
- Over-engineering before measurement: optimizing latency that’s already fine.
- Forgetting reasoning token cost.
- Cache-key bugs: cache hits across users when they shouldn’t.
- Stale caches: old prompts cached after a system prompt change.
- Hidden retries: timeout handling that retries 3× silently.
Watch it interactively
- Cost & Latency Calculator — model picker + token sliders + optimization toggle (prompt cache, batching, speculative decoding, quantization). Predict before clicking: at 2000 input + 400 output tokens, switching from Opus to Haiku drops $/req ~10×. Adding prompt cache to Sonnet drops it another ~2× when 50% of the prefix is shared.
- Quantization Lab — slide bits from 16 → 8 → 4 → 2; watch RMSE rise and memory drop. The math (q = round(v/scale + zero)) is on the page.
- KV Cache — toggle caching off; watch per-step compute go quadratic.
- Observability Trace — perturbation toggle showing what “slow LLM” or “extra retry” looks like in the trace.
Build it in code
- /ship/14 — cost and latency tuning — five levers ranked by ROI (prompt cache, prefix cache, continuous batching, AWQ quantization, speculative decoding). Real benchmark table: 4× cheaper, 4× faster, quality flat.
- /case-studies/04 — customer-support bot — a real product running with all these levers applied; the ROI math (~$23K/month net for 5K conversations) is the payoff.