Cost & Latency

LLM economics decide whether a feature ships. A great feature that costs $5 per use is broken; a useful feature that responds in 30 seconds is broken. This article is about getting cost and latency under control without giving up quality.

Why cost is hard

LLM cost is a function of:

  • Tokens in (input).
  • Tokens out (output).
  • Model choice.
  • Frequency of calls.
  • Hidden tokens (reasoning, tool definitions, system prompts).

Innocent-looking choices can 100× the bill:

  • A 10k-token system prompt repeated on every call.
  • A debug flag accidentally left on, doubling tokens.
  • A retry loop that pays 5× for every failed call.
  • A reasoning model on a task that didn’t need one.

Pricing intuition (early 2026)

Rough order-of-magnitude:

Tier                 | Input $/1M tok       | Output $/1M tok      | Examples
Frontier             | $5–$30               | $15–$120             | Opus, GPT-5, Gemini Ultra
Mid                  | $1–$5                | $3–$15               | Sonnet, GPT-4o, Gemini Pro
Cheap                | $0.10–$1             | $0.30–$3             | Haiku, GPT-4o-mini, Gemini Flash
Local (self-hosted)  | depends on hardware  | depends on hardware  | LLaMA, Mistral, Qwen

Output tokens are typically 3–4× the price of input. Long completions are expensive.

For reasoning models, internal reasoning tokens count as output: a query that "thinks" for 10k tokens pays the output rate on all of them.
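
As a sanity check, a tiny estimator turns these rates into dollars per request (the default prices are the illustrative mid-tier figures from the table above, not quotes from any provider):

def cost_per_request(input_tokens, output_tokens, reasoning_tokens=0,
                     input_price=3.0, output_price=15.0):
    # prices are $/1M tokens; reasoning tokens are billed at the output rate
    billed_output = output_tokens + reasoning_tokens
    return (input_tokens * input_price + billed_output * output_price) / 1_000_000

print(cost_per_request(10_000, 500))          # ~$0.0375: 10k-token prompt, 500-token answer
print(cost_per_request(10_000, 500, 10_000))  # ~$0.19: same call with 10k thinking tokens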

Latency intuition

For a typical chat-completion call:

  • Time to first token (TTFT): 100–500ms (frontier APIs), faster with prompt caching.
  • Tokens per second (TPS): 30–100 for frontier, 100–500 for smaller models or specialty hardware (Groq, Cerebras).
  • Total latency for 500 output tokens: 1–10 seconds.

For reasoning models with long thinking traces: 10–60+ seconds is common.
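
A back-of-envelope latency model is simply TTFT plus output tokens divided by decode speed; a minimal sketch using the figures above:

def estimated_latency_seconds(output_tokens, ttft=0.3, tokens_per_second=50):
    # total ≈ time to first token + time to decode the output (and any reasoning) tokens
    return ttft + output_tokens / tokens_per_second

print(estimated_latency_seconds(500))                         # ~10.3s at 50 TPS
print(estimated_latency_seconds(500, tokens_per_second=300))  # ~2.0s on fast specialty hardware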

Reducing tokens

The single biggest cost lever.

Trim system prompts

Common waste:

  • Long, unfocused instructions (“be helpful, be polite, be concise, be honest, …”).
  • Verbose few-shot examples that could be shorter.
  • Tool definitions for tools never used.
  • Repeated explanations.

A focused 500-token system prompt often outperforms a 5000-token sprawling one.

Use prompt caching

Anthropic, OpenAI, and Gemini all support caching of static prompt prefixes. Cache hits can cost as little as ~10% of the normal input price, though the exact discount varies by provider.

Structure prompts so static content is at the front.

[STATIC: system prompt, tool defs, RAG context that's reused]   <- cached
[DYNAMIC: current user message]                                  <- not cached

For long-context apps (load a doc, ask many questions), caching is transformative.
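
A minimal sketch with the Anthropic SDK, which marks the end of the static prefix with a cache_control block (the model name, STATIC_SYSTEM_PROMPT, and user_message are placeholders; other providers cache automatically or use their own markers, so check your provider's docs):

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",                       # substitute whatever model you use
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STATIC_SYSTEM_PROMPT,            # the long static prefix you want cached
            "cache_control": {"type": "ephemeral"},  # cache everything up to this marker
        }
    ],
    messages=[{"role": "user", "content": user_message}],  # dynamic part, not cached
)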

Trim retrieval context

If you retrieve 10 chunks but only 2 matter, you’re paying for 8. Tighten:

  • Better retrieval / reranking (see the sketch after this list).
  • Dynamic k (fewer for clear queries, more for ambiguous).
  • Compact context: extract relevant sentences, not full chunks.
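
A hedged sketch of the reranking approach: score each retrieved chunk against the query and keep only the few that clear a cutoff (reranker here is a stand-in for whatever cross-encoder or rerank API you use):

def trim_context(query, chunks, reranker, score_cutoff=0.5, max_keep=3):
    # score each chunk against the query, highest first
    scored = sorted(((reranker.score(query, c), c) for c in chunks),
                    key=lambda pair: pair[0], reverse=True)
    # keep only chunks that are both high-scoring and within the budget
    return "\n\n".join(c for score, c in scored[:max_keep] if score >= score_cutoff)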

Shorter outputs

  • Tell the model to be concise.
  • Use schemas that limit output.
  • Hard-cap with max_tokens (example below).
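
For instance, with the OpenAI SDK (the same two levers exist in other SDKs; the model name is just an example and user_question is a placeholder):

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Answer in at most three sentences."},  # ask for brevity
        {"role": "user", "content": user_question},
    ],
    max_tokens=200,  # hard cap: generation stops here regardless of what the model "wants"
)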

Right-sizing the model

Don’t use Opus when Haiku works.

Tiered routing

Cheap-then-expensive:

def respond(query):
    # try the cheap model first
    answer = haiku.complete(query)
    # escalate only when the cheap answer looks unreliable
    if classify_confidence(answer) < threshold:
        answer = sonnet.complete(query)
    return answer

Or topic-based routing:

def respond(query):
    if classify_topic(query) in HARD_TOPICS:
        return sonnet.complete(query)
    return haiku.complete(query)

Often saves 60–80% of cost without quality loss.

Distillation

Train a small fine-tuned model to match a frontier model on your task. 10–100× cheaper inference; comparable quality on the trained distribution.

Self-hosted

At high volume, self-hosting beats APIs:

  • $5K/month rented GPU vs. $50K/month API for the same throughput.
  • Break-even is roughly hundreds of millions of tokens/month.
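
The break-even arithmetic fits in a few lines (the blended price and GPU cost are the illustrative figures from this section, not quotes):

def api_cost_per_month(tokens_per_month, blended_price_per_million=10.0):
    # blended input+output price in $/1M tokens; illustrative, not a quote
    return tokens_per_month / 1_000_000 * blended_price_per_million

gpu_cost_per_month = 5_000  # $/month of rented GPUs, from the example above
for tokens in (50e6, 500e6, 5e9):
    api = api_cost_per_month(tokens)
    print(f"{tokens / 1e6:,.0f}M tok/month: API ~${api:,.0f} vs self-hosted ~${gpu_cost_per_month:,}")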

Reducing latency

Streaming

Show tokens as they arrive. For chat, this is the difference between “broken” and “great” UX.

# stream tokens to the user as they arrive (every major SDK exposes a streaming iterator)
for chunk in client.stream(...):
    yield chunk.delta.text

Reduces perceived latency dramatically even when total latency is the same.

Smaller models / specialty hardware

For latency-critical paths:

  • Smaller models = faster.
  • Groq, Cerebras, SambaNova offer LLM inference at hundreds of TPS — sometimes 5–10× faster than mainstream providers.

Parallel requests

If you need 5 independent generations, make them concurrent:

# inside an async function: fire the independent calls concurrently and await them together
results = await asyncio.gather(*[llm.complete(prompt) for prompt in prompts])

Speculative decoding

A small “draft” model proposes tokens; the big model verifies in parallel. 2–3× speedup with no quality change.

vLLM and other inference engines support this with minimal config.
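
A toy greedy version of the accept/verify loop, just to show the mechanics (in real engines the target model verifies all draft tokens in one batched forward pass and uses a probabilistic acceptance rule; draft_next and target_next are assumed to be any callables that map a token sequence to the next token):

def speculative_decode(target_next, draft_next, prompt, k=4, max_new_tokens=64):
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        # the cheap draft model proposes k tokens ahead
        ctx, proposals = list(out), []
        for _ in range(k):
            proposals.append(draft_next(ctx))
            ctx.append(proposals[-1])
        # the target model checks the proposals: keep the agreeing prefix,
        # and on the first disagreement take the target's own token instead
        for token in proposals:
            if target_next(out) == token:
                out.append(token)
            else:
                out.append(target_next(out))
                break
    return out[len(prompt):]

The output matches plain greedy decoding of the target model token for token, which is why quality is unchanged; the speedup comes from verifying the draft tokens in parallel rather than one at a time.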

Caching at the result level

For exact-match queries, cache full responses. Common in B2B FAQ-style apps.

For semantic matches: embed the query; if a sufficiently similar past query has a cached answer, reuse it.

similar = cache.search(embed(query), threshold=0.95)
if similar:
    return cached_response[similar.id]

Watch for: stale answers, leakage across users, false-positive reuse.
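
A hedged sketch that guards against those pitfalls by scoping lookups to the user (or tenant) and expiring old entries; cache, embed, and the hit object's fields are stand-ins for your own components:

import time

TTL_SECONDS = 24 * 3600

def cached_answer(user_id, query, cache, embed):
    # search only within this user's namespace so answers never leak across tenants
    hit = cache.search(embed(query), namespace=user_id, threshold=0.95)
    if hit and time.time() - hit.created_at < TTL_SECONDS:
        return hit.response  # fresh, same-user, high-similarity: safe to reuse
    return None              # miss: the caller generates a fresh answer and stores it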

Prefix caching (vLLM)

Self-hosted inference engines cache the KV computation of prompt prefixes. Same idea as API prompt caching, server-side.
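
In vLLM it is a single engine flag (a minimal sketch; recent versions may enable it by default, and SHARED_PREFIX / questions are placeholders):

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)

# calls that share the same long prefix reuse its cached KV computation
outputs = llm.generate(
    [SHARED_PREFIX + q for q in questions],
    SamplingParams(max_tokens=200),
)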

Reducing tool / RAG latency

For agents and RAG:

  • Parallel tool calls: when independent, call simultaneously.
  • Cached retrieval: if many queries use the same retrievals, cache.
  • Async pipelines: pre-fetch likely-needed context (sketch below).
  • Pre-compute embeddings once at index time.
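
A minimal sketch of the pre-fetch idea with asyncio; retriever, llm, load_history, and build_prompt are stand-ins for your own async components:

import asyncio

async def answer(query, retriever, llm):
    # start retrieval immediately; it runs while the rest of the per-request setup happens
    retrieval_task = asyncio.create_task(retriever.search(query))
    history = await load_history(query)   # hypothetical helper; overlaps with retrieval
    chunks = await retrieval_task
    return await llm.complete(build_prompt(history, chunks, query))  # hypothetical helper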

Batching

For batch use cases (offline eval, bulk processing), use batch APIs:

  • OpenAI Batch API: 50% off with a 24-hour completion window (submission sketch below).
  • Anthropic Message Batches: 50% off with the same 24-hour window.
  • Self-hosted: process queues with continuous batching (vLLM does this natively).
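
The OpenAI flow looks roughly like this (a hedged sketch: upload a JSONL file of requests, then create the batch; requests.jsonl is your own file with one request per line):

from openai import OpenAI

client = OpenAI()

# requests.jsonl: one chat-completion request per line, each with a custom_id
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # the 24-hour window that buys the 50% discount
)
print(batch.id, batch.status)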

Cost of reasoning models

Reasoning is expensive because internal thinking tokens are billed as output. A query that "thinks" for 10k tokens before producing 200 visible tokens is billed for 10,200 output tokens.

When NOT to use reasoning:

  • Simple lookups.
  • Tasks where regular models do fine.
  • Latency-critical paths.

When to use reasoning:

  • Multi-step math, code, logic.
  • Tasks that benefit from search-and-verify.
  • High-stakes single queries where cost is acceptable.

Cost monitoring

We covered this in observability-and-tracing.md. In practice:

  • Per-feature attribution: tag requests and aggregate cost per tag (sketch below).
  • Anomaly alerts: cost > 2× last week → investigate.
  • Per-customer cost: critical for B2B with usage tiers.
  • Pre-launch projections: estimate cost at expected scale before launching.
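
A minimal sketch of per-feature attribution: record a cost per call keyed by a feature tag, then aggregate (prices are the illustrative figures from earlier; where you store and chart this is up to you):

from collections import defaultdict

cost_by_feature = defaultdict(float)

def record_call(feature, input_tokens, output_tokens, input_price=3.0, output_price=15.0):
    # prices in $/1M tokens; illustrative figures, not quotes
    cost = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    cost_by_feature[feature] += cost
    return cost

record_call("support_bot", 4_000, 300)
record_call("summarizer", 12_000, 800)
print(dict(cost_by_feature))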

Practical playbook

For a new LLM feature, in order:

  1. Build with the right model. Don’t over- or under-buy.
  2. Cache static prefixes (prompt caching).
  3. Trim system prompts.
  4. Stream output.
  5. Cache exact-match responses for repeat queries.
  6. Set cost and latency alerts.
  7. Measure before optimizing.
  8. At scale, consider self-hosted + distillation.

Pitfalls

  • Over-engineering before measurement: optimizing latency that’s already fine.
  • Forgetting reasoning token cost.
  • Cache-key bugs: cache hits across users when they shouldn’t.
  • Stale caches: old prompts cached after a system prompt change.
  • Hidden retries: timeout handling that retries 3× silently.

Watch it interactively

  • Cost & Latency Calculator — model picker + token sliders + optimization toggle (prompt cache, batching, speculative decoding, quantization). Predict before clicking: at 2000 input + 400 output tokens, switching from Opus to Haiku drops $/req ~10×. Adding prompt cache to Sonnet drops it another ~2× when 50% of the prefix is shared.
  • Quantization Lab — slide bits from 16 → 8 → 4 → 2; watch RMSE rise and memory drop. The math (q = round(v/scale + zero)) is on the page.
  • KV Cache — toggle caching off; watch per-step compute go quadratic.
  • Observability Trace — perturbation toggle showing what “slow LLM” or “extra retry” looks like in the trace.

Build it in code

  • /ship/14 — cost and latency tuning — five levers ranked by ROI (prompt cache, prefix cache, continuous batching, AWQ quantization, speculative decoding). Real benchmark table: 4× cheaper, 4× faster, quality flat.
  • /case-studies/04 — customer-support bot — a real product running with all these levers applied; the ROI math (~$23K/month net for 5K conversations) is the payoff.

See also