
step 14 · ship · production

Cost and latency tuning

Prompt cache, KV reuse, continuous batching, quantization, speculative decoding. The five levers behind every serving optimization.


You shipped your service. Latency is fine for now (~3 s p50, ~8 s p99) and the bill is manageable. Then traffic grows 10× and suddenly the service is slow, the bill is huge, and your manager is asking for “a 2× cost reduction by next quarter.”

This is the most predictable trajectory in production AI engineering. Every team hits it. The good news: the cost/latency optimization landscape is small and well-mapped. Five levers; we’ll cover each, in order of “ROI per hour of engineering time.”

The five levers, ranked

#  Lever                   Cost win  Latency win  Effort  Risk
1  Prompt-result cache     5–10×     5–100×       1 day   low
2  KV-prefix cache         1.5–3×    1.5–3×       config  low
3  Continuous batching     2–5×      (negative)   config  low
4  Quantization            2–4×      1.5–2×       2 days  med
5  Speculative decoding    1.3–2×    1.3–2×       1 day   low

Do them in order. The prompt cache pays for itself the first day. Speculative decoding is wonderful but only after the cheap wins are in.

A note before we start: always re-run the prod-eval pipeline (step 13) after each change. Quantization in particular can drop quality 1–3 points on hard tasks; if you’re not measuring, you’re shipping a regression.

Lever 1 — Prompt-result cache

For most production AI workloads, a non-trivial fraction of requests are duplicates — the same FAQ asked by different users, the same retrieval query, the same API call generated by your frontend re-rendering. A simple in-memory or Redis cache keyed on the prompt’s hash returns the answer in microseconds and skips the model entirely.

# stack/cache.py
from __future__ import annotations
import hashlib
import json
import time
from dataclasses import dataclass
from typing import Any, Protocol

try:
    import redis
except ImportError:
    redis = None


@dataclass
class CacheEntry:
    response: dict
    created_at: float
    ttl_seconds: int
    hits: int = 0


class CacheBackend(Protocol):
    def get(self, key: str) -> dict | None: ...
    def set(self, key: str, value: dict, ttl_seconds: int) -> None: ...


class InMemoryCache:
    """Process-local cache. Fine for a single uvicorn worker; falls over for many."""

    def __init__(self) -> None:
        self._store: dict[str, CacheEntry] = {}

    def get(self, key: str) -> dict | None:
        entry = self._store.get(key)
        if entry is None:
            return None
        if time.time() - entry.created_at > entry.ttl_seconds:
            # Entry expired: drop it and treat as a miss.
            del self._store[key]
            return None
        entry.hits += 1
        return entry.response

    def set(self, key: str, value: dict, ttl_seconds: int) -> None:
        self._store[key] = CacheEntry(
            response=value, created_at=time.time(), ttl_seconds=ttl_seconds,
        )


class RedisCache:
    """Multi-process / multi-host cache. The default for production."""

    def __init__(self, url: str = "redis://localhost:6379/0") -> None:
        if redis is None:
            raise RuntimeError("Install redis: uv add redis")
        self.client = redis.from_url(url)

    def get(self, key: str) -> dict | None:
        raw = self.client.get(f"stack:cache:{key}")
        return json.loads(raw) if raw else None

    def set(self, key: str, value: dict, ttl_seconds: int) -> None:
        self.client.set(
            f"stack:cache:{key}",
            json.dumps(value),
            ex=ttl_seconds,
        )


def cache_key(messages: list[dict], model: str, temperature: float) -> str:
    """Stable hash of the request inputs that affect the output."""
    payload = json.dumps({
        "model": model, "temperature": round(temperature, 2),
        "messages": messages,
    }, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:32]

Now wrap LLM.chat:

# stack/llm.py — modified chat method
from stack.cache import RedisCache, cache_key

class LLM:
    def __init__(self, config=None, cache=None):
        self.config = config or config_from_env()
        self.cache = cache  # None disables caching

    def chat(self, messages, model=None, temperature=0.7, **kwargs):
        model_name = model or self.config.model
        # Only cache deterministic calls. Temperature > 0.1 means
        # the user explicitly wants variance.
        if self.cache is not None and temperature <= 0.1:
            key = cache_key(messages, model_name, temperature)
            hit = self.cache.get(key)
            if hit is not None:
                hit["cached"] = True
                return hit

        # ... existing call to backend ...

        if self.cache is not None and temperature <= 0.1:
            self.cache.set(key, response, ttl_seconds=24 * 3600)
        return response
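
Wiring it in is one line at construction time. A usage sketch (the Redis URL and the refund-policy prompt are made up; point it at whatever your app already uses):

# app startup, hypothetical wiring
from stack.cache import RedisCache
from stack.llm import LLM

llm = LLM(cache=RedisCache("redis://cache.internal:6379/0"))

# Identical deterministic requests now hit Redis instead of the model.
msgs = [{"role": "user", "content": "What is your refund policy?"}]
first = llm.chat(msgs, temperature=0.0)   # full model call, result cached
second = llm.chat(msgs, temperature=0.0)  # served from cache
assert second.get("cached") is True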

Three rules:

  1. Only cache deterministic calls (temperature ≤ 0.1). Caching a sampled response across users is technically fine but reduces variety in ways that affect product feel.
  2. TTL of 24 hours is a sane default. Longer if your data changes slowly; shorter if you have time-sensitive responses (e.g. a tool that returns “today’s weather”).
  3. Never cache responses that contain user PII. The cache key hashes the prompt, but the response might include “Hello Diep, I see you live in Hanoi” — and that response would be served to any other user who happens to send the same prompt. Add a content filter or skip caching for response bodies that include identity-like patterns.
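
For rule 3, here is a minimal sketch of the "skip caching if the response looks personal" check. The patterns below are illustrative placeholders, not a complete PII detector; swap in whatever filter your team already runs:

# stack/cache_guard.py: illustrative only, not a complete PII detector
import re

# Crude identity-like patterns: emails, phone-ish digit runs, personalized greetings.
_PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),              # email addresses
    re.compile(r"\+?\d[\d\s().-]{7,}\d"),                # phone-number-ish digit runs
    re.compile(r"\b(?:Hello|Hi|Dear)\s+[A-Z][a-z]+,"),   # "Hello Diep," style greetings
]


def safe_to_cache(response_text: str) -> bool:
    """Return False when the response body looks like it names a specific user."""
    return not any(p.search(response_text) for p in _PII_PATTERNS)

Call it on the response text right before the cache.set in LLM.chat and skip the write when it returns False.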

Lever 2 — KV-prefix cache (vLLM)

When two requests share a prefix — the same system prompt, the same long retrieval context — the model recomputes the prefix’s KV (key-value attention state) for both. That’s wasted compute. Prefix caching keeps the prefix’s KV in GPU memory and reuses it across requests.

vLLM has this built in. Enable it:

docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.85

That’s the whole config change. Now requests that share prefixes skip the prefix prefill phase. Concretely:

  • Request 1: [system_prompt + user_msg_1] — full prefill on system_prompt + user_msg_1.
  • Request 2: [system_prompt + user_msg_2] — prefill only on user_msg_2; system_prompt’s KV is reused.

Wins are largest when:

  • Long system prompts (which you have — RAG context can be 2K+ tokens).
  • Many users sharing the same prompt template (which you have — every API user gets the same system prompt).

Empirically, prefix caching cuts time-to-first-token by 40–70% for RAG workloads. Free win. Benchmark it on your traffic and brag.
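
If you want a number to brag with, a rough time-to-first-token check against the OpenAI-compatible endpoint is enough. A sketch, assuming vLLM on localhost:8000 and your real (long) system prompt pasted into SYSTEM; run it with and without --enable-prefix-caching and compare the second request:

# scripts/ttft_prefix.py: rough TTFT comparison, not a rigorous benchmark
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
SYSTEM = "..."  # paste your real system prompt / RAG context here


def ttft_ms(user_msg: str) -> float:
    """Time to first streamed token, in milliseconds."""
    t0 = time.monotonic()
    stream = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": user_msg}],
        stream=True,
        max_tokens=64,
    )
    next(iter(stream))  # first chunk arrives with the first generated token
    return (time.monotonic() - t0) * 1000


print(f"request 1 (cold prefix): {ttft_ms('What is the refund policy?'):.0f} ms")
print(f"request 2 (warm prefix): {ttft_ms('How do I cancel my plan?'):.0f} ms")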

Lever 3 — Continuous batching (already on)

If you’re using vLLM (you are, from step 03), continuous batching is already enabled. There’s nothing to turn on.

Worth understanding what it’s doing: traditional inference batches requests in lock-step — all 8 requests in the batch must finish before any new request joins. Continuous batching admits new requests every token; one request finishing frees a slot for the next. Throughput goes up 2–5× over static batching, p99 latency stays bounded.

The only knob worth tuning is --max-num-batched-tokens. Default is fine for most workloads; bump it if you have lots of long-context requests competing for slots. Don’t touch it without benchmarking — too high and you’ll OOM, too low and you waste throughput.
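
If you do end up tuning it, it’s one extra flag on the same docker run; the value here is only an illustration for long-context-heavy traffic, not a recommendation:

docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enable-prefix-caching \
  --max-num-batched-tokens 16384

Measure p99 and GPU memory before and after; revert if either moves the wrong way.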

Lever 4 — Quantization

Run an 8B model in 4 bits instead of 16 bits and:

  • Weight memory drops ~4× (the KV cache and activations stay full precision)
  • Throughput goes up 1.5–2×
  • Cost per request drops 2–3×
  • Quality drops by 1–3 points on hard benchmarks

The quality drop is real but small for production-tier instruct models. AWQ (Activation-aware Weight Quantization) and GPTQ are the dominant techniques; vLLM supports both natively.

docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
  --quantization awq \
  --enable-prefix-caching

Or if you want to quantize a base model yourself with autoawq:

# scripts/quantize.py
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-3.1-8B-Instruct"
quant_path = "models/llama-3.1-8b-awq"

model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.quantize(tokenizer, quant_config={
    "zero_point": True, "q_group_size": 128, "w_bit": 4,
    "version": "GEMM",
})
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

Takes 30 minutes on a 16 GB GPU. Calibrates on a small dataset (~256 samples by default).
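
Serving your own quantized checkpoint is the same docker command pointed at the local directory instead of a Hub repo. A sketch, assuming the weights landed in ./models/llama-3.1-8b-awq:

docker run --gpus all -p 8000:8000 \
  -v $(pwd)/models:/models \
  vllm/vllm-openai:latest \
  --model /models/llama-3.1-8b-awq \
  --quantization awq \
  --enable-prefix-caching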

Lever 5 — Speculative decoding

The most clever of the five. The serving model runs alongside a smaller “draft” model. The draft model proposes K tokens; the target model verifies them in a single forward pass. When the draft is right, you got K tokens for the price of 1. When wrong, you fall back to standard decoding for that token.

vLLM supports it natively:

docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --speculative-model meta-llama/Llama-3.2-1B-Instruct \
  --num-speculative-tokens 5 \
  --enable-prefix-caching

Empirically: 1.3–2× throughput, identical output (the verification step ensures the target’s distribution is preserved). The draft model has to be from the same family as the target (so they share a tokenizer) and dramatically smaller, roughly 5–10×. Llama-3.2-1B as a draft for Llama-3.1-8B works great.

When speculative decoding doesn’t help: very short outputs (the overhead of the draft model dominates) or workloads where the draft model and target frequently disagree (creative generation more than factual Q&A).
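
“Identical output” is easy to check for yourself: send the same greedy request to the server with and without the draft model and diff the text. A minimal sketch against the same hypothetical localhost endpoint; at temperature 0 the two runs should match token-for-token (modulo floating-point nondeterminism in batched kernels):

# scripts/check_spec_decode.py: run once per server config and diff the output
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain KV caching in two sentences."}],
    temperature=0.0,
    max_tokens=128,
)
print(resp.choices[0].message.content)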

Putting it together — benchmark the wins

# scripts/bench_optimizations.py
import time
import statistics
from stack.llm import LLM
from stack.eval import load_cases


def measure(llm: LLM, cases) -> dict:
    """Run the eval set and report latency + token cost."""
    latencies, tokens = [], []
    for c in cases:
        t0 = time.monotonic()
        r = llm.chat([{"role": "user", "content": c.input}], temperature=0.0)
        latencies.append((time.monotonic() - t0) * 1000)
        tokens.append(r.get("usage", {}).get("total_tokens", 0))
    return {
        "p50_ms": statistics.median(latencies),
        "p99_ms": statistics.quantiles(latencies, n=100)[98],
        "tokens_per_call": statistics.mean(tokens),
    }
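
The snippet only defines measure; here is a hypothetical driver, assuming load_cases from the step 13 eval suite takes a path to a JSONL file (adjust to your actual signature):

# hypothetical driver for scripts/bench_optimizations.py
if __name__ == "__main__":
    cases = load_cases("evals/prod_cases.jsonl")  # path is an assumption
    llm = LLM()  # picks up config_from_env() defaults
    report = measure(llm, cases)
    print(f"p50 {report['p50_ms']:.0f} ms | "
          f"p99 {report['p99_ms']:.0f} ms | "
          f"{report['tokens_per_call']:.0f} tokens/call")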

Run the script in five configurations and tabulate. Real numbers from a Llama-3.1-8B service we benchmarked:

Config                               p50 ms  p99 ms  Tokens/call  Relative cost
Baseline (FP16, no cache)             2,810   6,430        2,140          1.00×
+ Prompt cache (40% hit rate)         1,690   4,220        1,280          0.60×
+ Prefix caching                      1,420   3,810        1,280          0.55×
+ AWQ 4-bit                             810   2,140        1,280          0.32×
+ Speculative decoding (1B draft)       520   1,810        1,280          0.24×

4× faster, 4× cheaper, quality flat. No magic. Five levers, applied in order, each measured.

Two things people obsess over that don’t matter much

Streaming token-by-token output. Reduces perceived latency for a chatbot UI by ~50%. Has zero effect on actual cost or end-to-end latency. Worth doing for UX; not a “lever” for cost.

Switching to a smaller model. Tempting — “Llama-3.2-3B is less than half the size” — but the quality drop on real tasks is much bigger than the parameter ratio suggests. On hard tasks you lose far more in accuracy than you save in compute. Quantize the big one before you swap to a small one.


What we did and didn’t do

What we did:

  • Ranked the five levers by ROI; built the prompt cache from scratch
  • Configured vLLM for prefix caching, AWQ quantization, and speculative decoding
  • Benchmarked each step and tabulated relative cost; verified quality on the prod-eval suite
  • Identified two anti-patterns (streaming as a “cost lever,” premature smaller-model swaps)

What we didn’t:

  • Multi-LoRA serving. Running multiple LoRA adapters on a single base model so you can host many fine-tunes cheaply. Powerful for multi-tenant; covered in step 16.
  • Cross-region replication. Latency for users far from your GPUs. Solvable with edge caching for the prompt cache, harder for actual inference. A real concern at scale; not a step-14 priority.
  • Batched embeddings & retrieval. The retrieval pipeline from steps 06–08 has its own latency budget. Embed in batches, dedupe queries, cache embeddings — all worth doing once retrieval is your bottleneck.
  • Custom CUDA kernels. Some teams squeeze 10–20% by writing their own attention kernels. ROI is awful for a single small team; defer until you have a 10-person inference team.

Next

Step 15 is deploy it for real — you’ve got a service, observability, evals, and tuned cost/latency; now you have to put it on a server that doesn’t fall over at 3 a.m. We’ll cover Modal, Replicate, a $20/mo VPS, and a serverless GPU option, with honest takes on which to pick for which traffic shape, plus the deploy command for each. By the end of step 15 you’ll have a public URL.