What will this actually cost?
Pick a model. Plug in your workload. See monthly bill, p50 latency, and which optimizations actually move the needle. The production-engineer's gut check, in one panel.
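Under the hood, the bill and the p50 are just token arithmetic. Here is a minimal sketch of that model in Python; every price, volume, and throughput number is made up for illustration, not any provider's actual rate:

```python
# Back-of-envelope cost and latency model -- all numbers are illustrative.
PRICE_IN = 3.00 / 1_000_000    # $ per input token (made-up list price)
PRICE_OUT = 15.00 / 1_000_000  # $ per output token (made-up list price)

requests_per_month = 2_000_000
input_tokens = 9_000           # system prompt + user message
output_tokens = 400

monthly_bill = requests_per_month * (
    input_tokens * PRICE_IN + output_tokens * PRICE_OUT
)

TTFT_S = 0.6                   # time to first token (illustrative p50)
TOKENS_PER_S = 60              # decode throughput (illustrative)
p50_latency = TTFT_S + output_tokens / TOKENS_PER_S

print(f"${monthly_bill:,.0f}/month, p50 = {p50_latency:.1f}s")
# -> $66,000/month, p50 = 7.3s
```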
The four levers
- Pick a smaller model. 90% of production traffic should go to the cheapest model that's good enough; the hard tasks escalate to a stronger one (routing sketch after this list).
- Cache the system prompt. If your 8K-token system prompt repeats on every request, cache it: 10× cheaper input on the cached tokens, ~60% lower TTFT (cost math below).
- Batch. Route latency-tolerant requests through the provider's batch API. Cuts cost roughly in half; only works if you can tolerate the batch turnaround (tradeoff math below).
- Speculative decoding. A tiny "draft model" guesses the next few tokens; the big model verifies them in one forward pass. Cost is unchanged, since the big model still scores every token, but latency drops ~40% on long outputs (toy loop below).
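The first lever, sketched as a hypothetical two-tier router. The model names and the `is_hard` heuristic are placeholders; real routers use trained classifiers or the cheap model's own confidence signal:

```python
# Hypothetical two-tier router -- model names and heuristic are placeholders.
CHEAP, STRONG = "small-model", "big-model"

def is_hard(prompt: str) -> bool:
    # Stand-in difficulty check; swap in a classifier or a confidence
    # signal from the cheap model in practice.
    return len(prompt) > 4_000 or "prove" in prompt.lower()

def route(prompt: str) -> str:
    return STRONG if is_hard(prompt) else CHEAP

# The payoff is blended cost: if 90% of traffic fits a model at 1/10th
# the price, per-request cost is 0.9*0.1 + 0.1*1.0 = 19% of the
# all-big-model baseline.
```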
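The caching lever reduces to arithmetic over which tokens hit the cache. A sketch with illustrative prices, assuming cached input bills at a tenth of the normal rate (the 10× from the bullet above):

```python
# Illustrative prompt-cache math: an 8K-token system prompt repeated on
# every request, cached input billed at 1/10th the normal rate.
PRICE_IN = 3.00 / 1_000_000   # $/input token (illustrative)
PRICE_CACHED = PRICE_IN / 10  # the 10x cached-input discount

system_tokens, user_tokens = 8_000, 1_000

uncached = (system_tokens + user_tokens) * PRICE_IN
cached = system_tokens * PRICE_CACHED + user_tokens * PRICE_IN
print(f"{uncached / cached:.1f}x cheaper input")  # -> 5.0x on this mix
```

Note the headline 10× applies only to the cached tokens themselves; on a realistic mix of cached system prompt and fresh user tokens, the blended input cost drops about 5×.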
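The batching lever is a pure cost-for-latency trade. A sketch of the split, assuming the 50% batch discount from the bullet and a made-up share of traffic that can wait:

```python
# Illustrative batch-vs-realtime split: latency-tolerant traffic (nightly
# summarization, evals, backfills) goes through a half-price batch API.
cost_per_request = 0.033  # from the cost model above (illustrative)
requests = 2_000_000
batchable = 0.6           # fraction that can wait hours (assumption)

realtime = requests * (1 - batchable) * cost_per_request
batched = requests * batchable * cost_per_request * 0.5  # 50% discount
print(f"${realtime + batched:,.0f} vs ${requests * cost_per_request:,.0f}")
# -> $46,200 vs $66,000: batching 60% of traffic saves 30% overall
```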
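Finally, speculative decoding as a toy accept/verify loop. This shows the control flow only: `draft_model` and `big_model` are placeholder objects with assumed methods, and production implementations verify all k draft tokens in one batched forward pass:

```python
# Toy speculative decoding loop -- draft_model/big_model are placeholders.
# The big model still scores every emitted token (hence "same cost"), but
# one of its forward passes can confirm several draft tokens at once,
# which is where the latency win on long outputs comes from.
def speculative_decode(prompt, draft_model, big_model, k=4, max_tokens=256):
    out = list(prompt)
    while len(out) < max_tokens:
        draft = draft_model.propose(out, k)         # k cheap guesses
        accepted = big_model.verify(out, draft)     # one big forward pass
        out.extend(accepted)                        # keep the agreed prefix
        if len(accepted) < len(draft):              # first disagreement:
            out.append(big_model.sample_next(out))  # big model's own token
    return out
```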
Numbers caveat
The dollars and milliseconds here are illustrative — late-2025 list prices, typical observed latencies. Your actual numbers will vary with provider, region, time of day, and how chatty your prompts are. Use this for the shape of the decision, not as a procurement spreadsheet.
Anchored to 13-production/cost-and-latency.