What will this actually cost?
Pick a model. Plug in your workload. See monthly bill, p50 latency, and which optimizations actually move the needle. The production-engineer's gut check, in one panel.
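Under the hood, the bill and the p50 are just token arithmetic. Here is a minimal sketch of that model in Python; every price, volume, and throughput number is made up for illustration, not any provider's actual rate:

```python
# Back-of-envelope cost and latency model -- all numbers are illustrative.
PRICE_IN = 3.00 / 1_000_000    # $ per input token (made-up list price)
PRICE_OUT = 15.00 / 1_000_000  # $ per output token (made-up list price)

requests_per_month = 2_000_000
input_tokens = 9_000           # system prompt + user message
output_tokens = 400

monthly_bill = requests_per_month * (
    input_tokens * PRICE_IN + output_tokens * PRICE_OUT
)

TTFT_S = 0.6                   # time to first token (illustrative p50)
TOKENS_PER_S = 60              # decode throughput (illustrative)
p50_latency = TTFT_S + output_tokens / TOKENS_PER_S

print(f"${monthly_bill:,.0f}/month, p50 = {p50_latency:.1f}s")
# -> $66,000/month, p50 = 7.3s
```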
The four levers
- Pick a smaller model. 90% of production traffic should go to the cheapest model that's good enough; the hard tasks escalate to a stronger one (routing sketch after this list).
- Cache the system prompt. If your 8K-token system prompt repeats on every request, cache it: 10× cheaper input on the cached tokens, ~60% lower TTFT (cost math below).
- Batch. Route latency-tolerant requests through the provider's batch API. Cuts cost roughly in half; only works if you can tolerate the batch turnaround (tradeoff math below).
- Speculative decoding. A tiny "draft model" guesses the next few tokens; the big model verifies them in one forward pass. Cost is unchanged, since the big model still scores every token, but latency drops ~40% on long outputs (toy loop below).
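The first lever, sketched as a hypothetical two-tier router. The model names and the `is_hard` heuristic are placeholders; real routers use trained classifiers or the cheap model's own confidence signal:

```python
# Hypothetical two-tier router -- model names and heuristic are placeholders.
CHEAP, STRONG = "small-model", "big-model"

def is_hard(prompt: str) -> bool:
    # Stand-in difficulty check; swap in a classifier or a confidence
    # signal from the cheap model in practice.
    return len(prompt) > 4_000 or "prove" in prompt.lower()

def route(prompt: str) -> str:
    return STRONG if is_hard(prompt) else CHEAP

# The payoff is blended cost: if 90% of traffic fits a model at 1/10th
# the price, per-request cost is 0.9*0.1 + 0.1*1.0 = 19% of the
# all-big-model baseline.
```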
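The caching lever reduces to arithmetic over which tokens hit the cache. A sketch with illustrative prices, assuming cached input bills at a tenth of the normal rate (the 10× from the bullet above):

```python
# Illustrative prompt-cache math: an 8K-token system prompt repeated on
# every request, cached input billed at 1/10th the normal rate.
PRICE_IN = 3.00 / 1_000_000   # $/input token (illustrative)
PRICE_CACHED = PRICE_IN / 10  # the 10x cached-input discount

system_tokens, user_tokens = 8_000, 1_000

uncached = (system_tokens + user_tokens) * PRICE_IN
cached = system_tokens * PRICE_CACHED + user_tokens * PRICE_IN
print(f"{uncached / cached:.1f}x cheaper input")  # -> 5.0x on this mix
```

Note the headline 10× applies only to the cached tokens themselves; on a realistic mix of cached system prompt and fresh user tokens, the blended input cost drops about 5×.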
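The batching lever is a pure cost-for-latency trade. A sketch of the split, assuming the 50% batch discount from the bullet and a made-up share of traffic that can wait:

```python
# Illustrative batch-vs-realtime split: latency-tolerant traffic (nightly
# summarization, evals, backfills) goes through a half-price batch API.
cost_per_request = 0.033  # from the cost model above (illustrative)
requests = 2_000_000
batchable = 0.6           # fraction that can wait hours (assumption)

realtime = requests * (1 - batchable) * cost_per_request
batched = requests * batchable * cost_per_request * 0.5  # 50% discount
print(f"${realtime + batched:,.0f} vs ${requests * cost_per_request:,.0f}")
# -> $46,200 vs $66,000: batching 60% of traffic saves 30% overall
```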
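Finally, speculative decoding as a toy accept/verify loop. This shows the control flow only: `draft_model` and `big_model` are placeholder objects with assumed methods, and production implementations verify all k draft tokens in one batched forward pass:

```python
# Toy speculative decoding loop -- draft_model/big_model are placeholders.
# The big model still scores every emitted token (hence "same cost"), but
# one of its forward passes can confirm several draft tokens at once,
# which is where the latency win on long outputs comes from.
def speculative_decode(prompt, draft_model, big_model, k=4, max_tokens=256):
    out = list(prompt)
    while len(out) < max_tokens:
        draft = draft_model.propose(out, k)         # k cheap guesses
        accepted = big_model.verify(out, draft)     # one big forward pass
        out.extend(accepted)                        # keep the agreed prefix
        if len(accepted) < len(draft):              # first disagreement:
            out.append(big_model.sample_next(out))  # big model's own token
    return out
```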
Numbers caveat
The dollars and milliseconds here are illustrative — late-2025 list prices, typical observed latencies. Your actual numbers will vary with provider, region, time of day, and how chatty your prompts are. Use this for the shape of the decision, not as a procurement spreadsheet.
Anchored to 13-production/cost-and-latency.