Deployment Architectures
How you serve a model shapes everything: cost, latency, reliability, security, and what’s even feasible. The decision tree starts with API or self-hosted, then branches.
API providers
Hand the request to someone else’s GPU.
Frontier model APIs
- Anthropic (Claude family).
- OpenAI (GPT-4.x, GPT-4o, o-series, embeddings).
- Google (Gemini family).
- AWS Bedrock, Azure OpenAI, Vertex AI: cloud-vendor wrappers around frontier and open models.
Open-model serving APIs
- Together AI, Fireworks, Anyscale, DeepInfra, Replicate: serve LLaMA, Qwen, Mistral, Flux, etc.
- Cerebras, Groq, SambaNova: specialty hardware, often very fast inference.
Infrastructure and compatibility APIs
- OpenAI-compatible endpoints from many providers: drop-in replacements for existing client code.
- Modal, Beam, RunPod: GPU primitives for custom serving.
When APIs win
- Speed of iteration: zero infra; ship in hours.
- Latest models: provider keeps current.
- No GPU budget: pay per token.
- Burst traffic: provider scales for you.
- Compliance: provider has SOC2 / HIPAA / etc., not you.
When APIs lose
- High volume + cost-sensitive: at millions of calls/day, self-hosting can be much cheaper (back-of-envelope sketch after this list).
- Data residency: sensitive data can’t leave your infra.
- Custom models: your fine-tune or your weights.
- Latency-critical: API round-trip latency floors at 100ms or more.
- Vendor lock-in concerns.
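To make the volume argument concrete, here is a back-of-envelope break-even calculation. Every number is an illustrative assumption; substitute your real API pricing and measured GPU throughput.

```python
# Illustrative break-even: API per-token pricing vs. a rented GPU.
# ALL NUMBERS ARE ASSUMPTIONS; plug in your real prices and throughput.
api_cost_per_m_tokens = 0.50   # $/1M tokens for a small API model (assumed)
gpu_hourly = 2.00              # $/hour for one rented inference GPU (assumed)
gpu_tokens_per_sec = 2_000     # sustained throughput with batching (assumed)

gpu_tokens_per_hour = gpu_tokens_per_sec * 3600
gpu_cost_per_m_tokens = gpu_hourly / (gpu_tokens_per_hour / 1_000_000)

print(f"self-hosted: ${gpu_cost_per_m_tokens:.2f}/1M tokens vs API: ${api_cost_per_m_tokens:.2f}/1M")
# Self-hosting only wins if the GPU stays busy; idle hours still bill.
```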
Self-hosted serving
Run the model on your own hardware (or rented GPUs).
Inference engines
vLLM
The de facto open-source standard. PagedAttention for efficient KV cache, continuous batching, structured output, LoRA serving.
```
docker run --gpus all -p 8000:8000 vllm/vllm-openai \
    --model meta-llama/Llama-3.1-8B-Instruct
```
OpenAI-compatible API. Defaults are reasonable; tune for your hardware.
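Because the server speaks the OpenAI wire protocol, any OpenAI SDK works as the client. A minimal sketch against the container above; the placeholder API key only satisfies the SDK, since vLLM ignores it unless `--api-key` is set.

```python
# Query the vLLM container above via its OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match what the server loaded
    messages=[{"role": "user", "content": "Summarize vLLM in one sentence."}],
    max_tokens=100,
)
print(resp.choices[0].message.content)
```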
SGLang
Strong on structured generation, batching, multi-turn agents. Often faster than vLLM for specific workloads.
TensorRT-LLM (NVIDIA)
The highest-performance option, NVIDIA-specific. More setup work; great if you live on Hopper / Blackwell GPUs.
TGI (HuggingFace Text Generation Inference)
Production-ready, integrates well with HF ecosystem.
llama.cpp / ollama
CPU and consumer-GPU serving. GGUF quantization. Edge / on-device. Ollama wraps it for ease of use.
MLC-LLM
Cross-platform inference, including mobile.
Hardware choices (early 2026)
- NVIDIA H100/H200/B200: cloud frontier; rent at AWS/GCP/Azure or specialty providers.
- NVIDIA A100: still around, cheaper, fine for many workloads.
- NVIDIA L40S / L4: cost-efficient inference cards.
- AMD MI300X: increasingly viable; competitive on inference for LLMs that have AMD-tuned kernels.
- Apple Silicon: M-series GPUs work for local dev / edge inference via MLX/llama.cpp.
- Specialty: Groq (LPU), Cerebras (WSE), SambaNova for very fast inference at certain sizes.
Quantization
For self-hosting, quantization is huge:
- fp16 / bf16: standard, no quality loss.
- fp8: 2× memory savings, very small quality loss on capable hardware (H100+).
- int8 / W8A8: 2× savings, slight quality drop.
- int4 / GPTQ / AWQ: 4× savings, more measurable quality drop.
- GGUF: llama.cpp’s format; many quantization levels (Q4_K_M is a common sweet spot).
Pick based on your hardware and quality budget.
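Rough weight-memory arithmetic shows why. The sketch below counts weights only; KV cache, activations, and runtime overhead add meaningfully on top.

```python
# Approximate weight memory at different precisions (weights only).
def weight_gb(params_billions: float, bits_per_param: int) -> float:
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

for bits, name in [(16, "fp16/bf16"), (8, "fp8/int8"), (4, "int4")]:
    print(f"70B @ {name}: ~{weight_gb(70, bits):.0f} GB")
# 70B @ fp16/bf16: ~140 GB -> multiple GPUs required
# 70B @ fp8/int8:  ~70 GB  -> one H100/H200 or MI300X, little headroom
# 70B @ int4:      ~35 GB  -> a single 48 GB card becomes plausible
```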
Hybrid: cheap-then-expensive
Two-tier routing:
```
Request → cheap model (Haiku / Gemini Flash) → handles 80%
              ↓ when uncertain or hard
          expensive model (Sonnet 4.6 / GPT-5)
```
Often saves 40–80% of cost without quality loss.
Implementations:
- LiteLLM: unified API across providers; route based on rules.
- OpenRouter: similar.
- Custom: simple `if classify(prompt) == hard` routing logic, as sketched below.
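A minimal version of that custom router. `call_model`, the model names, and the uncertainty heuristic are all placeholders for whatever your stack provides.

```python
# Two-tier routing: try the cheap model first, escalate when it looks unsure.
CHEAP, EXPENSIVE = "cheap-model", "expensive-model"  # placeholder names

def call_model(model: str, prompt: str) -> str:
    return ""  # placeholder: wire up your provider SDK here

def looks_uncertain(answer: str) -> bool:
    # Placeholder heuristic: empty or hedging answers trigger escalation.
    return not answer or "i'm not sure" in answer.lower()

def route(prompt: str) -> str:
    answer = call_model(CHEAP, prompt)
    if looks_uncertain(answer):
        answer = call_model(EXPENSIVE, prompt)  # the ~20% that costs the most
    return answer
```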
Edge / on-device
LLMs on phones, laptops, embedded devices.
Use cases:
- Privacy (data never leaves device).
- Offline operation.
- Ultra-low latency (no network).
- Cost (no per-call charges).
Tools:
- Apple Foundation Models / Apple Intelligence.
- Gemini Nano on Android.
- MLX (Apple Silicon).
- llama.cpp / ollama for Mac/Linux/Windows.
- mlc-llm for cross-platform.
- ONNX Runtime / TFLite for mobile.
Models that work on edge: 1–7B parameters, quantized. Strong open candidates: Phi-4-mini, Llama-3.2-3B, Gemma-2/3, Qwen3 small variants.
By 2026, useful AI features ship on consumer devices. Voice transcription, summarization, basic chat, simple RAG — all viable on phones.
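Ollama exposes an OpenAI-compatible endpoint on localhost, so edge prototypes can reuse the same client code as server deployments. A sketch, assuming the model tag was already pulled with `ollama pull llama3.2:3b`:

```python
# Talk to a local ollama instance through its OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key required by the SDK but ignored

resp = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": "Summarize this note, fully offline."}],
)
print(resp.choices[0].message.content)
```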
Multi-region and latency
For global products:
- Multi-region inference: deploy in US, EU, Asia. Lower latency for users in each region.
- CDN-style caching: edge-cached prompts/responses where allowed.
- Streaming: start delivering tokens as soon as they're available; first-token latency matters more than total time (sketch below).
API providers offer multi-region; self-hosted needs explicit multi-region orchestration.
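On any OpenAI-compatible endpoint, streaming is a one-flag change on the client side. A sketch:

```python
# Stream tokens as they arrive instead of waiting for the full completion.
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY; pass base_url= for any compatible server

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain streaming in one paragraph."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```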
Failover and reliability
Models go down. Plan for it:
- Multi-provider fallback: primary OpenAI, fallback Anthropic, final fallback to a self-hosted model (sketch below).
- LiteLLM or custom router handles this.
- Graceful degradation: when no model is available, return a useful error or use a cached response.
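A minimal fallback chain. `call_provider` is a placeholder for your per-provider wrappers; LiteLLM's router gives you the equivalent out of the box.

```python
# Try providers in order; degrade gracefully if every tier fails.
PROVIDERS = ["openai/gpt-4o", "anthropic/claude-sonnet", "self-hosted/llama-3.1-8b"]

def call_provider(name: str, prompt: str) -> str:
    # Placeholder: provider-specific SDK call goes here.
    raise ConnectionError(f"{name} unreachable")  # simulates a total outage

def generate(prompt: str) -> str:
    for name in PROVIDERS:
        try:
            return call_provider(name, prompt)
        except Exception:
            continue  # log and fall through to the next tier in a real system
    return "Service temporarily degraded; please retry."  # graceful degradation
```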
Stateful serving
For agents and chat:
- Conversation history persistence: Redis, Postgres, or a vector DB (sketch below).
- Session affinity: if you rely on prompt caching, route each user's requests to the same instance.
- Cache warming: pre-process and cache system prompts before traffic arrives.
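One way to persist history, assuming Redis and JSON-serialized messages. Key names and the TTL are illustrative choices, not requirements.

```python
# Persist per-session chat history in Redis so any replica can resume it.
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def append_turn(session_id: str, role: str, content: str) -> None:
    key = f"chat:{session_id}"
    r.rpush(key, json.dumps({"role": role, "content": content}))
    r.expire(key, 86_400)  # drop idle sessions after 24h (assumed policy)

def load_history(session_id: str) -> list[dict]:
    return [json.loads(m) for m in r.lrange(f"chat:{session_id}", 0, -1)]
```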
Fine-tuned model serving
Self-hosting is usually the right answer:
- vLLM with LoRA adapters: serve N customer-specific LoRAs from one base model (sketch below).
- Together / Anyscale fine-tuning + serving: managed flow.
- OpenAI / Anthropic / Vertex hosted fine-tunes: simplest, most expensive.
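If the vLLM server is launched with `--enable-lora` and adapters registered via `--lora-modules name=path`, a client picks an adapter by using its registered name as the model. A sketch; `customer-a` is an example adapter name.

```python
# Select a per-customer LoRA adapter on a vLLM server started with:
#   --enable-lora --lora-modules customer-a=/adapters/customer-a
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="customer-a",  # the adapter name, not the base model
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```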
Cost discipline
We unpack this in cost-and-latency.md. For deployment specifically:
- Right-size your model. Don’t use Opus when Haiku works.
- Right-size your hardware. Don’t pay for an H100 when an L40S is enough.
- Spot / preemptible instances for batch workloads.
- Reserved / committed instances for steady-state.
Practical advice
- Start with a frontier API. Ship faster; learn what’s needed.
- Self-host when economics demand, not before.
- Always have a fallback path. APIs can break.
- Streaming everywhere. UX wins.
- Cache aggressively at every layer.
- Measure before optimizing.