Deployment Architectures
How you serve a model shapes everything: cost, latency, reliability, security, and what’s even feasible. The decision tree starts with API or self-hosted, then branches.
API providers
Hand the request to someone else’s GPU.
Frontier model APIs
- Anthropic (Claude family).
- OpenAI (GPT-4.x, GPT-4o, o-series, embeddings).
- Google (Gemini family).
- AWS Bedrock, Azure OpenAI, Vertex AI: cloud-vendor wrappers around frontier and open models.
Open-model serving APIs
- Together AI, Fireworks, Anyscale, DeepInfra, Replicate: serve LLaMA, Qwen, Mistral, Flux, etc.
- Cerebras, Groq, SambaNova: specialty hardware, often very fast inference.
Infrastructure and compatibility APIs
- OpenAI-compatible endpoints from many providers: drop-in replacements for existing client code.
- Modal, Beam, RunPod: GPU primitives for custom serving.
When APIs win
- Speed of iteration: zero infra; ship in hours.
- Latest models: provider keeps current.
- No GPU budget: pay per token.
- Burst traffic: provider scales for you.
- Compliance: provider has SOC2 / HIPAA / etc., not you.
When APIs lose
- High volume + cost-sensitive: at millions of calls/day, self-hosting can be much cheaper (back-of-envelope sketch after this list).
- Data residency: sensitive data can’t leave your infra.
- Custom models: your fine-tune or your weights.
- Latency-critical: API round-trip latency floors at 100ms or more.
- Vendor lock-in concerns.
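To make the volume argument concrete, here is a back-of-envelope break-even calculation. Every number is an illustrative assumption; substitute your real API pricing and measured GPU throughput.

```python
# Illustrative break-even: API per-token pricing vs. a rented GPU.
# ALL NUMBERS ARE ASSUMPTIONS; plug in your real prices and throughput.
api_cost_per_m_tokens = 0.50   # $/1M tokens for a small API model (assumed)
gpu_hourly = 2.00              # $/hour for one rented inference GPU (assumed)
gpu_tokens_per_sec = 2_000     # sustained throughput with batching (assumed)

gpu_tokens_per_hour = gpu_tokens_per_sec * 3600
gpu_cost_per_m_tokens = gpu_hourly / (gpu_tokens_per_hour / 1_000_000)

print(f"self-hosted: ${gpu_cost_per_m_tokens:.2f}/1M tokens vs API: ${api_cost_per_m_tokens:.2f}/1M")
# Self-hosting only wins if the GPU stays busy; idle hours still bill.
```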
Self-hosted serving
Run the model on your own hardware (or rented GPUs).
Inference engines
vLLM
The de facto open-source standard. PagedAttention for efficient KV cache, continuous batching, structured output, LoRA serving.
```
docker run --gpus all -p 8000:8000 vllm/vllm-openai \
    --model meta-llama/Llama-3.1-8B-Instruct
```
OpenAI-compatible API. Defaults are reasonable; tune for your hardware.
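Because the server speaks the OpenAI wire protocol, any OpenAI SDK works as the client. A minimal sketch against the container above; the placeholder API key only satisfies the SDK, since vLLM ignores it unless `--api-key` is set.

```python
# Query the vLLM container above via its OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match what the server loaded
    messages=[{"role": "user", "content": "Summarize vLLM in one sentence."}],
    max_tokens=100,
)
print(resp.choices[0].message.content)
```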
SGLang
Strong on structured generation, batching, multi-turn agents. Often faster than vLLM for specific workloads.
TensorRT-LLM (NVIDIA)
The highest-performance option, NVIDIA-specific. More setup work; great if you live on Hopper / Blackwell GPUs.
TGI (HuggingFace Text Generation Inference)
Production-ready, integrates well with HF ecosystem.
llama.cpp / ollama
CPU and consumer-GPU serving. GGUF quantization. Edge / on-device. Ollama wraps it for ease of use.
MLC-LLM
Cross-platform inference, including mobile.
Hardware choices (early 2026)
- NVIDIA H100/H200/B200: cloud frontier; rent at AWS/GCP/Azure or specialty providers.
- NVIDIA A100: still around, cheaper, fine for many workloads.
- NVIDIA L40S / L4: cost-efficient inference cards.
- AMD MI300X: increasingly viable; competitive on inference for LLMs that have AMD-tuned kernels.
- Apple Silicon: M-series GPUs work for local dev / edge inference via MLX/llama.cpp.
- Specialty: Groq (LPU), Cerebras (WSE), SambaNova for very fast inference at certain sizes.
Quantization
For self-hosting, quantization is huge:
- fp16 / bf16: standard, no quality loss.
- fp8: 2× memory savings, very small quality loss on capable hardware (H100+).
- int8 / W8A8: 2× savings, slight quality drop.
- int4 / GPTQ / AWQ: 4× savings, more measurable quality drop.
- GGUF: llama.cpp’s format; many quantization levels (Q4_K_M is a common sweet spot).
Pick based on your hardware and quality budget.
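Rough weight-memory arithmetic shows why. The sketch below counts weights only; KV cache, activations, and runtime overhead add meaningfully on top.

```python
# Approximate weight memory at different precisions (weights only).
def weight_gb(params_billions: float, bits_per_param: int) -> float:
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

for bits, name in [(16, "fp16/bf16"), (8, "fp8/int8"), (4, "int4")]:
    print(f"70B @ {name}: ~{weight_gb(70, bits):.0f} GB")
# 70B @ fp16/bf16: ~140 GB -> multiple GPUs required
# 70B @ fp8/int8:  ~70 GB  -> one H100/H200 or MI300X, little headroom
# 70B @ int4:      ~35 GB  -> a single 48 GB card becomes plausible
```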
Hybrid: cheap-then-expensive
Two-tier routing:
```
Request → cheap model (Haiku / Gemini Flash) → handles 80%
              ↓ when uncertain or hard
          expensive model (Sonnet 4.6 / GPT-5)
```
Often saves 40–80% of cost without quality loss.
Implementations:
- LiteLLM: unified API across providers; route based on rules.
- OpenRouter: similar.
- Custom: simple `if classify(prompt) == hard` routing logic, as sketched below.
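A minimal version of that custom router. `call_model`, the model names, and the uncertainty heuristic are all placeholders for whatever your stack provides.

```python
# Two-tier routing: try the cheap model first, escalate when it looks unsure.
CHEAP, EXPENSIVE = "cheap-model", "expensive-model"  # placeholder names

def call_model(model: str, prompt: str) -> str:
    return ""  # placeholder: wire up your provider SDK here

def looks_uncertain(answer: str) -> bool:
    # Placeholder heuristic: empty or hedging answers trigger escalation.
    return not answer or "i'm not sure" in answer.lower()

def route(prompt: str) -> str:
    answer = call_model(CHEAP, prompt)
    if looks_uncertain(answer):
        answer = call_model(EXPENSIVE, prompt)  # the ~20% that costs the most
    return answer
```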
Edge / on-device
LLMs on phones, laptops, embedded devices.
Use cases:
- Privacy (data never leaves device).
- Offline operation.
- Ultra-low latency (no network).
- Cost (no per-call charges).
Tools:
- Apple Foundation Models / Apple Intelligence.
- Gemini Nano on Android.
- MLX (Apple Silicon).
- llama.cpp / ollama for Mac/Linux/Windows.
- mlc-llm for cross-platform.
- ONNX Runtime / TFLite for mobile.
Models that work on edge: 1–7B parameters, quantized. Strong open candidates: Phi-4-mini, Llama-3.2-3B, Gemma-2/3, Qwen3 small variants.
By 2026, useful AI features ship on consumer devices. Voice transcription, summarization, basic chat, simple RAG — all viable on phones.
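Ollama exposes an OpenAI-compatible endpoint on localhost, so edge prototypes can reuse the same client code as server deployments. A sketch, assuming the model tag was already pulled with `ollama pull llama3.2:3b`:

```python
# Talk to a local ollama instance through its OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key required by the SDK but ignored

resp = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": "Summarize this note, fully offline."}],
)
print(resp.choices[0].message.content)
```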
Multi-region and latency
For global products:
- Multi-region inference: deploy in US, EU, Asia. Lower latency for users in each region.
- CDN-style caching: edge-cached prompts/responses where allowed.
- Streaming: start delivering tokens as soon as they're available; first-token latency matters more than total time (sketch below).
API providers offer multi-region; self-hosted needs explicit multi-region orchestration.
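On any OpenAI-compatible endpoint, streaming is a one-flag change on the client side. A sketch:

```python
# Stream tokens as they arrive instead of waiting for the full completion.
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY; pass base_url= for any compatible server

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain streaming in one paragraph."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```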
Failover and reliability
Models go down. Plan for it:
- Multi-provider fallback: primary OpenAI, fallback Anthropic, final fallback to a self-hosted model (sketch below).
- LiteLLM or custom router handles this.
- Graceful degradation: when no model is available, return a useful error or use a cached response.
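A minimal fallback chain. `call_provider` is a placeholder for your per-provider wrappers; LiteLLM's router gives you the equivalent out of the box.

```python
# Try providers in order; degrade gracefully if every tier fails.
PROVIDERS = ["openai/gpt-4o", "anthropic/claude-sonnet", "self-hosted/llama-3.1-8b"]

def call_provider(name: str, prompt: str) -> str:
    # Placeholder: provider-specific SDK call goes here.
    raise ConnectionError(f"{name} unreachable")  # simulates a total outage

def generate(prompt: str) -> str:
    for name in PROVIDERS:
        try:
            return call_provider(name, prompt)
        except Exception:
            continue  # log and fall through to the next tier in a real system
    return "Service temporarily degraded; please retry."  # graceful degradation
```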
Stateful serving
For agents and chat:
- Conversation history persistence: Redis, Postgres, or a vector DB (sketch below).
- Session affinity: if you rely on prompt caching, route each user's requests to the same instance.
- Cache warming: pre-process and cache system prompts before traffic arrives.
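One way to persist history, assuming Redis and JSON-serialized messages. Key names and the TTL are illustrative choices, not requirements.

```python
# Persist per-session chat history in Redis so any replica can resume it.
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def append_turn(session_id: str, role: str, content: str) -> None:
    key = f"chat:{session_id}"
    r.rpush(key, json.dumps({"role": role, "content": content}))
    r.expire(key, 86_400)  # drop idle sessions after 24h (assumed policy)

def load_history(session_id: str) -> list[dict]:
    return [json.loads(m) for m in r.lrange(f"chat:{session_id}", 0, -1)]
```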
Fine-tuned model serving
Self-hosting is usually the right answer:
- vLLM with LoRA adapters: serve N customer-specific LoRAs from one base model (sketch below).
- Together / Anyscale fine-tuning + serving: managed flow.
- OpenAI / Anthropic / Vertex hosted fine-tunes: simplest, most expensive.
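If the vLLM server is launched with `--enable-lora` and adapters registered via `--lora-modules name=path`, a client picks an adapter by using its registered name as the model. A sketch; `customer-a` is an example adapter name.

```python
# Select a per-customer LoRA adapter on a vLLM server started with:
#   --enable-lora --lora-modules customer-a=/adapters/customer-a
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="customer-a",  # the adapter name, not the base model
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```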
Cost discipline
We unpack this in cost-and-latency.md. For deployment specifically:
- Right-size your model. Don’t use Opus when Haiku works.
- Right-size your hardware. Don’t pay for an H100 when an L40S is enough.
- Spot / preemptible instances for batch workloads.
- Reserved / committed instances for steady-state.
Practical advice
- Start with a frontier API. Ship faster; learn what’s needed.
- Self-host when economics demand, not before.
- Always have a fallback path. APIs can break.
- Streaming everywhere. UX wins.
- Cache aggressively at every layer.
- Measure before optimizing.