Deployment Architectures

How you serve a model shapes everything: cost, latency, reliability, security, and what’s even feasible. The decision tree starts with API or self-hosted, then branches.

API providers

Hand the request to someone else’s GPU.

Frontier model APIs

  • Anthropic (Claude family).
  • OpenAI (GPT-4.x, GPT-4o, o-series, embeddings).
  • Google (Gemini family).
  • AWS Bedrock, Azure OpenAI, Vertex AI: cloud-vendor wrappers around frontier and open models.

Open-model serving APIs

  • Together AI, Fireworks, Anyscale, DeepInfra, Replicate: serve Llama, Qwen, Mistral, Flux, etc.
  • Cerebras, Groq, SambaNova: specialty hardware, often very fast inference.

Compatibility layers and GPU infrastructure

  • OpenAI-compatible endpoints from many providers: drop-in replacement.
  • Modal, Beam, RunPod: GPU primitives for custom serving.

When APIs win

  • Speed of iteration: zero infra; ship in hours.
  • Latest models: provider keeps current.
  • No GPU budget: pay per token.
  • Burst traffic: provider scales for you.
  • Compliance: provider has SOC2 / HIPAA / etc., not you.

When APIs lose

  • High volume + cost-sensitive: at millions of calls/day, self-hosting can be much cheaper.
  • Data residency: sensitive data can’t leave your infra.
  • Custom models: your fine-tune or your weights.
  • Latency-critical: API round-trips impose a latency floor of 100 ms or more.
  • Vendor lock-in concerns.

Self-hosted serving

Run the model on your own hardware (or rented GPUs).

Inference engines

vLLM

The de facto open-source standard. PagedAttention for efficient KV cache, continuous batching, structured output, LoRA serving.

docker run --gpus all -p 8000:8000 vllm/vllm-openai \
  --model meta-llama/Llama-3.1-8B-Instruct

OpenAI-compatible API. Defaults are reasonable; tune for your hardware.
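
Because the server speaks the OpenAI wire protocol, the stock openai Python client works against it unchanged; a minimal sketch, assuming the server from the docker command above is listening on port 8000:

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server; the key is a dummy value.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=100,
)
print(response.choices[0].message.content)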

SGLang

Strong on structured generation, batching, multi-turn agents. Often faster than vLLM for specific workloads.

TensorRT-LLM (NVIDIA)

NVIDIA-specific and typically the fastest option. More setup; great if you live on Hopper / Blackwell GPUs.

TGI (HuggingFace Text Generation Inference)

Production-ready, integrates well with HF ecosystem.

llama.cpp / ollama

CPU and consumer-GPU serving. GGUF quantization. Edge / on-device. Ollama wraps it for ease of use.

MLC-LLM

Cross-platform inference, including mobile.

Hardware choices (early 2026)

  • NVIDIA H100/H200/B200: cloud frontier; rent at AWS/GCP/Azure or specialty providers.
  • NVIDIA A100: still around, cheaper, fine for many workloads.
  • NVIDIA L40S / L4: cost-efficient inference cards.
  • AMD MI300X: increasingly viable; competitive on inference for LLMs that have AMD-tuned kernels.
  • Apple Silicon: M-series GPUs work for local dev / edge inference via MLX/llama.cpp.
  • Specialty: Groq (LPU), Cerebras (WSE), SambaNova for very fast inference at certain sizes.

Quantization

For self-hosting, quantization is huge:

  • fp16 / bf16: standard, no quality loss.
  • fp8: 2× memory savings, very small quality loss on capable hardware (H100+).
  • int8 / W8A8: 2× savings, slight quality drop.
  • int4 / GPTQ / AWQ: 4× savings, more measurable quality drop.
  • GGUF: llama.cpp’s format; many quantization levels (Q4_K_M is a common sweet spot).

Pick based on your hardware and quality budget.
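
As a concrete example, vLLM can load pre-quantized checkpoints directly; a minimal sketch, assuming an AWQ-quantized checkpoint is available on the Hugging Face Hub (the repo name below is illustrative):

from vllm import LLM, SamplingParams

# 4-bit AWQ weights occupy roughly a quarter of the fp16 footprint.
llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ", quantization="awq")  # illustrative repo

outputs = llm.generate(
    ["Explain KV-cache paging in two sentences."],
    SamplingParams(max_tokens=80, temperature=0.2),
)
print(outputs[0].outputs[0].text)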

Hybrid: cheap-then-expensive

Two-tier routing:

Request → cheap model (Haiku / Gemini Flash) → handles 80%
              ↓ when uncertain or hard
              expensive model (Sonnet 4.6 / GPT-5)

Often saves 40–80% of cost with little or no quality loss.

Implementations:

  • LiteLLM: unified API across providers; route based on rules.
  • OpenRouter: similar.
  • Custom: simple if classify(prompt) == hard logic (sketched below).
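
A minimal sketch of the custom approach, here using LiteLLM as the unified client; the classify() heuristic and the model identifiers are placeholders to swap for your own:

from litellm import completion

CHEAP_MODEL = "cheap-model-id"          # e.g. a Haiku / Flash class model (placeholder id)
EXPENSIVE_MODEL = "expensive-model-id"  # e.g. a Sonnet / GPT class model (placeholder id)

def classify(prompt: str) -> str:
    # Placeholder heuristic: long prompts or explicit reasoning requests go to the big model.
    return "hard" if len(prompt) > 2000 or "step by step" in prompt.lower() else "easy"

def complete(prompt: str) -> str:
    model = EXPENSIVE_MODEL if classify(prompt) == "hard" else CHEAP_MODEL
    response = completion(model=model, messages=[{"role": "user", "content": prompt}])
    return response.choices[0].message.content

In production the heuristic is often a small classifier, or a confidence signal from the cheap model itself, retrying upward when it is uncertain.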

Edge / on-device

LLMs on phones, laptops, embedded devices.

Use cases:

  • Privacy (data never leaves device).
  • Offline operation.
  • Ultra-low latency (no network).
  • Cost (no per-call charges).

Tools:

  • Apple Foundation Models / Apple Intelligence.
  • Gemini Nano on Android.
  • MLX (Apple Silicon).
  • llama.cpp / ollama for Mac/Linux/Windows.
  • mlc-llm for cross-platform.
  • ONNX Runtime / TFLite for mobile.

Models that work on edge: 1–7B parameters, quantized. Strong open candidates: Phi-4-mini, Llama-3.2-3B, Gemma-2/3, Qwen3 small variants.

As of 2026, useful AI features ship on consumer devices. Voice transcription, summarization, basic chat, and simple RAG are all viable on phones.

Multi-region and latency

For global products:

  • Multi-region inference: deploy in US, EU, Asia. Lower latency for users in each region.
  • CDN-style caching: edge-cached prompts/responses where allowed.
  • Streaming: start delivering tokens as soon as they're available; time-to-first-token matters more than total latency (sketched below).

API providers offer multi-region; self-hosted needs explicit multi-region orchestration.
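
Streaming is a one-flag change with OpenAI-compatible APIs; a minimal sketch against the local vLLM server from earlier (the same pattern works with hosted providers):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain time-to-first-token in one paragraph."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # some chunks carry no text (role or stop markers)
        print(delta, end="", flush=True)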

Failover and reliability

Models go down. Plan for it:

  • Multi-provider fallback: primary OpenAI, fallback Anthropic, fallback to a self-hosted model.
  • LiteLLM or a custom router can handle this (sketched below).
  • Graceful degradation: when no model is available, return a useful error or use a cached response.
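
A minimal fallback sketch using LiteLLM as the unified client; the model identifiers are illustrative placeholders, and the final step is where a cached or canned response would go:

from litellm import completion

# Try providers in order; identifiers are illustrative, and a self-hosted
# OpenAI-compatible endpoint can be added by passing api_base for that entry.
FALLBACK_CHAIN = ["openai/gpt-4o", "anthropic/claude-sonnet", "ollama/llama3.1"]

def complete_with_fallback(messages: list[dict]):
    last_error = None
    for model in FALLBACK_CHAIN:
        try:
            return completion(model=model, messages=messages, timeout=30)
        except Exception as exc:  # outage, rate limit, timeout, ...
            last_error = exc
    # Graceful degradation: return a cached response or a useful error here.
    raise RuntimeError("All providers failed") from last_error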

Stateful serving

For agents and chat:

  • Conversation history persistence: Redis, Postgres, or a vector DB (sketched below).
  • Session affinity: if using prompt caching, route the user to the same instance.
  • Cache warming: pre-process and cache system prompts before traffic arrives.
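
A minimal sketch of history persistence with Redis; the key scheme, trim window, and TTL are assumptions to tune per application:

import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

MAX_TURNS = 40  # keep the last N messages per session (assumption, tune per app)

def append_message(session_id: str, role: str, content: str) -> None:
    key = f"chat:{session_id}"
    r.rpush(key, json.dumps({"role": role, "content": content}))
    r.ltrim(key, -MAX_TURNS, -1)   # drop the oldest turns beyond the window
    r.expire(key, 60 * 60 * 24)    # expire idle sessions after a day

def load_history(session_id: str) -> list[dict]:
    return [json.loads(m) for m in r.lrange(f"chat:{session_id}", 0, -1)]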

Fine-tuned model serving

Self-hosted is usually the right answer:

  • vLLM with LoRA adapters: serve N customer-specific LoRAs from one base model (sketched below).
  • Together / Anyscale fine-tuning + serving: managed flow.
  • OpenAI / Anthropic / Vertex hosted fine-tunes: simplest, most expensive.
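
A minimal multi-LoRA sketch with vLLM's offline Python API; the adapter path and naming scheme are placeholders (the OpenAI-compatible server exposes the same capability via --enable-lora and --lora-modules):

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# One base model, many customer-specific adapters selected per request.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_lora=True)

def generate_for_customer(prompt: str, customer_id: int, adapter_path: str) -> str:
    lora = LoRARequest(f"customer-{customer_id}", customer_id, adapter_path)
    outputs = llm.generate([prompt], SamplingParams(max_tokens=128), lora_request=lora)
    return outputs[0].outputs[0].text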

Cost discipline

We unpack this in cost-and-latency.md. For deployment specifically:

  • Right-size your model. Don’t use Opus when Haiku works.
  • Right-size your hardware. Don’t pay for an H100 when an L40S is enough.
  • Spot / preemptible instances for batch workloads.
  • Reserved / committed instances for steady-state.

Practical advice

  1. Start with a frontier API. Ship faster; learn what’s needed.
  2. Self-host when economics demand, not before.
  3. Always have a fallback path. APIs can break.
  4. Streaming everywhere. UX wins.
  5. Cache aggressively at every layer.
  6. Measure before optimizing.

See also