
step 16 · ship · production

Where to go from here

Distillation, post-training, multi-tenant serving, agentic patterns — what's next once the basics are live.

wrap

You’ve shipped. The stack you wrote is real: a chosen model, a serving layer, an eval harness, an API service, a RAG pipeline, tools, agents, an orchestrator, observability, prod evals, cost-tuned inference, and a public deploy. That’s more infrastructure than many shipped “AI products” are running on.

The question now is: which directions are worth investing in next? The honest answer depends on what’s actually breaking. Below are the five most common next steps, with notes on when each is worth the time.

Direction 1 — LoRA fine-tuning

What it is. Instead of training a whole model, you train tiny “adapter” matrices that modify the base model’s behavior. You end up with a 50–500 MB file (an “adapter”) that, when loaded alongside the base model, makes it behave like a fine-tuned variant. The base model stays frozen and shared across all your fine-tunes.
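The arithmetic behind the adapter is small enough to sketch. A minimal pure-Python toy (shapes and values are illustrative, not from any real model): the frozen base weight W gets a low-rank correction B·A, scaled by alpha/r, which is the standard LoRA forward pass.

```python
import random

def matvec(W, x):
    """Plain matrix-vector product over nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def lora_forward(W, A, B, x, alpha, r):
    """h = W x + (alpha / r) * B (A x): frozen base plus low-rank update."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))  # passes through the r-dim bottleneck
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low_rank)]

# Toy shapes: d_out = d_in = 4, rank r = 2.
random.seed(0)
d, r, alpha = 4, 2, 16
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen, shared
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # r x d, trained
B = [[0.0] * r for _ in range(d)]  # d x r, zero-init: adapter starts as a no-op
x = [1.0, 2.0, 3.0, 4.0]

# With B initialized to zero, the adapted layer equals the base layer exactly.
assert lora_forward(W, A, B, x, alpha, r) == matvec(W, x)
```

With r much smaller than d, the adapter holds 2·d·r parameters per layer instead of d²; at d=4096 and r=16 that’s ~131K numbers per layer, which is why a full adapter fits in tens to hundreds of megabytes.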

When it’s worth it. Only in three scenarios:

  1. You have a domain where the base model is consistently weak (e.g. medical SOAP notes, legal-citation formatting, your company’s internal tooling vocabulary). Eval scores reveal it; LoRA fixes it; you ship.
  2. You want a stable, predictable response style that prompt-engineering keeps drifting away from. Fine-tunes lock the style in.
  3. You serve many tenants (Direction 3 below) and each tenant wants their own personality.

When it’s not worth it. Almost every other case. Prompt engineering is faster, cheaper, and reversible. Fine-tuning is a six-week investment that you’ll redo from scratch with every model upgrade. Defer until prompt engineering hits a wall you’ve measured.

Where to start. The LoRA article on this site has the theory. The LoRA Lab demo lets you fiddle with rank and alpha visually. For real training, axolotl and unsloth are the dominant frameworks; both wrap PyTorch, bitsandbytes, and PEFT cleanly.

Direction 2 — Distillation

What it is. Use a large, slow, expensive model (Claude Sonnet, GPT-4o, Llama-3.1-405B) to generate training data, then train a small, fast, cheap model on that data. The student model approximates the teacher’s behavior at a fraction of the cost. Especially powerful when paired with LoRA (distill into a LoRA on top of an already-good base).

When it’s worth it. When all three of these are true:

  1. You’re paying real money for a frontier model on production traffic.
  2. Your eval suite is robust enough to verify the distilled model didn’t regress on important tasks.
  3. Your task is narrow enough that a small model can learn it (Q&A on your docs: yes; general assistant: no).

When it’s not worth it. When you’d rather be shipping features than running a 3-week training pipeline. Distillation is the right answer at scale; it’s the wrong answer when you have one engineer and 1000 daily users.

Where to start. /ship/17 — synthetic data + distillation is the full hands-on recipe: seed prompts → teacher-generated training set → LoRA student → soft-KL + hard-CE loss → router. The companion case study /case-studies/05 — the cheapest version of itself applies that pipeline to the docs assistant from CS-01 with real cost numbers (~6.5× cheaper, ~5pp parity gap closed by routing). The standard recipe in one line: capture 50K–500K traces from production, filter for the high-quality ones (your prod-eval pipeline from step 13 is the filter), use them as training data for a smaller base.
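The “soft-KL + hard-CE” loss from that recipe can be sketched in pure Python. This is a toy per-example version (lam and T are tuning knobs I’ve chosen for illustration; real training computes this per token over batches, typically in PyTorch):

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp((l - m) / T) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(student_logits, teacher_logits, hard_label, T=2.0, lam=0.5):
    """lam * hard cross-entropy + (1 - lam) * T^2 * KL(teacher || student).

    The T^2 factor rescales the soft term so its gradient magnitude matches
    the hard term (the standard distillation convention).
    """
    ce = -math.log(softmax(student_logits)[hard_label])
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    return lam * ce + (1 - lam) * T * T * kl

# A student that already matches the teacher pays only the hard-label term.
logits = [2.0, 0.5, -1.0]
matched = distill_loss(logits, logits, hard_label=0)
assert abs(matched - 0.5 * -math.log(softmax(logits)[0])) < 1e-9
```

The KL term pulls the student toward the teacher’s full output distribution (which carries more signal than the argmax alone); the CE term keeps it anchored to the ground-truth label when you have one.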

Direction 3 — Multi-tenant serving

What it is. One base model, many LoRA adapters loaded at request time, each tenant getting their custom-trained behavior at near-zero memory cost. vLLM supports this natively (--enable-lora); TGI has equivalent support.

When it’s worth it. When you’re hosting custom fine-tunes per customer (e.g. “let users upload their own data and get a customized assistant”). One base + 1000 LoRAs costs roughly the same GPU memory as one base + one LoRA. The economics are compelling.

When it’s not worth it. If all your users get the same model behavior, you don’t need this. Multi-tenancy is for B2B platforms where each customer’s tone, vocabulary, or domain differs.

Where to start. Once you’ve done Direction 1 and have a few LoRAs in hand, configuring vLLM for multi-tenant is a single flag. The harder part is the LoRA management plane: which tenant has which adapter, when to invalidate, how to charge. Build the LoRA pipeline first; multi-tenant serving is the easy part.
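A minimal sketch of that management plane, assuming vLLM’s OpenAI-compatible server with adapters registered at launch (the adapter names and paths here are hypothetical):

```python
# Assumed serving setup: one base model, adapters registered by name, e.g.
#   vllm serve meta-llama/Llama-3.1-8B-Instruct \
#     --enable-lora \
#     --lora-modules tenant_a=/adapters/tenant_a tenant_b=/adapters/tenant_b
#
# The server then selects an adapter per request via the `model` field,
# so the management plane on your side is just a lookup:

ADAPTERS = {  # tenant id -> registered adapter name (hypothetical)
    "acme": "tenant_a",
    "globex": "tenant_b",
}
BASE_MODEL = "meta-llama/Llama-3.1-8B-Instruct"

def build_request(tenant_id: str, messages: list) -> dict:
    """Payload for POST /v1/chat/completions; unknown tenants fall back to base."""
    return {
        "model": ADAPTERS.get(tenant_id, BASE_MODEL),
        "messages": messages,
    }

req = build_request("acme", [{"role": "user", "content": "hi"}])
assert req["model"] == "tenant_a"
```

In a real system this lookup lives behind your API gateway, backed by a tenant table that also tracks adapter versions for invalidation and billing.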

Direction 4 — Post-training (RLHF / DPO)

What it is. Whereas fine-tuning teaches the model “do X,” post-training teaches it “prefer X over Y.” You collect (chosen, rejected) response pairs and train the model to favor the chosen ones. RLHF (Reinforcement Learning from Human Feedback) and DPO (Direct Preference Optimization) are the two main techniques; DPO is much simpler and almost as effective.

When it’s worth it. When you have:

  • A consistent quality complaint users raise repeatedly (e.g. “responses are too verbose,” “tone is too formal,” “always recommends our competitor’s product”).
  • Real preference data — not synthetic — at a scale of at least a few thousand pairs.
  • An eval suite that catches behavior changes so you can verify the post-trained model improved on the target dimension without regressing elsewhere.

When it’s not worth it. Almost every other case. Post-training is a serious investment; the data collection alone is a multi-week effort. Most teams’ first ten “RLHF projects” should have been prompt tweaks instead.

Where to start. The RLHF article covers the why. For practical DPO, Hugging Face’s TRL library is the entry point. Start by collecting 500 preference pairs from your prod traces and try a single DPO run on top of a LoRA; you’ll learn more from one end-to-end pass than from a month of reading.
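The DPO objective itself is compact enough to work through by hand. A toy pure-Python version over one preference pair (in practice TRL’s DPOTrainer computes this per batch from token log-probs; the numbers below are illustrative):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * (policy margin - reference margin)).

    Each argument is the summed token log-prob of a full response under the
    policy or the frozen reference model; beta controls how hard the policy
    is pushed away from the reference.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy identical to reference: margin is 0, loss is -log(0.5) = log 2.
assert abs(dpo_loss(-5.0, -7.0, -5.0, -7.0) - math.log(2)) < 1e-9
# Policy prefers the chosen response more than the reference did: loss drops.
assert dpo_loss(-4.0, -8.0, -5.0, -7.0) < math.log(2)
```

The reference-model terms are what make DPO stable: the policy is only rewarded for preferring the chosen response *more than the frozen reference already does*, which keeps it from drifting arbitrarily far from the base model.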

Direction 5 — Agentic patterns we skipped

What it is. Step 10’s agent loop and step 11’s orchestrator are the basics. The literature has more sophisticated patterns:

  • Reflection / self-critique, where the agent reviews its own answer before finalizing.
  • Tree-of-Thought, where the agent explores multiple reasoning paths and picks the best.
  • Plan-and-execute, where a planning model produces a structured plan that worker agents execute.
  • Tool synthesis, where the agent generates new tools at runtime to handle novel sub-problems.

When it’s worth it. Each of these has a narrow range where it earns its complexity:

  • Reflection: when factuality matters more than latency. ~30 lines of code; significant quality lift. Worth doing once you’ve benchmarked the base agent.
  • Tree-of-Thought: when the task is reasoning-heavy and you can spend 5–10× tokens. Niche; great for math/code.
  • Plan-and-execute: when goals are long-horizon (10+ steps). Premature for most apps.
  • Tool synthesis: research-grade; not production-ready for most teams.

When it’s not worth it. When your single-agent loop is solving 90% of user goals already. Adding reflection is a 5-percent quality lift; adding plan-and-execute is a negative quality lift on most tasks. Measure first, then add patterns. Default to “no” on each.

Where to start. The agent articles cover the theory. The Reflection demo shows the simplest valuable pattern in interactive form. If you only add one thing from this direction, add reflection — it’s cheap and pays for itself.
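Reflection really is about 30 lines. A minimal sketch, assuming `llm` is any callable mapping a prompt string to a response string (swap in your real client; the prompts and the APPROVED convention are illustrative):

```python
def reflect(llm, question: str, max_rounds: int = 2) -> str:
    """Draft, self-critique, revise; stop early when the critic approves."""
    answer = llm(f"Answer the question.\n\nQ: {question}")
    for _ in range(max_rounds):
        critique = llm(
            "Review this answer for factual errors or gaps. "
            f"Reply APPROVED if it is fine.\n\nQ: {question}\nA: {answer}"
        )
        if "APPROVED" in critique:
            break
        answer = llm(
            f"Revise the answer using the critique.\n\nQ: {question}"
            f"\nA: {answer}\nCritique: {critique}"
        )
    return answer

# Stub model for testing the loop: approves any answer that cites a source.
def stub_llm(prompt: str) -> str:
    if prompt.startswith("Review"):
        return "APPROVED" if "per the docs" in prompt else "No source cited."
    if prompt.startswith("Revise"):
        return "It is 42, per the docs."
    return "It is 42."

assert reflect(stub_llm, "What is the answer?") == "It is 42, per the docs."
```

The cost model is what makes this worth benchmarking: each round adds two LLM calls, so a capped `max_rounds` keeps worst-case latency at 1 + 2·rounds calls per request.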

A few directions I didn’t list (and why)

These come up a lot in AI-engineering discourse but rarely earn their place in a small product:

  • Custom embedding models. Fine-tune your own embedder on domain data. Real win in some cases (legal, biomedical) but ~6 weeks of work; off-the-shelf MiniLM gets you 90% of the way for free. Defer until your retrieval is your bottleneck and you’ve measured why.
  • Vector databases beyond sqlite-vec. Pinecone, Weaviate, Qdrant, etc. Necessary at 100M+ vectors or when you need cross-region replication. Below that, sqlite-vec is fine and the migration is free if you ever need it.
  • GraphRAG / Hybrid retrieval over knowledge graphs. Powerful for highly-structured domains; massive effort. Most “we need a knowledge graph” instincts turn out to be solvable with better metadata filtering on the existing vector store.
  • Multi-modal (vision, audio). Different problem space. If your product needs it, that’s a different curriculum (the /articles/12-multimodal section starts that journey). Don’t add multi-modal because it’s cool; add it because users need it.

How to actually choose

A simple decision tree from people who’ve built and shipped many of these:

  1. Can you ship a feature this week with the stack you have? Yes? Do that. You ship; you learn from users; the next round of improvements is grounded in reality, not speculation.
  2. Are users complaining about a specific quality issue? If yes, what does the eval suite say about it? If your evals don’t catch it, you can’t measure improvement. Improve the evals first. Then attack the issue.
  3. Is the bill genuinely too high? If yes, do the cost levers from step 14 you haven’t done yet. Quantization to 4-bit + speculative decoding is often a 4× cost cut. Distillation is the answer only after you’ve done the cheap things.
  4. Are you bored of shipping features and want to learn something new? That’s fine, just be honest about it. Pick LoRA fine-tuning — it’s the most generally-useful skill on this list.

Cross-references

What we shipped, end to end

For one last moment of perspective:

  • Foundations (steps 00–05). Picked a model. Ran it locally. Built an eval harness. Wrapped it in a real API.
  • Building (steps 06–11). Designed chunking. Built a vector store. Wrote a hybrid retriever with rerank. Built a tool primitive. Built an agent loop. Built a multi-agent orchestrator.
  • Production (steps 12–16). Instrumented with Phoenix. Wired prod evals. Optimized cost and latency. Deployed publicly. Mapped the next directions.

Sixteen steps. A real production AI service. Not a prototype, not a tutorial, not a notebook. Something a stranger could hit at a public URL and get value from.

The work from here is the same work you’ve been doing — read the eval scores, fix the user complaints, ship features users actually want. The stack is just the substrate. What you build on it is the thing that matters.

Now go make something.