step 11 · ship · building
Multi-agent orchestration
Supervisor + workers, fan-out / fan-in, when to use it, when one agent is enough. Patterns for reliability.
Multi-agent is one of the most over-applied patterns in AI engineering right now. Hop on Twitter and you’ll see “12-agent architectures” for tasks that a single well-prompted agent could finish in three turns. Most of the time, multi-agent is a way to spend more tokens to get a worse answer, slower.
But there are real cases where it pays off — and when it does, the pattern is small and well-defined. We’ll build one of those small patterns end-to-end, and along the way we’ll be honest about when not to use it.
When multi-agent actually helps
Three legitimate reasons, in decreasing order of how often they apply:
- Parallelism. The task naturally fans out into independent subtasks. “Compare frameworks A, B, C” → three workers, one per framework, run in parallel, supervisor merges. Wall-clock time drops by ~3×; quality stays the same or improves because each worker has more context budget per subtask.
- Specialization via different system prompts. A “researcher” agent with patient-search instructions and a “writer” agent with tight-prose instructions produce better artifacts than a single generalist. The same model with different prompts; same call cost.
- Adversarial review. A “critic” agent that’s prompted to find flaws ships fewer hallucinated answers than a single agent reviewing its own output. Costs one extra LLM turn; pays for itself on factuality.
Three illegitimate reasons:
- “Modularity.” Splitting an agent into five agents because you want clean code. The framework overhead exceeds the readability benefit; one agent with well-organized tools is almost always cleaner.
- “Safety via committee.” Three agents voting doesn’t make a wrong answer right; it just makes you 3× as confident in it. (The exception: when each agent has different tools or knowledge, aggregating their answers can help. But that benefit comes from the different inputs; it’s parallelism, not voting.)
- “It’s the future.” Maybe it is. Today, it isn’t.
The pattern we’ll build covers all three legitimate cases: a supervisor that decomposes, workers that fan out, a critic that reviews. Skip the orchestrator if your task doesn’t benefit from one of those three.
The supervisor / workers / critic pattern
            ┌──────────┐
  goal ──→  │Supervisor│ ──── decomposes into N subtasks
            └─────┬────┘
                  │ fan-out
     ┌────────────┼────────────┐
     ▼            ▼            ▼
┌────────┐   ┌────────┐   ┌────────┐
│Worker A│   │Worker B│   │Worker C│   (parallel; each is a step-10 Agent)
└────┬───┘   └────┬───┘   └────┬───┘
     │            │            │
     └────────────┼────────────┘
                  │ fan-in
                  ▼
             ┌────────┐
             │Combiner│ ── merges into a draft answer
             └────┬───┘
                  ▼
             ┌────────┐
             │ Critic │ ── reviews; emits issues or "ship it"
             └────┬───┘
                  ▼
                final
Five LLM calls minimum (one for the supervisor, one per worker, one for the critic), plus whatever extra turns and tool calls each worker spends inside its own agent loop. We could add a revise step where the combiner re-runs given the critic’s notes; we’ll keep that as an exercise.
The supervisor
# stack/orchestrator.py
from __future__ import annotations
import asyncio
import json
import re
import time
from dataclasses import dataclass, field
from typing import Callable
from stack.llm import LLM
from stack.tools import ToolRegistry
from stack.agent import Agent, AgentConfig, AgentResult
@dataclass
class Subtask:
    """A piece of work the supervisor hands to a worker."""
    title: str
    instructions: str


@dataclass
class WorkerOutput:
    """A worker's findings."""
    subtask: Subtask
    result: AgentResult


@dataclass
class OrchestratorResult:
    """The final return."""
    final: str
    subtasks: list[Subtask]
    workers: list[WorkerOutput]
    critique: str
    total_seconds: float
    total_tokens: int
SUPERVISOR_PROMPT = """\
You are a supervisor agent. You break a user's research question into 2–4
independent subtasks that workers can investigate in parallel. Output ONLY
a JSON list of objects with keys "title" and "instructions". Example:
[
{"title": "Investigate framework X", "instructions": "Find X's strengths, weaknesses, and ecosystem maturity."},
{"title": "Investigate framework Y", "instructions": "Find Y's strengths, weaknesses, and ecosystem maturity."}
]
Subtasks must be:
- Independent (a worker can solve one without seeing the others' results).
- Specific enough to act on with the available tools.
- Of similar size, so parallel runtime is balanced.
"""
class Supervisor:
    """Decomposes a user goal into subtasks via a single LLM call."""

    def __init__(self, llm: LLM, temperature: float = 0.2) -> None:
        self.llm = llm
        self.temperature = temperature

    def decompose(self, user_goal: str) -> list[Subtask]:
        response = self.llm.chat(
            messages=[
                {"role": "system", "content": SUPERVISOR_PROMPT},
                {"role": "user", "content": user_goal},
            ],
            temperature=self.temperature,
        )
        text = response["choices"][0]["message"]["content"] or "[]"
        try:
            obj = json.loads(_strip_codefence(text))
            return [Subtask(**item) for item in obj]
        except Exception:
            # Fallback: one subtask = the whole goal. Worker handles it.
            return [Subtask(title="Full goal", instructions=user_goal)]


def _strip_codefence(s: str) -> str:
    """Strip ```json ... ``` if the model wrapped its output."""
    s = s.strip()
    s = re.sub(r"^```(?:json)?\s*", "", s)
    s = re.sub(r"\s*```$", "", s)
    return s
The supervisor is a single LLM call that emits JSON. No tools, no loop. Decomposition is the one job; if it fails, we degrade to “one subtask = the whole goal” and let a single worker handle it. Fail-soft is part of the contract.
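If you want to see the fail-soft path without spinning up a model, a duck-typed stub is enough. A throwaway sketch, assuming the stack.orchestrator module layout above; the _BrokenLLM stub is ours, not part of the stack:

# Quick check of the fail-soft path, using a stub in place of the real LLM client.
from stack.orchestrator import Supervisor, Subtask

class _BrokenLLM:
    """Stand-in that returns something that isn't JSON."""
    def chat(self, messages, temperature=0.0):
        return {"choices": [{"message": {"content": "I refuse to emit JSON."}}]}

subtasks = Supervisor(_BrokenLLM()).decompose("Compare FastAPI and Flask")
# Decomposition failed, so we get one whole-goal subtask back instead of an exception.
assert subtasks == [Subtask(title="Full goal", instructions="Compare FastAPI and Flask")]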
The workers
Each worker is just a step-10 Agent with a goal-specific user message:
WORKER_SYSTEM_PROMPT = """\
You are a focused research agent. You'll be given one specific subtask.
Use the available tools to investigate. When you're confident, produce
a concise written summary (under 200 words) that another agent can
combine with peers' work. Cite tool results inline using [chunk_id]
or [tool_name].
"""
def run_worker(
    llm: LLM,
    registry: ToolRegistry,
    subtask: Subtask,
    config: AgentConfig | None = None,
) -> WorkerOutput:
    """Run a single worker agent on one subtask."""
    agent = Agent(
        llm,
        registry,
        WORKER_SYSTEM_PROMPT,
        config or AgentConfig(max_iters=6, max_seconds=30.0),
    )
    result = agent.run(
        f"Subtask: {subtask.title}\n\n"
        f"Instructions: {subtask.instructions}\n\n"
        f"Produce your concise summary now."
    )
    return WorkerOutput(subtask=subtask, result=result)
Tighter budgets than the single-agent run from step 10: workers should be cheap. If a single worker needs 10 minutes to handle a subtask, that’s a sign the supervisor split too coarsely.
Parallel fan-out
async def run_workers_parallel(
    llm: LLM,
    registry: ToolRegistry,
    subtasks: list[Subtask],
    config: AgentConfig | None = None,
) -> list[WorkerOutput]:
    """Run all workers in parallel via asyncio. Each call is sync but isolated."""
    loop = asyncio.get_running_loop()
    return await asyncio.gather(*[
        loop.run_in_executor(None, run_worker, llm, registry, st, config)
        for st in subtasks
    ])
Each worker is a synchronous Agent.run call; we put each on a thread-pool executor and let asyncio fan them out. Three workers in parallel finish in roughly the time of the slowest one, not the sum.
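If you’d rather not touch the loop’s executor directly, asyncio.to_thread (Python 3.9+) is an equivalent spelling of the same fan-out. A sketch, not a change to the module above:

async def run_workers_parallel_v2(
    llm: LLM,
    registry: ToolRegistry,
    subtasks: list[Subtask],
    config: AgentConfig | None = None,
) -> list[WorkerOutput]:
    """Same behavior as run_workers_parallel: one thread per synchronous Agent.run."""
    return await asyncio.gather(*[
        asyncio.to_thread(run_worker, llm, registry, st, config)
        for st in subtasks
    ])

Both versions use the default thread pool, whose size (min(32, cpu_count + 4) in CPython) is more than enough for a handful of workers.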
The combiner and critic
Combiner: a templated merge, no LLM call needed.
def combine(workers: list[WorkerOutput]) -> str:
    """Merge worker outputs into a single draft. Cheap; no LLM."""
    sections = []
    for w in workers:
        sections.append(f"## {w.subtask.title}\n\n{w.result.final}")
    return "\n\n".join(sections)
We could ask an LLM to write a synthesis pass. We don’t, because cheap-and-mechanical works for most fan-out cases — each worker’s section is already self-contained — and avoids one more chance to hallucinate. A synthesis pass is appropriate when the subtasks have to be reconciled (e.g. “the workers reported different things about X”); it’s not appropriate when they’re independent investigations.
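If you do hit the reconciliation case, a synthesis pass is one extra LLM call layered on top of the mechanical merge. A sketch against the same LLM.chat interface used above; the prompt wording and function name are ours:

SYNTHESIS_PROMPT = """\
You are an editor. Merge the worker sections below into one coherent answer.
Resolve contradictions explicitly and keep the workers' inline citations.
"""

def combine_with_llm(llm: LLM, user_goal: str, workers: list[WorkerOutput]) -> str:
    """Optional reconciling merge. Costs one LLM call; falls back to the mechanical draft."""
    draft = combine(workers)
    response = llm.chat(
        messages=[
            {"role": "system", "content": SYNTHESIS_PROMPT},
            {"role": "user", "content": f"User asked: {user_goal}\n\n{draft}"},
        ],
        temperature=0.2,
    )
    return response["choices"][0]["message"]["content"] or draft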
Critic: a single LLM call that reviews the draft.
CRITIC_PROMPT = """\
You are a critic. You're given a draft answer assembled from multiple
research workers. Your job is to identify problems:
- Factual claims that aren't supported by the worker citations.
- Internal contradictions between sections.
- Vague or hand-wavy passages that need specifics.
- Missing information that the user actually asked for.
Output a SHORT critique (under 150 words). Start with one of:
- "SHIP IT" if the draft is solid.
- "REVISE" if it has issues — list them as a numbered list.
Don't be polite; be useful.
"""
class Critic:
    """A single LLM call that reviews the combined draft."""

    def __init__(self, llm: LLM, temperature: float = 0.0) -> None:
        self.llm = llm
        self.temperature = temperature

    def review(self, user_goal: str, draft: str) -> str:
        response = self.llm.chat(
            messages=[
                {"role": "system", "content": CRITIC_PROMPT},
                {"role": "user", "content":
                    f"User asked: {user_goal}\n\n--- DRAFT ---\n{draft}"},
            ],
            temperature=self.temperature,
        )
        return response["choices"][0]["message"]["content"] or ""
Temperature 0.0 for the critic. Critics should be predictable; we want the same draft to get the same critique on different runs.
The orchestrator
class Orchestrator:
    """Supervisor + workers + critic, glued."""

    def __init__(
        self,
        llm: LLM,
        registry: ToolRegistry,
        worker_config: AgentConfig | None = None,
    ) -> None:
        self.supervisor = Supervisor(llm)
        self.critic = Critic(llm)
        self.llm = llm
        self.registry = registry
        self.worker_config = worker_config

    def run(self, user_goal: str) -> OrchestratorResult:
        start = time.monotonic()

        # 1. Decompose
        subtasks = self.supervisor.decompose(user_goal)

        # 2. Fan out
        workers = asyncio.run(run_workers_parallel(
            self.llm, self.registry, subtasks, self.worker_config,
        ))

        # 3. Combine
        draft = combine(workers)

        # 4. Critique
        critique = self.critic.review(user_goal, draft)

        # 5. Tally
        total_tokens = sum(w.result.total_tokens for w in workers)
        elapsed = time.monotonic() - start

        # 6. Final assembly. If the critic said REVISE, we expose the
        #    critique so the caller can decide whether to revise. A more
        #    advanced orchestrator would loop here; we keep it explicit.
        if critique.strip().upper().startswith("SHIP IT"):
            final = draft
        else:
            final = f"{draft}\n\n---\n## Critic notes\n{critique}"

        return OrchestratorResult(
            final=final,
            subtasks=subtasks,
            workers=workers,
            critique=critique,
            total_seconds=elapsed,
            total_tokens=total_tokens,
        )
The orchestrator is glue. ~30 lines once you have the parts. Don’t let “multi-agent” sound mystical — the entire conceptual surface area is what’s on this page.
The runner script
# stack/orchestrator.py (continued)
from stack.tools import (
    ToolRegistry, tool_from_callable, now, search_docs,
)
from stack.agent import fetch_chunk

if __name__ == "__main__":
    llm = LLM()
    registry = ToolRegistry()
    for fn in (now, search_docs, fetch_chunk):
        registry.register(tool_from_callable(fn))

    orch = Orchestrator(llm, registry)
    result = orch.run(
        "Compare FastAPI, Flask, and Django for building a small JSON API. "
        "I want to pick the one with the best developer experience for solo work."
    )

    print(f"\n=== orchestrator: {result.total_seconds:.1f}s, "
          f"{result.total_tokens} tokens, {len(result.subtasks)} subtasks ===\n")
    print("subtasks:")
    for st in result.subtasks:
        print(f" - {st.title}: {st.instructions[:80]}…")
    print(f"\n=== draft ===\n{result.final}\n")
    print(f"=== critic ===\n{result.critique}\n")
Run it:
uv run python -m stack.orchestrator
Expected output (Llama-3.1-8B; varies):
=== orchestrator: 22.4s, 8431 tokens, 3 subtasks ===
subtasks:
- Investigate FastAPI: Find FastAPI's strengths, weaknesses, and DX for solo JSON-API work…
- Investigate Flask: Same as above for Flask…
- Investigate Django: Same as above for Django (consider DRF where relevant)…
=== draft ===
## Investigate FastAPI
FastAPI is a modern ASGI framework with first-class async, automatic
OpenAPI docs, and Pydantic-based validation [doc-fastapi-001]…
## Investigate Flask
Flask is a minimalist WSGI microframework. Solo DX is excellent for
small APIs but you'll add libraries for validation, OpenAPI…
## Investigate Django
Django + DRF is heavy for a small API but unmatched if you want admin,
ORM, and auth out of the box…
---
## Critic notes
REVISE
1. The FastAPI section claims it's "the most popular" without a
tool-cited source. Soften or cite.
2. Missing direct recommendation. The user asked which to pick;
the draft compares but doesn't conclude.
Three things that just happened:
- The supervisor split the goal into three parallel investigations. Wall clock: ~22s, but each worker took ~18–20s, so we got real parallelism.
- The critic caught a hallucination (“the most popular”) and a structural omission (no concrete recommendation). Both are common single-agent failure modes that a critic catches cheaply.
- The total token cost is roughly 3× a single-agent run. Fan-out parallelism saves wall-clock time; it doesn’t save tokens. If your bottleneck is token cost, multi-agent is a tax. If it’s latency, multi-agent is a refund.
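The arithmetic behind that last point, with rough numbers taken from the run above (yours will differ):

worker_seconds = [18.0, 19.0, 20.0]          # each worker, roughly
overhead = 3.0                               # supervisor + critic calls, roughly
sequential = sum(worker_seconds) + overhead  # ~60 s if the workers ran one after another
parallel = max(worker_seconds) + overhead    # ~23 s, which is about what we saw
# Token spend is identical in both cases; parallelism moves time, not tokens.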
When to skip the orchestrator entirely
A flowchart for “should I use multi-agent” (the same checklist appears as code after the list):
- Is the task naturally parallel (independent subtasks)? → Yes: orchestrator wins.
- Is wall-clock latency your dominant cost? → Yes: orchestrator wins.
- Is the answer hallucination-prone (factual research, multi-source synthesis)? → Yes: at least add a critic.
- Otherwise → single agent. Tighter, cheaper, easier to debug.
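The checklist as a plain function, if you want the decision written down next to the code. This is just the list above turned into a heuristic you fill in by hand, not the LLM-based router we defer below:

def should_use_orchestrator(
    parallelizable: bool,        # independent subtasks?
    latency_bound: bool,         # is wall-clock the dominant cost?
    hallucination_prone: bool,   # factual research, multi-source synthesis?
) -> str:
    if parallelizable or latency_bound:
        return "orchestrator"
    if hallucination_prone:
        return "single agent + critic"
    return "single agent"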
We’ve shipped many production agents. Most of them are single agents. A handful have a critic step. Two have a full supervisor / workers / critic. The orchestrator is the right tool for the right job; the right job is rarer than the literature suggests.
Cross-references
- Multi-Agent demo (interactive) — a similar fan-out / fan-in visualizer
- Reflection demo — the critic step in isolation
- Multi-Agent Orchestration article — the theory behind supervisor / worker / critic
- Planning and Reflection article — when planning steps add up; when they don’t
What we did and didn’t do
What we did:
- A Supervisor that decomposes a goal into 2–4 parallel subtasks
- Worker agents that fan out via a thread-pool executor
- A mechanical combine that merges worker outputs without an extra LLM call
- A Critic that reviews the draft and emits ship/revise notes
- An Orchestrator that glues it all in ~30 lines
- An honest take on when multi-agent is warranted (rarer than you’d think)
What we didn’t:
- Iterative revision. When the critic says “REVISE,” a real orchestrator would re-run the workers (or a synthesis agent) with the critique appended; a minimal sketch follows this list. ~50 lines to add in full. Worth it once you have eval data showing critic notes correlate with quality drops.
- Conditional routing. Some tasks should fan out, others shouldn’t. A “router” agent that decides between single-agent and orchestrator paths. Useful at scale; premature for a side project.
- Inter-worker communication. Workers don’t see each other’s progress in our pattern. Sometimes useful (one worker’s discovery should redirect another’s investigation). Adds significant complexity; defer until proven necessary.
- Persistent memory across runs. Workers start fresh each call. For a customer-facing assistant that should learn from previous sessions, you’ll want a persistence layer. That’s a step-13 / step-14 concern.
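For the first of those, the smallest version is roughly this: one bounded pass where an extra LLM call rewrites the draft against the critic’s notes, then the critic re-reviews. A sketch, not the real thing; the function name and revision prompt are ours, and a production version would also fold the extra tokens and seconds into the result:

def run_with_revision(orch: Orchestrator, user_goal: str, max_revisions: int = 1) -> OrchestratorResult:
    """Run the orchestrator, then revise the draft while the critic says REVISE."""
    result = orch.run(user_goal)
    for _ in range(max_revisions):
        if result.critique.strip().upper().startswith("SHIP IT"):
            break
        response = orch.llm.chat(
            messages=[
                {"role": "system", "content":
                    "Revise the draft to address every numbered critic note. Keep citations."},
                {"role": "user", "content":
                    f"User asked: {user_goal}\n\n--- DRAFT ---\n{result.final}\n\n"
                    f"--- CRITIC ---\n{result.critique}"},
            ],
            temperature=0.2,
        )
        result.final = response["choices"][0]["message"]["content"] or result.final
        result.critique = orch.critic.review(user_goal, result.final)
    return result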
Next — and the end of the Building release
You now have a working LLM stack: model selection, local serving, evals, an OpenAI-compatible API, RAG retrieval, tools, single agents, and an orchestrator. That’s enough to ship a real product. Half of public AI startups are running on less.
The Production release (steps 12–16) is about making it robust enough to run unattended. Step 12 is observability — structured tracing of every prompt, response, tool call, and agent step, so when something breaks at 2 a.m. you can find out why before your users do. Step 13 is production evals — the eval harness from step 04 wired to the live service so quality regressions get caught at deploy time. Step 14 is cost & latency — caches, batching, KV reuse, the operational levers you’ll pull when traffic grows. Step 15 is deploy — Docker, Kubernetes basics, what to actually do at the boundary between your dev box and a real server. Step 16 closes with what’s next — paths to LoRA fine-tuning, multimodal, and the production patterns we didn’t have time to cover.
Take a breath. The Building release is done. From here, the work is operational, not architectural.