Multi-Agent Orchestration

When one agent isn’t enough, multi-agent systems coordinate several specialists. The patterns range from trivial (a router calls one of N agents) to elaborate (debating critics, hierarchical teams). Most production multi-agent systems are simpler than they look.

When to go multi-agent

Strong signals for multi-agent:

  • Distinct specializations: one agent does retrieval, another does code, another does writing.
  • Different access scopes: a “search” agent can’t access “purchase” tools.
  • Parallelism: independent sub-tasks can run concurrently.
  • Different model classes: a cheap router decides; an expensive specialist executes.
  • Different system prompts that conflict if combined.

Signals you don’t need multi-agent (just more tools / better single agent):

  • Fewer than ~10 tools.
  • One coherent personality.
  • Mostly sequential workflow.
  • Latency is critical (each agent hop adds latency).

Pattern 1 — Router / orchestrator

A single coordinating agent decides which specialist to invoke for each subtask:

User → Router agent (cheap, broad) → picks specialist → specialist executes → returns
                              ↑__________________________________|
                              loops until done

Specialists are themselves agents (or simple LLM calls). The router:

  • Sees the user request.
  • Picks a specialist.
  • Waits for the result.
  • Decides next step (call another, finalize).

Used by Cursor (router → code agent), Claude’s project-routing patterns, and support-ticket triage systems.
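
A minimal sketch of the router loop in plain Python. Everything here (`call_llm`, the specialist registry, the JSON decision format) is a placeholder you would swap for your own SDK and schema:

import json

def call_llm(prompt: str) -> str:
    """Placeholder for a real model call via whatever SDK you use."""
    raise NotImplementedError

SPECIALISTS = {
    "research": lambda task: call_llm(f"Research this: {task}"),
    "code":     lambda task: call_llm(f"Write code for: {task}"),
    "writing":  lambda task: call_llm(f"Draft text for: {task}"),
}

def route(user_request: str, max_hops: int = 5) -> str:
    history = [f"User request: {user_request}"]
    for _ in range(max_hops):
        # A cheap model picks the next specialist, or "done" with a final answer.
        decision = json.loads(call_llm(
            'Reply as JSON: {"agent": "research"|"code"|"writing"|"done", "task": "..."}\n'
            + "\n".join(history)
        ))
        if decision["agent"] == "done":
            return decision["task"]          # "task" carries the final answer
        result = SPECIALISTS[decision["agent"]](decision["task"])
        history.append(f'{decision["agent"]} returned: {result}')
    return history[-1]                       # hop budget exhausted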

Pattern 2 — Pipeline

Fixed sequence: agent A → agent B → agent C. Each agent does its part and passes its output to the next.

Researcher agent → Writer agent → Editor agent → Final draft

Pros:

  • Simple to reason about.
  • Each agent has a focused job.

Cons:

  • Hard to recover from mid-pipeline failures.
  • Inflexible — each task must fit the pipeline shape.

Common in content-generation systems.
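
The whole pattern fits in a few lines. A sketch, assuming three hypothetical agent functions that each take and return text:

# Pipeline sketch: each stage is a function from text to text.
# `research_agent`, `writer_agent`, and `editor_agent` are hypothetical.
def run_pipeline(topic: str) -> str:
    notes = research_agent(topic)    # stage A: gather material
    draft = writer_agent(notes)      # stage B: produce a draft
    return editor_agent(draft)       # stage C: polish into the final draft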

Pattern 3 — Supervisor / worker

A supervisor decomposes the task into independent worker assignments, runs them (often in parallel), aggregates:

Supervisor: decomposes → spawns N workers → aggregates results
     ↓            ↓                ↓
 Worker 1     Worker 2    ...  Worker N    (run in parallel)

Used heavily in research agents and large code-modification agents. The supervisor is often the only agent that sees the full context; workers operate on focused subtasks with smaller context windows.

This is exactly Claude Code’s pattern when it spawns sub-agents.
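
A sketch of the fan-out using a thread pool for parallelism; `decompose`, `run_worker`, and `aggregate` stand in for LLM-backed helpers:

# Supervisor/worker sketch: decompose, fan out in parallel, aggregate.
from concurrent.futures import ThreadPoolExecutor

def supervise(task: str) -> str:
    subtasks = decompose(task)            # supervisor sees the full context
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        # each worker sees only its focused subtask, not the full context
        results = list(pool.map(run_worker, subtasks))
    return aggregate(task, results)       # supervisor merges the pieces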

Pattern 4 — Debate / critic

Two or more agents take adversarial positions; their disagreement surfaces problems:

Proposer: Here's a solution X.
Critic: But X has these flaws.
Proposer: Revised: X'.
Critic: X' addresses Y but not Z.
...
Judge: Verdict — accept X' with caveat Z.

Used for high-stakes decisions, code review, factual correctness. Costs more than single-agent but catches errors that don’t surface with one perspective.
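
A sketch of the loop with a hard round cap; `propose`, `critique`, and `judge` are hypothetical LLM-backed functions:

# Debate loop sketch, capped at a fixed number of rounds.
def debate(problem: str, max_rounds: int = 3) -> str:
    solution = propose(problem, feedback=None)
    for _ in range(max_rounds):
        flaws = critique(problem, solution)
        if not flaws:                                  # critic is satisfied
            break
        solution = propose(problem, feedback=flaws)    # revise against the flaws
    return judge(problem, solution)                    # verdict, possibly with caveats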

Pattern 5 — Swarm / consensus

Many agents each attempt the task independently; results are aggregated by voting, majority, or ranking.

Equivalent to self-consistency at the agent level. Expensive; sometimes worth it for hard problems with verifiable rewards.
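
A sketch of majority-vote aggregation. `attempt` is a hypothetical single-agent run, and comparing answers by string equality is an assumption that holds best for short, verifiable outputs:

# Swarm sketch: N independent attempts, aggregated by majority vote.
from collections import Counter

def swarm(task: str, n: int = 5) -> str:
    answers = [attempt(task) for _ in range(n)]        # independent runs
    winner, _votes = Counter(answers).most_common(1)[0]
    return winner                                      # ties fall to first seen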

Pattern 6 — Hierarchy

Recursive: an agent at level N can spawn agents at level N+1, who can spawn level N+2, etc.

Top-level coordinator
  ↓ delegates research subtask
Research lead
  ↓ delegates fact-finding
Fact-finder, fact-finder, fact-finder

Mirrors human organizational structures. Useful for very large tasks, but operationally risky: every added level multiplies cost and compounds errors.
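
A sketch of the recursion with a hard depth cap, which is the main operational safeguard; `is_atomic`, `decompose`, `solve_directly`, and `aggregate` are hypothetical helpers:

# Hierarchy sketch: any agent may delegate, but depth is capped so cost
# and error compounding stay bounded.
MAX_DEPTH = 3

def run_agent(task: str, depth: int = 0) -> str:
    if depth >= MAX_DEPTH or is_atomic(task):
        return solve_directly(task)                    # leaf: no delegation
    subtasks = decompose(task)
    results = [run_agent(t, depth + 1) for t in subtasks]
    return aggregate(task, results)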

Pattern 7 — Two-tier model routing

Cheap-then-expensive:

Haiku 4.5 (cheap, fast) → handles 80% of queries
   ↓ on uncertainty / hard question
Sonnet 4.6 (capable, slower) → handles 20%

Strictly speaking this isn’t multi-agent: it’s the same agent loop with a model swap. But it’s operationally similar, and the cost savings are significant.
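
A sketch of the escalation logic. The escalation signal here is a literal token the cheap model is prompted to emit when unsure, which is a prompting convention rather than an API feature; `call_model` and both model names are placeholders:

# Two-tier sketch: cheap model first, escalate on an explicit uncertainty token.
ESCALATE = "[ESCALATE]"

def answer(query: str) -> str:
    draft = call_model(
        "cheap-fast-model",
        f"Answer the query, or reply {ESCALATE} if unsure.\n\n{query}",
    )
    if ESCALATE in draft:
        return call_model("capable-slow-model", query)   # the hard ~20%
    return draft                                         # the easy ~80%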

Communication between agents

How agents talk to each other:

Structured messages

Agent A emits JSON that conforms to an agreed schema; agent B consumes that JSON directly:

# A produces:
{"task": "summarize", "input": "...", "constraints": {...}}

# B consumes that schema directly.

Reliable, schema-validated, debuggable.
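
A sketch of the validation step on B’s side, hand-rolled with the standard library (production systems often use pydantic or jsonschema instead):

# Validate agent A's output before agent B consumes it.
import json

REQUIRED_KEYS = {"task", "input", "constraints"}

def parse_message(raw: str) -> dict:
    msg = json.loads(raw)                    # raises on malformed JSON
    missing = REQUIRED_KEYS - msg.keys()
    if missing:
        raise ValueError(f"message missing keys: {sorted(missing)}")
    return msg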

Shared blackboard / memory

Agents read/write a common state object:

state = {
    "task": "...",
    "research": [...],
    "draft": "...",
    "review_comments": [],
}

Each agent updates the shared state. The orchestrator decides who runs next based on state.

LangGraph follows this model; many production systems do.
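
A sketch of a blackboard orchestrator that picks the next agent purely from the state above; the agent names are hypothetical, and each agent mutates `state` in place:

# Blackboard orchestrator sketch: routing decisions come from state alone.
def next_agent(state: dict) -> str | None:
    if not state["research"]:
        return "researcher"
    if not state["draft"]:
        return "writer"
    if not state["review_comments"]:
        return "editor"      # assumes the editor always leaves >= 1 comment
    return None              # state is complete; stop

def run(state: dict, agents: dict) -> dict:
    while (name := next_agent(state)) is not None:
        agents[name](state)  # the chosen agent reads and writes shared state
    return state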

Handoff messages

One agent emits a message to another; framework routes it:

agent_A.send(to="agent_B", content="...")

Clean for multi-step protocols; can complicate debugging if not logged carefully.
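
Under the hood this is just message routing. A framework-free sketch, with `agent_A` and `agent_B` matching the example above:

# Handoff sketch: a dict of inboxes as the "framework" router. Real
# frameworks layer queues, retries, and tracing on top of this idea.
from collections import deque

inboxes = {"agent_A": deque(), "agent_B": deque()}

def send(to: str, content: str) -> None:
    inboxes[to].append(content)              # route the message

def drain(agent: str, handler) -> None:
    while inboxes[agent]:
        handler(inboxes[agent].popleft())    # deliver in FIFO order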

Frameworks

  • LangGraph: explicit graph of states/transitions; popular.
  • AutoGen (Microsoft): general-purpose multi-agent framework.
  • CrewAI: role-based (“you are the researcher”, “you are the writer”).
  • OpenAI Assistants / Agents SDK: thin layer over OpenAI; multi-agent via assistants.
  • Anthropic Agent SDK: Claude-native, single + multi-agent patterns.

You can also build all of this with raw Python loops + tool calls. Frameworks add ergonomics, not capabilities.

Cost & latency

Each agent hop is at least one LLM call. A 5-agent pipeline = 5+ LLM calls. Costs and latency add up:

  • Cache shared context across agents.
  • Use cheap models for routers/aggregators.
  • Run independent agents in parallel.
  • Cap iterations on debate / consensus patterns.

Failure modes specific to multi-agent

  • Telephone-game errors: information loss as it passes through agents.
  • Orphaned tasks: a worker’s result is dropped; the orchestrator forgets.
  • Conflicting outputs: two agents produce contradictory results.
  • Deadlock: agent A waits for B, B waits for A.
  • Silent context loss: agent B doesn’t see what A saw, makes a wrong decision.
  • Cost runaway: no single agent enforces a global budget; each looks cheap in isolation, but collectively they’re expensive.

When multi-agent isn’t worth it

A common anti-pattern: building a 5-agent system where a single agent with the same tools would do better.

The test: take the multi-agent system, flatten it into one agent with all the tools. If the single agent performs comparably, you didn’t need multi-agent.

You do need multi-agent if:

  • The agents must have different system prompts that conflict.
  • Different agents need different access (security boundaries).
  • Real parallelism saves wall-clock time.

Otherwise, prefer one capable agent over many narrow ones.

Observability

Multi-agent traces are harder to read than single-agent. Invest early:

  • Log every inter-agent message (a minimal sketch follows this list).
  • Tag traces by parent agent.
  • Visualize agent graphs in your tracing tool (Langfuse, Phoenix, custom).
  • Replay sessions for debugging.
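
A minimal sketch of the first two items: structured logging of each message, tagged with its parent agent. `deliver` is a hypothetical stand-in for your actual routing:

# Log every inter-agent message as structured JSON before delivery.
import json, logging, time

log = logging.getLogger("agents")

def send(sender: str, recipient: str, content: str, parent: str | None = None):
    log.info(json.dumps({
        "ts": time.time(),
        "from": sender,
        "to": recipient,
        "parent": parent,          # tag by parent agent for trace grouping
        "content": content,
    }))
    deliver(recipient, content)    # placeholder router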

Watch it interactively

  • Multi-Agent — supervisor + 3 workers + critic with failure-injection toggle (transient error / timeout) and retry-budget slider. Predict before clicking: with retry=0 and one worker failing, the final answer is visibly degraded (“⚠ no food recommendations — Food worker failed and supervisor’s retry budget was 0”); flip retry to 1 and the same run produces the full answer. The integration prompt panel shows what the supervisor literally sends back to the LLM.
  • Agent Trace Viewer — single-agent traces with failure injection. Multi-agent is just nested versions of this loop.
