
Build an agent loop

ReAct + state management + tool execution + termination. Why most agents are simpler than they look.


In step 09 we built the primitive — the model can call functions in your stack. But call_with_tools is a single conversation turn with a hard iteration cap; it’s not yet an agent. An agent has a goal, decides on its own what to do next, knows when to stop, and operates within a budget you define. It’s a loop with state.

Here’s the unsettling truth that takes most engineers a beat to absorb: the agent loop is genuinely simple. ReAct, tool-using LLMs, “agentic AI” — the literature can make this sound like there’s some deep algorithmic insight you’re missing. There isn’t. There’s a while loop, a few hundred lines of state management, and a lot of careful prompt engineering. By the end of step 10 you’ll have written one and it will work. The hard parts are evaluation (step 13), cost (step 14), and failure modes (this article, sections below).

What an agent actually is

Three components that don’t exist in plain call_with_tools:

  1. A goal. A user message that the agent will pursue across multiple turns until it produces a final answer or hits a budget.
  2. A budget. Wall-clock time, token count, or step count — usually all three. The agent stops when any cap is hit.
  3. State across turns. The agent’s history, scratchpad, intermediate observations. The same messages=[...] we already have, plus structured logging so we can audit what the agent did after the fact.

That’s it. Everything else — planning, reflection, multi-agent coordination — is a layer on top.

The shape of the agent

# stack/agent.py
from __future__ import annotations
import json
import time
from dataclasses import dataclass, field
from typing import Callable

from stack.llm import LLM
from stack.tools import ToolRegistry


@dataclass
class StepLog:
    """One iteration of the loop. Useful for audit and debugging."""
    iteration: int
    role: str                       # "assistant" | "tool"
    content: str | None = None
    tool_name: str | None = None
    tool_args: dict | None = None
    tool_result: str | None = None
    elapsed_ms: float = 0.0
    tokens: int = 0                 # if the backend reports usage


@dataclass
class AgentResult:
    """What the agent returns when it stops."""
    final: str                      # the answer (empty if hit a budget cap)
    steps: list[StepLog]
    stop_reason: str                # "done" | "max_iters" | "max_seconds" | "max_tokens" | "error"
    elapsed_seconds: float
    total_tokens: int


@dataclass
class AgentConfig:
    """Tunable budgets and behavior."""
    max_iters: int = 10
    max_seconds: float = 60.0
    max_tokens: int = 8000          # cumulative across all turns
    temperature: float = 0.2
    history_limit: int = 30         # prune older messages past this
    on_step: Callable[[StepLog], None] | None = None   # streaming hook

Three small dataclasses set up the contract: StepLog for the audit trail, AgentResult for the final return, AgentConfig for the knobs. None of this is theoretical — every prod agent has these or wishes it did. Build them in from day one and you save yourself the pain of retrofitting them three months from now, after an agent bug has already shipped.
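
For example, the on_step hook gives you live visibility while the loop runs. A minimal sketch of a console hook (the print format and the budgets shown here are arbitrary examples):

def print_step(log: StepLog) -> None:
    """Streaming hook: print each step as the agent takes it."""
    if log.role == "assistant":
        print(f"[{log.iteration}] assistant ({log.elapsed_ms:.0f} ms): {(log.content or '')[:80]}")
    else:
        print(f"[{log.iteration}] {log.tool_name}({log.tool_args})")


config = AgentConfig(max_iters=15, max_seconds=120.0, on_step=print_step)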

The loop itself

class Agent:
    """A goal-directed loop on top of an LLM and a ToolRegistry."""

    def __init__(
        self,
        llm: LLM,
        registry: ToolRegistry,
        system_prompt: str,
        config: AgentConfig | None = None,
    ) -> None:
        self.llm = llm
        self.registry = registry
        self.system_prompt = system_prompt
        self.config = config or AgentConfig()

    def run(self, user_goal: str) -> AgentResult:
        cfg = self.config
        start = time.monotonic()
        steps: list[StepLog] = []
        total_tokens = 0

        history: list[dict] = [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": user_goal},
        ]

        for i in range(cfg.max_iters):
            # ── budget check ─────────────────────────────────
            elapsed = time.monotonic() - start
            if elapsed >= cfg.max_seconds:
                return self._stop(steps, "max_seconds", elapsed, total_tokens)
            if total_tokens >= cfg.max_tokens:
                return self._stop(steps, "max_tokens", elapsed, total_tokens)

            # ── prune history before sending ─────────────────
            sent = self._prune_history(history, cfg.history_limit)

            # ── single LLM call ──────────────────────────────
            t0 = time.monotonic()
            try:
                response = self.llm.chat(
                    messages=sent,
                    tools=self.registry.schemas() or None,
                    temperature=cfg.temperature,
                )
            except Exception as exc:
                steps.append(StepLog(
                    iteration=i, role="assistant",
                    content=f"[llm error: {type(exc).__name__}: {exc}]",
                    elapsed_ms=(time.monotonic() - t0) * 1000,
                ))
                return self._stop(steps, "error",
                                  time.monotonic() - start, total_tokens)

            msg = response["choices"][0]["message"]
            usage = response.get("usage") or {}
            tokens_this = usage.get("total_tokens", 0)
            total_tokens += tokens_this

            history.append(msg)
            tool_calls = msg.get("tool_calls") or []

            # ── log the assistant turn ───────────────────────
            log = StepLog(
                iteration=i, role="assistant",
                content=msg.get("content") or "",
                elapsed_ms=(time.monotonic() - t0) * 1000,
                tokens=tokens_this,
            )
            steps.append(log)
            if cfg.on_step:
                cfg.on_step(log)

            # ── done? ────────────────────────────────────────
            if not tool_calls:
                return AgentResult(
                    final=msg.get("content") or "",
                    steps=steps,
                    stop_reason="done",
                    elapsed_seconds=time.monotonic() - start,
                    total_tokens=total_tokens,
                )

            # ── execute each tool call ───────────────────────
            for tc in tool_calls:
                self._dispatch(tc, history, steps, i, cfg)

        # exhausted iterations
        return self._stop(steps, "max_iters",
                          time.monotonic() - start, total_tokens)

    # ─── helpers ────────────────────────────────────────────
    def _dispatch(self, tc, history, steps, iteration, cfg):
        name = tc["function"]["name"]
        raw_args = tc["function"]["arguments"] or "{}"
        t0 = time.monotonic()
        try:
            args = json.loads(raw_args)
            result = self.registry.call(name, args)
            content = json.dumps(result, default=str)
            err = None
        except Exception as exc:
            args = {}
            content = json.dumps({
                "error": f"{type(exc).__name__}: {exc}",
                "hint": "Check argument names and types and try again.",
            })
            err = str(exc)

        history.append({
            "role": "tool",
            "tool_call_id": tc["id"],
            "name": name,
            "content": content,
        })

        log = StepLog(
            iteration=iteration, role="tool",
            tool_name=name, tool_args=args,
            tool_result=content[:500] + ("…" if len(content) > 500 else ""),
            elapsed_ms=(time.monotonic() - t0) * 1000,
        )
        steps.append(log)
        if cfg.on_step:
            cfg.on_step(log)

    def _stop(self, steps, reason, elapsed, total_tokens) -> AgentResult:
        return AgentResult(
            final="", steps=steps, stop_reason=reason,
            elapsed_seconds=elapsed, total_tokens=total_tokens,
        )

    @staticmethod
    def _prune_history(history: list[dict], limit: int) -> list[dict]:
        """Keep the system prompt + the most recent `limit` turns."""
        if len(history) <= limit + 1:
            return history
        return [history[0]] + history[-limit:]

A loop, three budget checks, history pruning, error wrapping, and a step log. About 130 lines that do what most agent frameworks dress up across thousands.

History pruning, in detail

The naïve approach is to keep the full history. That works for a 5-turn agent. It blows up for a 50-turn one: the context window fills up, costs go through the roof, and the model gets confused by its own old reasoning that’s no longer relevant.

Three strategies, in increasing order of sophistication:

  1. Sliding window (what we did above). Keep system prompt + last N turns. Simple, works for most agents. The model loses long-term memory across the cutoff but for a 30-turn window that’s rarely the issue.
  2. Summarize-and-replace. When history exceeds a threshold, ask the model to summarize the older turns into a single “scratchpad” message, replace those turns with the summary. Better for tasks that require long-horizon memory.
  3. Vector recall. Embed each past turn into the vector store from step 07; when sending a new turn, retrieve the K most relevant past turns by cosine similarity and inject them. The most powerful, also the most code.

For step 10 we ship strategy 1. Strategy 2 is a 50-line addition once you need it; strategy 3 is its own feature.
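
If you reach for strategy 2 later, the shape is roughly this. A sketch only, assuming the same LLM.chat interface used above; the threshold and the summarization prompt are illustrative:

def summarize_history(llm: LLM, history: list[dict], keep_recent: int = 10) -> list[dict]:
    """Strategy 2 sketch: collapse older turns into a single scratchpad message."""
    if len(history) <= keep_recent + 1:
        return history
    old, recent = history[1:-keep_recent], history[-keep_recent:]
    summary = llm.chat(
        messages=[
            {"role": "system", "content": "Summarize this agent transcript into a short scratchpad of facts, decisions, and open questions."},
            {"role": "user", "content": json.dumps(old, default=str)},
        ],
        temperature=0.0,
    )["choices"][0]["message"]["content"]
    scratchpad = {"role": "system", "content": f"Earlier progress (summarized): {summary}"}
    return [history[0], scratchpad] + recent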

Three real failure modes

Run an agent in production for a week and you’ll see these. Worth recognizing before they bite:

Failure 1 — Tool-call thrashing

The model calls search_docs("how to install postgres"), gets results, decides they’re “not quite right,” calls search_docs("postgres installation guide"), gets nearly-identical results, calls search_docs("postgresql setup"), etc. Eight iterations of the same search with synonym variations.

Mitigation: cache tool results within a single agent run. If the model calls a tool with arguments it’s already used, return the cached result with a hint: "You've already called this; consider a different approach." 10 lines of code; saves a lot of money.
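
One way to get that cache without touching the loop is to wrap the registry for the duration of a run. A hedged sketch, assuming the ToolRegistry from step 09 exposes schemas() and call(name, args) as the Agent code above expects; the class name and hint wording are illustrative:

import json

from stack.tools import ToolRegistry


class CachedToolRegistry:
    """Per-run cache: identical (tool, args) calls return the cached result plus a nudge."""

    def __init__(self, registry: ToolRegistry) -> None:
        self.registry = registry
        self.seen = {}  # (tool name, canonical JSON args) -> previous result

    def schemas(self):
        return self.registry.schemas()

    def call(self, name: str, args: dict):
        key = (name, json.dumps(args, sort_keys=True, default=str))
        if key in self.seen:
            return {
                "cached": True,
                "hint": "You've already called this tool with these arguments; consider a different approach.",
                "result": self.seen[key],
            }
        result = self.registry.call(name, args)
        self.seen[key] = result
        return result

Agent duck-types the registry, so Agent(llm, CachedToolRegistry(registry), SYSTEM_PROMPT) works as-is; build a fresh wrapper per run so the cache doesn’t leak across goals.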

Failure 2 — Premature giving-up

Especially with smaller models. After two failed tool calls the model produces “I’m sorry, I wasn’t able to find the information you requested” and stops. The user is annoyed; the answer was findable.

Mitigation: in your system prompt, explicitly grant the agent patience — “Try at least three different approaches before giving up. Tools are cheap; the user prefers a slow correct answer to a fast wrong one.” Counterintuitively, telling the model to be persistent improves quality on multi-step tasks more than nearly any other prompt change.

Failure 3 — Format drift on long runs

After 15+ tool calls, the model’s response format starts degrading. Tool calls become malformed, the assistant message gets verbose, the schema gets ignored. Symptom of context exhaustion or temperature drift.

Mitigation: aggressive history pruning (smaller history_limit), low temperature (0.0–0.2), and a periodic refresh — every 10 turns, inject a one-line system reminder: “Remember your tools and their schemas.” Sounds silly; works.
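
The periodic reminder is small enough to keep as a helper next to the loop. A sketch; the cadence and wording are yours to tune:

def maybe_refresh(history: list[dict], iteration: int, every: int = 10) -> None:
    """Every `every` iterations, re-anchor the model on its tools and format."""
    if iteration > 0 and iteration % every == 0:
        history.append({
            "role": "system",
            "content": "Reminder: use only your registered tools and follow their JSON schemas exactly. Keep answers concise.",
        })

Call it at the top of each iteration in run(), before the budget checks.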

The runner script

# stack/agent.py (continued)
from stack.tools import ToolRegistry, tool_from_callable
from stack.tools import now, search_docs


def fetch_chunk(chunk_id: str) -> dict:
    """Fetch the full text of a chunk by its ID.

    Args:
        chunk_id: The ID returned by search_docs.
    """
    # Stub — wire to VectorStore.get(chunk_id) in real use.
    return {"id": chunk_id, "text": f"Full text for chunk {chunk_id}…"}


SYSTEM_PROMPT = """\
You are a research assistant with access to a knowledge base.

When asked a question:
1. Search the knowledge base with `search_docs`.
2. If a result looks promising, fetch its full text with `fetch_chunk`.
3. Try at least three different search angles before giving up.
4. Cite chunk IDs in your final answer.

Tools are cheap; the user prefers a slow correct answer to a fast wrong one.
"""


if __name__ == "__main__":
    llm = LLM()
    registry = ToolRegistry()
    for fn in (now, search_docs, fetch_chunk):
        registry.register(tool_from_callable(fn))

    agent = Agent(llm, registry, SYSTEM_PROMPT)
    result = agent.run(
        "How do I set up the database for this project? "
        "I want exact commands."
    )

    print(f"\n=== {result.stop_reason} after {result.elapsed_seconds:.1f}s, "
          f"{result.total_tokens} tokens, {len(result.steps)} steps ===\n")
    for s in result.steps:
        if s.role == "assistant" and s.content:
            print(f"  [{s.iteration}] thought: {s.content[:120]}…")
        if s.role == "tool":
            print(f"  [{s.iteration}] {s.tool_name}({s.tool_args}) "
                  f"-> {s.tool_result[:80]}…")
    print(f"\nFINAL:\n{result.final}\n")

Run it:

uv run python -m stack.agent

Expected output (Llama-3.1-8B, varies by run):

=== done after 14.3s, 2841 tokens, 7 steps ===

  [0] thought: I'll search the knowledge base for database setup instructions for this project…
  [0] search_docs({'query': 'database setup', 'top_k': 5}) -> [{"id": "doc-001", …
  [1] fetch_chunk({'chunk_id': 'doc-001'}) -> {"id": "doc-001", "text": "Full text for chunk doc-001…"}…
  [2] thought: That gives me partial info. Let me search for the specific install commands…
  [2] search_docs({'query': 'postgresql install commands ubuntu', 'top_k': 5}) -> …
  [3] fetch_chunk({'chunk_id': 'doc-014'}) -> …
  [4] thought: Now I have enough to answer. Let me synthesize…

FINAL:
To set up the database for this project, run the following commands:

1. Install PostgreSQL: `sudo apt install postgresql-15`
2. Create the database: `createdb myproject`
3. Run migrations: `python manage.py migrate`

Sources: doc-001, doc-014.

Three observations:

  1. The agent did exactly what you’d want: searched, fetched, searched again, synthesized.
  2. Most of the wall-clock time is the LLM, not the tools. Step latency is dominated by generation, not retrieval.
  3. The step log is your best friend. When an agent fails, you’ll re-read this log to figure out what the model was thinking. Build it in.
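
Because StepLog is a plain dataclass, persisting that log takes a few lines. A minimal sketch that writes the trail as JSONL; the path and format are one option among many:

import json
from dataclasses import asdict

def dump_steps(result: AgentResult, path: str) -> None:
    """Write the audit trail as JSONL so failed runs can be replayed later."""
    with open(path, "w") as f:
        for step in result.steps:
            f.write(json.dumps(asdict(step), default=str) + "\n")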

What we did and didn’t do

What we did:

  • An Agent class with three-axis budgets (iterations, seconds, tokens)
  • A typed StepLog and AgentResult so you can audit and replay every run
  • History pruning so long agents don’t blow the context window
  • A dispatch path that wraps tool errors and feeds them back to the model
  • A real runner script with three tools and a research question

What we didn’t:

  • Reflection. A separate “critique” call after the agent’s final answer that asks the model “is this actually right?” Big quality lift on hard tasks; ~30 extra lines. Worth adding once you’ve benchmarked the base agent (step 13); a rough sketch follows this list.
  • Planning. A planning step that decomposes the user goal into sub-tasks before the loop starts. Hot research area; useful for very long agents (50+ turns). Premature for most apps.
  • Tool-call caching. Mentioned as a failure mitigation. Implement when you see thrashing, not before.
  • Multi-agent. Two or more agents talking to each other. That’s step 11.
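
For reference, the reflection bullet boils down to one or two extra LLM calls. A rough sketch only, assuming the LLM.chat interface from earlier; it is deliberately not part of this step’s code:

def reflect(llm: LLM, goal: str, draft: str) -> str:
    """Critique pass: ask the model whether the draft answers the goal; revise once if not."""
    review = llm.chat(
        messages=[
            {"role": "system", "content": "You are a strict reviewer. If the answer is correct and complete, reply exactly OK. Otherwise list the problems."},
            {"role": "user", "content": f"Question: {goal}\n\nAnswer: {draft}"},
        ],
        temperature=0.0,
    )["choices"][0]["message"]["content"]
    if review.strip() == "OK":
        return draft
    revised = llm.chat(
        messages=[
            {"role": "system", "content": "Revise the answer to address the reviewer's problems. Output only the revised answer."},
            {"role": "user", "content": f"Question: {goal}\n\nAnswer: {draft}\n\nReviewer: {review}"},
        ],
        temperature=0.0,
    )["choices"][0]["message"]["content"]
    return revised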

Next

Step 11 is multi-agent orchestration — coordinating two or more agents so they handle parts of a task each is best at: a planner that breaks the problem down, workers that execute, a critic that reviews. We’ll see why most multi-agent systems are over-engineered, where the pattern actually pays off, and how to build a small, opinionated orchestrator that doesn’t become its own framework.