production-stack building · 09 / 17 · 22 min read · 30 min hands-on

step 09 · ship · building

Tools and function calling

JSON-schema tool definitions, OSS-model adapters, and the structured-output dance that makes agents possible.

tools · agents

So far the model is a closed box. You feed it text, it produces text. That’s enough for a chatbot, but not for anything that has to do something — read your retrieval index, look up today’s date, call your billing API, run a SQL query.

Tools are the bridge. The contract is small and elegant: you give the model a list of function signatures (name, description, JSON-schema arguments) on every turn. When the model wants to call one, instead of producing prose it produces a structured JSON object: {"name": "search_docs", "arguments": {"query": "..."}}. Your code parses that, runs the function, returns the result as a new message, and the model takes it from there.

Every “agent” framework you’ve heard of — LangChain agents, AutoGen, OpenAI’s Assistants, Anthropic’s tool-use — is a thin loop on top of this primitive. By the end of step 09 you’ll have written that primitive yourself. Step 10 will close the loop.

The tool-call protocol

A single round-trip looks like this:

  1. Client → model. Send the conversation plus a tools=[...] list.
  2. Model → client. Either (a) plain text response, or (b) a structured tool-call: {"name": "...", "arguments": {...}}.
  3. Client. If it’s a tool call, execute the function, append both the tool-call message and the tool-result message to the conversation.
  4. Client → model. Send the updated conversation back; ask for the next turn.
  5. Repeat until the model produces a plain-text response.

That’s it. The “agent loop” in step 10 is literally while last_message.is_tool_call: ....
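Concretely, three message shapes carry the whole protocol in the OpenAI-compatible wire format: the tool definition you send, the tool-call message the model returns, and the tool-result message you append. A sketch as Python dicts — the call_abc123 id and the argument values are made up; note that arguments arrives as a JSON string, not a parsed dict:

# 1. One entry in the tools=[...] list you send every turn:
{"type": "function",
 "function": {"name": "search_docs",
              "description": "Search the knowledge base.",
              "parameters": {"type": "object",
                             "properties": {"query": {"type": "string"}},
                             "required": ["query"]}}}

# 2. The assistant message the model returns when it wants a tool (step 2b):
{"role": "assistant", "content": None,
 "tool_calls": [{"id": "call_abc123", "type": "function",
                 "function": {"name": "search_docs",
                              "arguments": '{"query": "connection strings"}'}}]}

# 3. The tool-result message your code appends before the next request (step 3):
{"role": "tool", "tool_call_id": "call_abc123", "name": "search_docs",
 "content": '[{"id": "doc-001", "score": 0.84}]'}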

Setup

No new dependencies. We’ll use Python’s built-in inspect and typing modules to derive the schema from function signatures.

# nothing to install — this step is pure stdlib

Open the new file:

# stack/tools.py
from __future__ import annotations
import inspect
import json
import re
import types
from dataclasses import dataclass
from typing import Any, Callable, get_type_hints, Union, get_origin, get_args

A Tool registry

A tool is a Python callable plus its JSON-schema. We’ll wrap it in a small dataclass that knows how to render itself for the API:

@dataclass
class Tool:
    name: str
    description: str
    parameters: dict          # JSON-schema for the arguments
    fn: Callable              # the actual Python callable

    def to_openai_schema(self) -> dict:
        """The shape OpenAI / vLLM / Ollama accept on the tools= list."""
        return {
            "type": "function",
            "function": {
                "name": self.name,
                "description": self.description,
                "parameters": self.parameters,
            },
        }


class ToolRegistry:
    """Holds tools by name. Resolves tool calls."""

    def __init__(self) -> None:
        self._tools: dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def schemas(self) -> list[dict]:
        """The list to pass as `tools=...` in the chat request."""
        return [t.to_openai_schema() for t in self._tools.values()]

    def call(self, name: str, arguments: dict) -> Any:
        """Run a tool by name with parsed arguments. Raises KeyError if unknown."""
        tool = self._tools[name]
        return tool.fn(**arguments)

Two responsibilities: expose the schemas for the API and dispatch incoming tool calls to the right Python callable. We never want those two to drift apart, which is why both are derived from the same source — coming up next.

Schema from type hints

Writing JSON-schema by hand is tedious and a major source of bugs. We’ll generate it from the function’s signature and docstring:

# Map Python types → JSON-schema types
_PY_TO_JSON = {
    str: "string",
    int: "integer",
    float: "number",
    bool: "boolean",
}


def _type_to_schema(t: Any) -> dict:
    """Convert a Python type hint to a JSON-schema fragment."""
    # list[X] → {"type": "array", "items": ...}
    if get_origin(t) is list:
        (inner,) = get_args(t)
        return {"type": "array", "items": _type_to_schema(inner)}
    # X | None or Optional[X] → optional via {"anyOf": [...]}
    # (new-style unions show up as types.UnionType, old-style as typing.Union)
    if get_origin(t) in (Union, types.UnionType):
        non_none = [a for a in get_args(t) if a is not type(None)]
        if len(non_none) == 1:
            return _type_to_schema(non_none[0])
        return {"anyOf": [_type_to_schema(a) for a in non_none]}
    # Primitive
    if t in _PY_TO_JSON:
        return {"type": _PY_TO_JSON[t]}
    # Fallback: treat as a string. Good enough for prototypes; extend
    # the table for dataclasses, enums, etc. as you need them.
    return {"type": "string"}


# Tiny docstring parser: pulls description and per-arg comments
# from a Google-style docstring.
#
#   """Search the knowledge base.
#
#   Args:
#       query: The natural-language query.
#       top_k: Number of results to return.
#   """
_ARG_RE = re.compile(r"^\s+(\w+):\s*(.+)$")


def _parse_docstring(doc: str | None) -> tuple[str, dict[str, str]]:
    if not doc:
        return "", {}
    lines = doc.strip().splitlines()
    description = lines[0].strip()
    arg_descs: dict[str, str] = {}
    in_args = False
    for ln in lines[1:]:
        if ln.strip().lower().startswith("args:"):
            in_args = True
            continue
        if in_args:
            m = _ARG_RE.match(ln)
            if m:
                arg_descs[m.group(1)] = m.group(2).strip()
            elif ln.strip() == "":
                continue
            else:
                in_args = False
    return description, arg_descs
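
# For the example docstring above, _parse_docstring returns:
#   ("Search the knowledge base.",
#    {"query": "The natural-language query.",
#     "top_k": "Number of results to return."})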


def tool_from_callable(fn: Callable) -> Tool:
    """Reflect a Python function into a Tool with auto-generated schema."""
    sig = inspect.signature(fn)
    hints = get_type_hints(fn)
    description, arg_descs = _parse_docstring(fn.__doc__)

    properties: dict[str, dict] = {}
    required: list[str] = []
    for arg_name, param in sig.parameters.items():
        annotation = hints.get(arg_name, str)
        schema = _type_to_schema(annotation)
        if arg_name in arg_descs:
            schema["description"] = arg_descs[arg_name]
        properties[arg_name] = schema
        if param.default is inspect.Parameter.empty:
            required.append(arg_name)

    parameters = {"type": "object", "properties": properties}
    if required:
        parameters["required"] = required

    return Tool(
        name=fn.__name__,
        description=description,
        parameters=parameters,
        fn=fn,
    )

Now writing a tool is just writing a normal Python function:

def search_docs(query: str, top_k: int = 5) -> list[dict]:
    """Search the knowledge base for relevant chunks.

    Args:
        query: The natural-language search query.
        top_k: Number of results to return.
    """
    # Hooked up to the HybridRetriever from step 08 in real usage.
    return [{"id": "stub", "text": f"results for {query!r}", "score": 1.0}]

…and registering it:

registry = ToolRegistry()
registry.register(tool_from_callable(search_docs))

The schema, the description, the required list — all derived. No drift possible.
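
To see what the reflection produced, dump the first schema; for search_docs it should come out roughly like this (a sketch — exact formatting depends on json.dumps options):

print(json.dumps(registry.schemas()[0], indent=2))
# {
#   "type": "function",
#   "function": {
#     "name": "search_docs",
#     "description": "Search the knowledge base for relevant chunks.",
#     "parameters": {
#       "type": "object",
#       "properties": {
#         "query": {"type": "string", "description": "The natural-language search query."},
#         "top_k": {"type": "integer", "description": "Number of results to return."}
#       },
#       "required": ["query"]
#     }
#   }
# }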

The chat loop with tools

A single tool-using turn against an OpenAI-compatible endpoint:

# stack/tools.py (continued)
from stack.llm import LLM


def call_with_tools(
    llm: LLM,
    messages: list[dict],
    registry: ToolRegistry,
    max_iters: int = 8,
    temperature: float = 0.2,
) -> list[dict]:
    """Run a chat completion with tool support.

    Loops until the model returns a plain-text response or `max_iters`
    is hit. Returns the full message history so the caller can decide
    what to do with intermediate steps.
    """
    history = list(messages)
    for _ in range(max_iters):
        response = llm.chat(
            messages=history,
            tools=registry.schemas() or None,
            temperature=temperature,
        )
        msg = response["choices"][0]["message"]
        history.append(msg)

        # If the model didn't ask for a tool, we're done.
        tool_calls = msg.get("tool_calls") or []
        if not tool_calls:
            return history

        # Execute each tool call and append its result.
        for tc in tool_calls:
            name = tc["function"]["name"]
            try:
                args = json.loads(tc["function"]["arguments"] or "{}")
                result = registry.call(name, args)
                content = json.dumps(result, default=str)
            except Exception as exc:
                # Hand the error back to the model — it'll usually retry
                # or apologize gracefully.
                content = json.dumps({"error": f"{type(exc).__name__}: {exc}"})

            history.append({
                "role": "tool",
                "tool_call_id": tc["id"],
                "name": name,
                "content": content,
            })

    return history

Key design choices:

  • The registry owns dispatch. The loop never imports tool functions directly — it just calls registry.call(name, args). That decoupling matters when tools come from plugins or a database.
  • Errors go back to the model. If JSON parsing fails or the function raises, we wrap the error in the tool result. The model is good at recovering (“Sorry, let me try again with the correct arguments”) and you get observability for free.
  • A budget on iterations. max_iters=8 is a safety net for a model stuck in a tool-calling loop. We’ll formalize this in step 10’s agent layer.
  • Lower temperature. Tool calls are structured output; higher temperatures hurt schema adherence. 0.2 is a sane default; some teams go to 0.0 for tools.

OSS model adapters

Here’s where it gets interesting. The OpenAI tool-calling format is the de facto standard, but each OSS model speaks it slightly differently:

  • Llama-3.1 / Llama-3.2 (Instruct). Native tool support via the official chat template. Ollama and vLLM both handle the formatting transparently — you pass tools=[...] and they emit tool_calls in the response.
  • Qwen-2.5-Instruct. Native tool support via Hermes-style templates. Works out of the box with vLLM 0.6+; Ollama has it from 0.4+.
  • Mistral-Instruct (v0.3+). Native tool support, but the tool-result message expects a tool_call_id field that some Mistral builds ignore. Test before relying on multi-turn tool loops.
  • Phi-3 / Phi-3.5. No native tool support. You have to prompt-engineer it: emit a system message that says “When you want to call a tool, output JSON in this exact format: …” and parse the model’s text output yourself.
  • Older Llama-2 / Mistral-7B-v0.1. Same as Phi: prompt-engineering only.

For the curriculum we use Llama-3.1-8B-Instruct, which works natively. If you need a tool-calling fallback for older models, here’s the prompt-engineering escape hatch:

# stack/tools.py (continued)
PROMPT_TEMPLATE_FALLBACK = """\
You have access to the following tools. To call a tool, output ONLY
a JSON object on a single line in this exact format:

{{"tool_call": {{"name": "<tool_name>", "arguments": {{...}}}}}}

After the tool runs, you'll receive the result and can respond to the
user. Otherwise, respond normally.

Available tools:
{tools_block}
"""


def render_fallback_system_prompt(registry: ToolRegistry) -> str:
    """For models without native tool support."""
    blocks = []
    for tool in registry._tools.values():
        blocks.append(
            f"- {tool.name}{json.dumps(tool.parameters)}\n"
            f"  {tool.description}"
        )
    return PROMPT_TEMPLATE_FALLBACK.format(tools_block="\n".join(blocks))


_FALLBACK_RE = re.compile(r'\{"tool_call":\s*\{.*\}\s*\}', re.DOTALL)


def parse_fallback_tool_call(text: str) -> dict | None:
    """Extract a tool call from a free-text response. Returns None if not present."""
    m = _FALLBACK_RE.search(text)
    if not m:
        return None
    try:
        obj = json.loads(m.group(0))
        return obj.get("tool_call")
    except json.JSONDecodeError:
        return None

You can skip the fallback path for the main curriculum. We include it because somebody, somewhere on your team will eventually try to run Phi-3 in a low-RAM environment and will need it.

The sanity check

A two-tool runner script. We give the model search_docs (stubbed) and now (returns the current ISO timestamp), then ask a question that requires both:

# stack/tools.py (continued)
import datetime as dt

def now() -> str:
    """Return the current UTC time in ISO-8601 format."""
    return dt.datetime.now(dt.timezone.utc).isoformat()


def search_docs(query: str, top_k: int = 5) -> list[dict]:
    """Search the knowledge base for relevant chunks.

    Args:
        query: The natural-language search query.
        top_k: Number of results to return.
    """
    # Stub. Plug in HybridRetriever from step 08 for real use.
    return [
        {"id": "doc-001", "text": f"Results matching {query!r}.", "score": 0.84},
        {"id": "doc-014", "text": "PostgreSQL setup notes.", "score": 0.71},
    ]


if __name__ == "__main__":
    llm = LLM()
    registry = ToolRegistry()
    registry.register(tool_from_callable(now))
    registry.register(tool_from_callable(search_docs))

    messages = [
        {"role": "system", "content":
            "You are a helpful assistant with two tools. Use them when relevant."},
        {"role": "user", "content":
            "What time is it, and search the docs for 'database connection strings'."},
    ]

    history = call_with_tools(llm, messages, registry)

    print("\n=== conversation ===\n")
    for m in history:
        role = m["role"]
        if role == "tool":
            print(f"[tool: {m['name']}]\n{m['content'][:200]}\n")
        elif m.get("tool_calls"):
            for tc in m["tool_calls"]:
                fn = tc["function"]
                print(f"[assistant calls {fn['name']}({fn['arguments']})]")
        else:
            print(f"[{role}]\n  {m.get('content', '')[:400]}\n")

Run it:

uv run python -m stack.tools

Expected output (Llama-3.1-8B via Ollama; phrasing will vary):

=== conversation ===

[system]
  You are a helpful assistant with two tools. Use them when relevant.

[user]
  What time is it, and search the docs for 'database connection strings'.

[assistant calls now({})]
[assistant calls search_docs({"query": "database connection strings", "top_k": 5})]

[tool: now]
  → "2026-04-30T18:42:11.043912+00:00"

[tool: search_docs]
  → [{"id": "doc-001", "text": "Results matching 'database connection strings'."

[assistant]
  The current time is 2026-04-30T18:42:11Z. I searched the docs for "database
  connection strings" and found two relevant entries: doc-001 (results matching
  your query) and doc-014 (PostgreSQL setup notes). Want me to fetch the full
  contents of either?

What just happened, in order:

  1. The user asked a single question with two intents.
  2. The model emitted two tool calls in parallel (Llama-3.1 supports multi-call). Some models will serialize them; both work.
  3. Your code executed each, appended the results.
  4. The model digested both results and produced a natural-language response.

That’s the whole thing. Every agent you’ll ever build is this loop, with more tools and more guardrails.


What we did and didn’t do

What we did:

  • A Tool and ToolRegistry that hold the API schemas and the dispatch in one place
  • Auto-generated JSON-schema from Python type hints + Google-style docstrings
  • A call_with_tools loop that runs the dance until the model produces a plain response
  • A fallback prompt-engineering path for OSS models without native tool support
  • An end-to-end demo with two tools running in parallel

What we didn’t:

  • Strict JSON-schema validation. A real prod loop validates arguments against the schema before invoking the function (using jsonschema.validate). Skipped for brevity; ~10 lines to add — see the sketch after this list.
  • Streaming tool calls. Some models stream the tool-call JSON token-by-token. Useful for long-arg tools (e.g. SQL query generation) where you want to start showing the user something. Adds complexity; defer until you need it.
  • Concurrent tool execution. When the model emits multiple tool calls, we run them serially. asyncio.gather would parallelize them — important for I/O-bound tools (network requests). One-line change once you make Tool.fn an async callable.
  • Authorization. Tools run with whatever permissions your process has. For multi-user services, gate sensitive tools behind a per-user permission check — done in the dispatcher, not the tool itself.
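
For the first gap, a minimal sketch of strict validation, assuming the jsonschema package is installed and that this lives in stack/tools.py where ToolRegistry and Any are already in scope. ValidatingToolRegistry is a hypothetical subclass name:

import jsonschema

class ValidatingToolRegistry(ToolRegistry):
    """Validate arguments against the tool's schema before dispatch (sketch)."""

    def call(self, name: str, arguments: dict) -> Any:
        tool = self._tools[name]
        try:
            jsonschema.validate(instance=arguments, schema=tool.parameters)
        except jsonschema.ValidationError as exc:
            # Surface a compact, model-readable error instead of a stack trace.
            raise ValueError(f"invalid arguments for {name}: {exc.message}") from exc
        return tool.fn(**arguments)

Because the loop in call_with_tools already routes exceptions back to the model as tool results, a ValidationError turns into a message the model can act on.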

Next

Step 10 is build an agent loop — the formal while not done: think → act → observe structure that turns the tool primitive we just wrote into a goal-directed agent. We’ll add a budget, a planning step, retry handling, and the message-history pruning that keeps long agents from blowing past the context window. By the end, you’ll have a single agent that can answer multi-step research questions on top of your RAG stack.