Tool Use & Function Calling
Tools are how an agent reaches out to the world: search APIs, databases, file systems, code execution, your internal services. The model sees a list of available tools; decides when and how to call them; integrates the results.
Native tool use
Modern APIs (Anthropic, OpenAI, Gemini) support tool use natively:
```python
import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "get_weather",
    "description": "Get the current weather in a given location",
    "input_schema": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "City name"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["location"],
    },
}]

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
)
```
The model decides whether to use the tool. If it does, the response contains a tool_use block with the name and arguments — validated against your schema.
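A minimal sketch of the round trip with the Anthropic Messages API; `get_weather_impl` is a hypothetical handler standing in for your actual implementation:

```python
if response.stop_reason == "tool_use":
    tool_use = next(b for b in response.content if b.type == "tool_use")
    result = get_weather_impl(**tool_use.input)   # hypothetical handler
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        tools=tools,
        messages=[
            {"role": "user", "content": "What's the weather in Paris?"},
            {"role": "assistant", "content": response.content},
            {"role": "user", "content": [{
                "type": "tool_result",
                "tool_use_id": tool_use.id,
                "content": str(result),
            }]},
        ],
    )
```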
Designing good tools
Tool design is API design. Same principles, plus:
One tool, one job
Don’t make a tool that does five things based on a mode parameter. Make five tools.
Descriptive names
`search_jira_tickets` is better than `query`. The model reads names + descriptions to decide which tool to call.
Rich descriptions
The description is your prompt for the model:
"Search Jira tickets by free text or JQL. Use this when the user mentions
a ticket, project, bug, or feature request. Returns up to 10 matching
tickets with key, summary, status, and assignee. For specific known
ticket IDs, prefer get_jira_ticket instead."
Include:
- What it does.
- When to use it.
- When NOT to use it (point to better alternatives).
- What it returns.
Tight schemas
- Use `enum` wherever you have a small set of values.
- Use `required` for parameters that aren't optional.
- Provide a concrete `description` for each parameter.
- Constrain types (`integer` vs `number`, `string` with `format` for dates).
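Putting the last two sections together, a sketch of what a tight `search_jira_tickets` definition might look like (the parameter set is illustrative):

```python
search_jira_tickets_tool = {
    "name": "search_jira_tickets",
    "description": "Search Jira tickets by free text or JQL. ...",  # the rich description above
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Free text or a JQL expression"},
            "status": {"type": "string", "enum": ["open", "in_progress", "done"]},
            "created_after": {"type": "string", "format": "date",
                              "description": "Only tickets created on/after this date (YYYY-MM-DD)"},
            "limit": {"type": "integer", "minimum": 1, "maximum": 10},
        },
        "required": ["query"],
    },
}
```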
Idempotent where possible
Tools that can be retried without side effects are easier to handle. If a tool is destructive (sends email, charges cards), say so explicitly in the description.
Good error messages
When a tool fails, return a message the model can act on:
Bad: "Error: 500 internal server error"
Good: "The user_id 'user_xyz' does not exist. Available users can be listed with list_users."
The model often recovers gracefully if the error suggests a fix.
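In handler code, that usually means catching known failures and translating them into suggestions. A sketch, where the `db` layer is hypothetical and `list_users` echoes the example above:

```python
def get_user(user_id: str) -> str:
    try:
        return db.fetch_user(user_id)   # hypothetical data layer
    except KeyError:
        return (f"The user_id '{user_id}' does not exist. "
                "Available users can be listed with list_users.")
    except TimeoutError:
        return "The user database timed out. Retry once; if it fails again, report the outage."
```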
Common tool categories
Information retrieval
- `search_web(query)`
- `search_kb(query)` (your internal RAG)
- `read_url(url)`
- `get_doc(doc_id)`
Computation
- `calculator(expression)`
- `execute_code(code)` — sandboxed!
- `run_sql(query, database)`
Actions
- `send_email(to, subject, body)`
- `create_ticket(...)`
- `update_record(...)`
Memory / state
- `save_note(content)`
- `recall_notes(query)`
- `set_user_preference(...)`
Meta
- `ask_user(question)` — escalate ambiguity to the human
- `finalize(answer)` — explicit termination
Tool selection strategies
Few tools = better selection
5–10 tools is the sweet spot. Beyond ~30, even strong models start picking the wrong tool.
If you have many tools, consider:
- Hierarchical tools: a top-level `search` tool dispatches to specialized retrievers.
- Tool routing: a separate cheap LLM picks 5 relevant tools; the full agent uses those (a sketch follows this list).
- MCP servers (model context protocol): structured way to attach external tools, with tool filtering.
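A minimal tool-routing sketch, assuming a small, fast model that returns a JSON list of tool names; `cheap_client` and the prompt wording are illustrative:

```python
import json

def route_tools(task: str, all_tools: list[dict], k: int = 5) -> list[dict]:
    catalog = "\n".join(f"- {t['name']}: {t['description']}" for t in all_tools)
    resp = cheap_client.messages.create(   # hypothetical client for a small model
        model="claude-haiku-4-5",
        max_tokens=200,
        messages=[{"role": "user", "content":
            f"Task: {task}\n\nTools:\n{catalog}\n\n"
            f"Return a JSON list of the {k} most relevant tool names."}],
    )
    names = set(json.loads(resp.content[0].text))
    return [t for t in all_tools if t["name"] in names]
```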
Dynamic tool sets
Different parts of a task may need different tools. Some patterns:
- Different agents with different tool sets in a multi-agent system.
- A meta-tool `enable_tools(category)` that exposes more tools (sketched below).
- Tools that “unlock” others by their results (tool A returns tool B’s URL).
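One way `enable_tools` could work, assuming a registry keyed by category; all of the names here are illustrative:

```python
TOOLS_BY_CATEGORY = {
    "jira": [search_jira_tickets_tool, get_jira_ticket_tool],   # hypothetical definitions
    "email": [send_email_tool],
}

active_tools = [enable_tools_tool]   # the agent starts with just the meta-tool

def enable_tools(category: str) -> str:
    extra = TOOLS_BY_CATEGORY.get(category)
    if extra is None:
        return f"Unknown category '{category}'. Available: {sorted(TOOLS_BY_CATEGORY)}"
    active_tools.extend(extra)   # the next model call is made with the enlarged list
    return f"Enabled {len(extra)} tools in category '{category}'."
```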
Code execution as a tool
`execute_code` is wildly powerful — the model writes Python (or any language) to solve sub-problems instead of having a tool for everything.
Pros:
- Generalizes to anything Python can do (math, parsing, transforms).
- Reduces tool sprawl.
- Often cheaper than a dozen specialized tools.
Cons:
- Sandboxing is critical. Use Docker, Firecracker microVMs, gVisor, or a hosted code interpreter (E2B, Modal, Daytona).
- Output size: long output can blow context.
- Errors require the model to debug.
Anthropic’s Claude Code, OpenAI’s Code Interpreter, Cursor — all heavily use code execution.
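For the tool surface itself, a deliberately naive sketch: a child process with a timeout and an output cap. This is not a sandbox on its own; isolation has to come from the runtime around it (Docker, a microVM, or a hosted interpreter as above).

```python
import subprocess

def execute_code(code: str, timeout_s: int = 10, max_output_chars: int = 4000) -> str:
    """Run Python in a child process. NOT a sandbox on its own."""
    try:
        proc = subprocess.run(
            ["python", "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return f"Execution timed out after {timeout_s}s."
    output = (proc.stdout + proc.stderr)[:max_output_chars]
    return output or f"(no output, exit code {proc.returncode})"
```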
Tool-call validation
Before executing:
```python
import jsonschema

def execute(tool_call):
    if tool_call.name not in handlers:
        return f"Unknown tool: {tool_call.name}"
    schema = tools_by_name[tool_call.name].input_schema
    try:
        jsonschema.validate(tool_call.input, schema)   # raises on mismatch
    except jsonschema.ValidationError as e:
        return f"Invalid input: {e.message}"
    try:
        return handlers[tool_call.name](**tool_call.input)
    except Exception as e:
        log.exception(...)
        return f"Tool error: {e}"
```
For provider-strict tools (Anthropic, OpenAI strict mode), schema validation is enforced — but defense in depth never hurts.
Parallel tool calls
Modern APIs let the model call multiple tools in parallel:
```python
# Response contains multiple tool_use blocks; execute concurrently
results = await asyncio.gather(*[call_tool(tc) for tc in tool_calls])
```
Saves latency dramatically when tools are independent. Important for performant agents.
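A slightly fuller sketch, assuming the Anthropic response shape and a hypothetical async `call_tool` wrapper around the `execute` function above:

```python
import asyncio

async def run_tools(response) -> list[dict]:
    tool_calls = [b for b in response.content if b.type == "tool_use"]
    outputs = await asyncio.gather(*[call_tool(tc) for tc in tool_calls])
    # One tool_result per tool_use, matched by id, all returned in a single user turn
    return [
        {"type": "tool_result", "tool_use_id": tc.id, "content": str(out)}
        for tc, out in zip(tool_calls, outputs)
    ]
```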
MCP (Model Context Protocol)
A 2024 protocol from Anthropic for standardized tool servers. An MCP server exposes tools, prompts, and resources over a JSON-RPC interface; clients (Claude, IDE plugins, etc.) connect to it.
Benefits:
- Decouple tool implementation from agent code.
- Reuse tool servers across multiple agents/products.
- Marketplace of community-built servers (search, databases, dev tools).
Adoption growing fast in late 2025/2026.
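A minimal server, assuming the official MCP Python SDK's `FastMCP` helper (a sketch; check the SDK docs for the current API, and `fetch_weather` is a hypothetical backend call):

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("weather")

@mcp.tool()
def get_weather(location: str, unit: str = "celsius") -> str:
    """Get the current weather in a given location."""
    return fetch_weather(location, unit)   # hypothetical backend call

if __name__ == "__main__":
    mcp.run()   # serves the tool over JSON-RPC (stdio by default) to any MCP client
```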
Auth and secrets
Tools that need credentials:
- Per-user: pass tokens through context, not as tool inputs (prevents the model from leaking them).
- Service tokens: keep them server-side; the agent calls a wrapper that injects the token.
- OAuth flows: agent triggers a flow; user authenticates out-of-band; agent resumes.
Never let the model see API keys in the prompt or as tool arguments.
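A sketch of the wrapper pattern: the model supplies only business arguments, and the server-side handler injects the credential (the endpoint and environment variable are illustrative):

```python
import os
import requests

JIRA_TOKEN = os.environ["JIRA_API_TOKEN"]   # lives on the server, never in the prompt

def search_jira_tickets(query: str, limit: int = 10) -> dict:
    # The model passes only `query` and `limit`; the token is injected here.
    resp = requests.get(
        "https://example.atlassian.net/rest/api/3/search",   # illustrative endpoint
        headers={"Authorization": f"Bearer {JIRA_TOKEN}"},
        params={"jql": query, "maxResults": limit},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```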
Handling long tool outputs
A single tool call can return megabytes (a long document, a big query result). Strategies:
- Pagination: tools return chunks; the model can ask for more.
- Summarization: tool result is summarized before being added to context.
- External storage: tool returns an ID; further tools read details by ID.
- Truncation with notice: “[Output truncated to 1000 tokens. Call get_more for the rest.]”
The agent’s working memory is precious; don’t waste it on unfiltered tool output.
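A sketch of truncation with notice, stashing the full output under an ID that a hypothetical `get_more(output_id)` tool can page through later:

```python
import uuid

FULL_OUTPUTS: dict[str, str] = {}   # in practice: redis, a file store, a database

def truncate_with_notice(output: str, limit_chars: int = 4000) -> str:
    if len(output) <= limit_chars:
        return output
    output_id = uuid.uuid4().hex[:8]
    FULL_OUTPUTS[output_id] = output
    return (output[:limit_chars]
            + f"\n[Output truncated to {limit_chars} characters. "
              f"Call get_more(output_id='{output_id}') for the rest.]")
```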
Tool use evaluation
How do you know your tools work?
- Functional tests: each tool has unit tests against its handler.
- Selection tests: given an instruction, does the model pick the right tool?
- End-to-end traces: log every tool call in production; review weekly.
- Invalid call rate: % of tool calls that fail schema validation.
- Recovery rate: when a tool errors, does the agent recover?
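Selection tests can be surprisingly cheap to write. A minimal sketch with pytest, reusing the client and tool list from earlier; the instruction/tool pairs are illustrative:

```python
import pytest

CASES = [
    ("What's the weather in Paris?", "get_weather"),
    ("Find open bugs in the PLAT project", "search_jira_tickets"),
]

@pytest.mark.parametrize("instruction,expected_tool", CASES)
def test_tool_selection(instruction, expected_tool):
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        tools=tools,
        messages=[{"role": "user", "content": instruction}],
    )
    chosen = [b.name for b in response.content if b.type == "tool_use"]
    assert expected_tool in chosen
```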
Pitfalls
- Vague descriptions → the model picks the wrong tool.
- Overlapping tools → analysis paralysis; the model dithers.
- No error feedback → the model retries the same broken call.
- Side effects without warning → the model accidentally sends destructive actions.
- Unbounded tool output → context blows up, model gets lost.
- Expensive tools called freely → cost explosion.
Watch it interactively
- Tool Use Builder — pick a scenario, watch the model emit a JSON tool call, see the runtime execute it, see the result feed back. The whole protocol made explicit.
- Structured Outputs — the schema-as-contract foundation tool calling rests on. Edit the response live; watch a JSON-schema validator catch every kind of violation.
- Agent Trace Viewer — full trace with the failure-injection toggle showing how tool errors get handled (transient retry vs permanent give-up).
Build it in code
- `/ship/09` — tools and function calling — `Tool`/`ToolRegistry`, JSON schema auto-generated from Python type hints + Google-style docstrings, OSS-model adapters, the propose-then-act pattern. ~180 lines.
- `/case-studies/02` — code-review agent — propose-then-act tools in a real product (the queue → wrapper-commits pattern that prevents rogue agent actions).