Tool Use & Function Calling

Tools are how an agent reaches out to the world: search APIs, databases, file systems, code execution, your internal services. The model sees a list of available tools, decides when and how to call them, and integrates the results into its reasoning.

Native tool use

Modern APIs (Anthropic, OpenAI, Gemini) support tool use natively:

tools = [{
    "name": "get_weather",
    "description": "Get the current weather in a given location",
    "input_schema": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "City name"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["location"],
    },
}]

response = client.messages.create(
    model="claude-sonnet-4-6",
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
)

The model decides whether to use the tool. If it does, the response contains a tool_use block with the tool name and arguments, structured according to your schema (see tool-call validation below for when that is strictly enforced).
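
For example, a minimal sketch of the follow-up turn with the Anthropic Python SDK: run the requested tool yourself (get_weather here is a hypothetical handler, not part of the SDK), then send the result back as a tool_result block so the model can answer.

if response.stop_reason == "tool_use":
    tool_use = next(b for b in response.content if b.type == "tool_use")
    result = get_weather(**tool_use.input)  # your handler, not part of the SDK

    followup = client.messages.create(
        model="claude-sonnet-4-6",
        tools=tools,
        messages=[
            {"role": "user", "content": "What's the weather in Paris?"},
            {"role": "assistant", "content": response.content},
            {"role": "user", "content": [{
                "type": "tool_result",
                "tool_use_id": tool_use.id,
                "content": str(result),
            }]},
        ],
    )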

Designing good tools

Tool design is API design. Same principles, plus:

One tool, one job

Don’t make a tool that does five things based on a mode parameter. Make five tools.

Descriptive names

search_jira_tickets is better than query. The model reads names + descriptions to decide which tool to call.

Rich descriptions

The description is your prompt for the model:

"Search Jira tickets by free text or JQL. Use this when the user mentions
a ticket, project, bug, or feature request. Returns up to 10 matching
tickets with key, summary, status, and assignee. For specific known
ticket IDs, prefer get_jira_ticket instead."

Include:

  • What it does.
  • When to use it.
  • When NOT to use it (point to better alternatives).
  • What it returns.

Tight schemas

  • Use enum wherever you have a small set of values.
  • Use required for parameters that aren’t optional.
  • Provide concrete description for each parameter.
  • Constrain types (integer vs number, string with format for dates).
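
For illustration, a tighter version of the weather schema from earlier (whether a provider enforces format or additionalProperties varies, so treat those as hints):

input_schema = {
    "type": "object",
    "properties": {
        "location": {"type": "string", "description": "City name, e.g. 'Paris'"},
        "unit": {
            "type": "string",
            "enum": ["celsius", "fahrenheit"],
            "description": "Temperature unit; defaults to celsius",
        },
        "date": {
            "type": "string",
            "format": "date",
            "description": "ISO 8601 date (YYYY-MM-DD); today if omitted",
        },
        "days": {
            "type": "integer",
            "minimum": 1,
            "maximum": 7,
            "description": "Number of forecast days",
        },
    },
    "required": ["location"],
    "additionalProperties": False,
}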

Idempotent where possible

Tools that can be retried without side effects are easier to handle. If a tool is destructive (sends email, charges cards), say so explicitly in the description.

Good error messages

When a tool fails, return a message the model can act on:

Bad: "Error: 500 internal server error"
Good: "The user_id 'user_xyz' does not exist. Available users can be listed with list_users."

The model often recovers gracefully if the error suggests a fix.

Common tool categories

Information retrieval

  • search_web(query)
  • search_kb(query) (your internal RAG)
  • read_url(url)
  • get_doc(doc_id)

Computation

  • calculator(expression)
  • execute_code(code) — sandboxed!
  • run_sql(query, database)

Actions

  • send_email(to, subject, body)
  • create_ticket(...)
  • update_record(...)

Memory / state

  • save_note(content)
  • recall_notes(query)
  • set_user_preference(...)

Meta

  • ask_user(question) — escalate ambiguity to the human
  • finalize(answer) — explicit termination

Tool selection strategies

Few tools = better selection

5–10 tools is the sweet spot. Beyond ~30, even strong models start picking the wrong tool.

If you have many tools, consider:

  • Hierarchical tools: a top-level search tool dispatches to specialized retrievers.
  • Tool routing: a separate cheap LLM picks 5 relevant tools, full agent uses those.
  • MCP servers (Model Context Protocol): a structured way to attach external tools, with tool filtering.
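
A minimal sketch of the routing idea, assuming a hypothetical cheap_llm(prompt) helper that returns plain text: an inexpensive model shortlists tools by name, and only that subset is handed to the main agent.

def route_tools(user_message, all_tools, max_tools=5):
    """Ask a cheap model to shortlist relevant tools by name."""
    catalog = "\n".join(f"- {t['name']}: {t['description']}" for t in all_tools)
    prompt = (
        f"Task: {user_message}\n\nAvailable tools:\n{catalog}\n\n"
        f"Reply with the names of at most {max_tools} relevant tools, comma-separated."
    )
    picked = {name.strip() for name in cheap_llm(prompt).split(",")}  # hypothetical helper
    return [t for t in all_tools if t["name"] in picked]

response = client.messages.create(
    model="claude-sonnet-4-6",
    tools=route_tools(user_message, all_tools),
    messages=[{"role": "user", "content": user_message}],
)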

Dynamic tool sets

Different parts of a task may need different tools. Some patterns:

  • Different agents with different tool sets in a multi-agent system.
  • A meta-tool enable_tools(category) that exposes more tools.
  • Tools that “unlock” others by their results (tool A returns tool B’s URL).
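
A rough sketch of the enable_tools pattern (the category registry and the *_tool definitions are assumptions; pass the updated active_tools list on the next API call):

TOOL_CATEGORIES = {
    "jira": [search_jira_tickets_tool, get_jira_ticket_tool],  # assumed tool dicts
    "email": [send_email_tool],
}

active_tools = [enable_tools_tool]  # start with just the meta-tool

def enable_tools(category: str) -> str:
    """Handler for the enable_tools meta-tool: expose a whole category of tools."""
    if category not in TOOL_CATEGORIES:
        return f"Unknown category. Available: {', '.join(TOOL_CATEGORIES)}"
    active_tools.extend(t for t in TOOL_CATEGORIES[category] if t not in active_tools)
    return f"Enabled '{category}' tools: {[t['name'] for t in TOOL_CATEGORIES[category]]}"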

Code execution as a tool

execute_code is wildly powerful — the model writes Python (or any language) to solve sub-problems instead of having a tool for everything.

Pros:

  • Generalizes to anything Python can do (math, parsing, transforms).
  • Reduces tool sprawl.
  • Often cheaper than a dozen specialized tools.

Cons:

  • Sandboxing is critical. Use Docker, Firecracker microVMs, gVisor, or a hosted code interpreter (E2B, Modal, Daytona).
  • Output size: long output can blow context.
  • Errors require the model to debug.

Anthropic’s Claude Code, OpenAI’s Code Interpreter, Cursor — all heavily use code execution.
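
A bare-bones execute_code handler with a timeout and an output cap, as a sketch only: a plain subprocess is not a sandbox, so in production wrap it in Docker, a microVM, or a hosted interpreter as noted above.

import subprocess
import sys

MAX_OUTPUT = 4_000  # characters of output kept in context

def execute_code(code: str) -> str:
    """Run Python in a separate process with a timeout and truncated output."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=10,
        )
    except subprocess.TimeoutExpired:
        return "Error: execution timed out after 10 seconds."
    output = (proc.stdout + proc.stderr)[:MAX_OUTPUT]
    return output or f"(no output, exit code {proc.returncode})"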

Tool-call validation

Before executing:

import jsonschema

# handlers, tools_by_name, and log are assumed to be defined elsewhere.
def execute(tool_call):
    if tool_call.name not in handlers:
        return f"Unknown tool: {tool_call.name}"

    schema = tools_by_name[tool_call.name]["input_schema"]
    try:
        jsonschema.validate(tool_call.input, schema)  # raises on invalid input
    except jsonschema.ValidationError as e:
        return f"Invalid input: {e.message}"

    try:
        return handlers[tool_call.name](**tool_call.input)
    except Exception as e:
        log.exception("Tool %s failed", tool_call.name)
        return f"Tool error: {e}"

For provider-strict tools (Anthropic, OpenAI strict mode), schema validation is enforced — but defense in depth never hurts.

Parallel tool calls

Modern APIs let the model call multiple tools in parallel:

# Response contains multiple tool_use blocks; execute concurrently
results = await asyncio.gather(*[call_tool(tc) for tc in tool_calls])

Saves latency dramatically when tools are independent. Important for performant agents.
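
Fleshed out slightly, a sketch that reuses the execute() dispatcher from the validation section above and wraps it with asyncio.to_thread so independent calls overlap:

import asyncio

async def call_tool(tc):
    result = await asyncio.to_thread(execute, tc)  # run the sync handler off the event loop
    return {"type": "tool_result", "tool_use_id": tc.id, "content": str(result)}

tool_calls = [b for b in response.content if b.type == "tool_use"]
results = await asyncio.gather(*[call_tool(tc) for tc in tool_calls])

# Send every tool_result back in a single user message.
messages.append({"role": "assistant", "content": response.content})
messages.append({"role": "user", "content": list(results)})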

MCP (Model Context Protocol)

A 2024 protocol from Anthropic for standardized tool servers. An MCP server exposes tools, prompts, and resources over a JSON-RPC interface; clients (Claude, IDE plugins, etc.) connect to it.

Benefits:

  • Decouple tool implementation from agent code.
  • Reuse tool servers across multiple agents/products.
  • Marketplace of community-built servers (search, databases, dev tools).

Adoption has grown quickly through late 2025 and into 2026.
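
A minimal server sketch, assuming the official MCP Python SDK's FastMCP interface (details vary by SDK version; the weather tool is a stub):

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("weather")

@mcp.tool()
def get_weather(location: str, unit: str = "celsius") -> str:
    """Get the current weather in a given location."""
    return f"22 degrees {unit} and sunny in {location}"  # stub implementation

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio by default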

Auth and secrets

Tools that need credentials:

  • Per-user: pass tokens through the execution context (server-side request state), not as tool inputs, so the model never sees them and cannot leak them.
  • Service tokens: keep them server-side; the agent calls a wrapper that injects the token.
  • OAuth flows: agent triggers a flow; user authenticates out-of-band; agent resumes.

Never let the model see API keys in the prompt or as tool arguments.
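
A sketch of the service-token wrapper: the model supplies only the business arguments, and the runtime injects the credential when it executes the call (the environment variable and the Jira URL here are illustrative).

import os
import requests

SERVICE_TOKEN = os.environ["JIRA_TOKEN"]  # stays server-side, never enters the prompt

def search_jira_tickets(query: str) -> str:
    """Tool handler: the model passes only `query`; the token is injected here."""
    resp = requests.get(
        "https://example.atlassian.net/rest/api/3/search",  # illustrative endpoint
        params={"jql": query, "maxResults": 10},
        headers={"Authorization": f"Bearer {SERVICE_TOKEN}"},
        timeout=15,
    )
    resp.raise_for_status()
    return resp.text[:4000]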

Handling long tool outputs

A single tool call can return megabytes (a long document, a big query result). Strategies:

  • Pagination: tools return chunks; the model can ask for more.
  • Summarization: tool result is summarized before being added to context.
  • External storage: tool returns an ID; further tools read details by ID.
  • Truncation with notice: “[Output truncated to 1000 tokens. Call get_more for the rest.]”

The agent’s working memory is precious; don’t waste it on unfiltered tool output.
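
A sketch combining truncation-with-notice and external storage (the store dict, the ref IDs, and the 4,000-character cap standing in for roughly 1,000 tokens are all assumptions):

MAX_CHARS = 4_000  # rough proxy for ~1,000 tokens

def clip_tool_output(output: str, store: dict) -> str:
    """Keep a prefix in context; park the full output for later retrieval."""
    if len(output) <= MAX_CHARS:
        return output
    ref_id = f"out_{len(store)}"
    store[ref_id] = output
    return (output[:MAX_CHARS]
            + f"\n[Output truncated. Call get_more(ref_id='{ref_id}') for the rest.]")

def get_more(ref_id: str, store: dict, offset: int = MAX_CHARS) -> str:
    """Companion tool: return the next chunk of a stored output."""
    full = store.get(ref_id)
    if full is None:
        return f"Unknown ref_id: {ref_id}"
    return full[offset:offset + MAX_CHARS]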

Tool use evaluation

How do you know your tools work?

  • Functional tests: each tool has unit tests against its handler.
  • Selection tests: given an instruction, does the model pick the right tool?
  • End-to-end traces: log every tool call in production; review weekly.
  • Invalid call rate: % of tool calls that fail schema validation.
  • Recovery rate: when a tool errors, does the agent recover?
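
A selection test can be as simple as asserting which tool the model reaches for on a canned instruction (a sketch reusing the client and tools from earlier; the cases are illustrative):

SELECTION_CASES = [
    ("What's the weather in Paris?", "get_weather"),
    ("Find open bugs in the PAYMENTS project", "search_jira_tickets"),
]

def test_tool_selection():
    for instruction, expected_tool in SELECTION_CASES:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            tools=tools,
            messages=[{"role": "user", "content": instruction}],
        )
        chosen = [b.name for b in response.content if b.type == "tool_use"]
        assert expected_tool in chosen, f"{instruction!r} -> picked {chosen}"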

Pitfalls

  • Vague descriptions → the model picks the wrong tool.
  • Overlapping tools → analysis paralysis; the model dithers.
  • No error feedback → the model retries the same broken call.
  • Side effects without warning → the model accidentally triggers destructive actions.
  • Unbounded tool output → context blows up, model gets lost.
  • Expensive tools called freely → cost explosion.

Watch it interactively

  • Tool Use Builder — pick a scenario, watch the model emit a JSON tool call, see the runtime execute it, see the result feed back. The whole protocol made explicit.
  • Structured Outputs — the schema-as-contract foundation tool calling rests on. Edit the response live; watch a JSON-schema validator catch every kind of violation.
  • Agent Trace Viewer — full trace with the failure-injection toggle showing how tool errors get handled (transient retry vs permanent give-up).

Build it in code

See also