Tool Use & Function Calling
Tools are how an agent reaches out to the world: search APIs, databases, file systems, code execution, your internal services. The model sees a list of available tools; decides when and how to call them; integrates the results.
Native tool use
Modern APIs (Anthropic, OpenAI, Gemini) support tool use natively:
```python
import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "get_weather",
    "description": "Get the current weather in a given location",
    "input_schema": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "City name"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["location"],
    },
}]

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
)
```
The model decides whether to use the tool. If it does, the response contains a tool_use block with the name and arguments — validated against your schema.
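A minimal sketch of the round trip with the Anthropic Messages API; `get_weather_impl` is a hypothetical handler standing in for your actual implementation:

```python
if response.stop_reason == "tool_use":
    tool_use = next(b for b in response.content if b.type == "tool_use")
    result = get_weather_impl(**tool_use.input)   # hypothetical handler
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        tools=tools,
        messages=[
            {"role": "user", "content": "What's the weather in Paris?"},
            {"role": "assistant", "content": response.content},
            {"role": "user", "content": [{
                "type": "tool_result",
                "tool_use_id": tool_use.id,
                "content": str(result),
            }]},
        ],
    )
```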
Designing good tools
Tool design is API design. Same principles, plus:
One tool, one job
Don’t make a tool that does five things based on a mode parameter. Make five tools.
Descriptive names
`search_jira_tickets` is better than `query`. The model reads names + descriptions to decide which tool to call.
Rich descriptions
The description is your prompt for the model:
"Search Jira tickets by free text or JQL. Use this when the user mentions
a ticket, project, bug, or feature request. Returns up to 10 matching
tickets with key, summary, status, and assignee. For specific known
ticket IDs, prefer get_jira_ticket instead."
Include:
- What it does.
- When to use it.
- When NOT to use it (point to better alternatives).
- What it returns.
Tight schemas
- Use `enum` wherever you have a small set of values.
- Use `required` for parameters that aren't optional.
- Provide a concrete `description` for each parameter.
- Constrain types (`integer` vs `number`, `string` with `format` for dates).
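Putting the last two sections together, a sketch of what a tight `search_jira_tickets` definition might look like (the parameter set is illustrative):

```python
search_jira_tickets_tool = {
    "name": "search_jira_tickets",
    "description": "Search Jira tickets by free text or JQL. ...",  # the rich description above
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Free text or a JQL expression"},
            "status": {"type": "string", "enum": ["open", "in_progress", "done"]},
            "created_after": {"type": "string", "format": "date",
                              "description": "Only tickets created on/after this date (YYYY-MM-DD)"},
            "limit": {"type": "integer", "minimum": 1, "maximum": 10},
        },
        "required": ["query"],
    },
}
```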
Idempotent where possible
Tools that can be retried without side effects are easier to handle. If a tool is destructive (sends email, charges cards), say so explicitly in the description.
Good error messages
When a tool fails, return a message the model can act on:
Bad: "Error: 500 internal server error"
Good: "The user_id 'user_xyz' does not exist. Available users can be listed with list_users."
The model often recovers gracefully if the error suggests a fix.
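In handler code, that usually means catching known failures and translating them into suggestions. A sketch, where the `db` layer is hypothetical and `list_users` echoes the example above:

```python
def get_user(user_id: str) -> str:
    try:
        return db.fetch_user(user_id)   # hypothetical data layer
    except KeyError:
        return (f"The user_id '{user_id}' does not exist. "
                "Available users can be listed with list_users.")
    except TimeoutError:
        return "The user database timed out. Retry once; if it fails again, report the outage."
```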
Common tool categories
Information retrieval
- `search_web(query)`
- `search_kb(query)` (your internal RAG)
- `read_url(url)`
- `get_doc(doc_id)`
Computation
- `calculator(expression)`
- `execute_code(code)` — sandboxed!
- `run_sql(query, database)`
Actions
- `send_email(to, subject, body)`
- `create_ticket(...)`
- `update_record(...)`
Memory / state
- `save_note(content)`
- `recall_notes(query)`
- `set_user_preference(...)`
Meta
- `ask_user(question)` — escalate ambiguity to the human
- `finalize(answer)` — explicit termination
Tool selection strategies
Few tools = better selection
5–10 tools is the sweet spot. Beyond ~30, even strong models start picking the wrong tool.
If you have many tools, consider:
- Hierarchical tools: a top-level `search` tool dispatches to specialized retrievers.
- Tool routing: a separate cheap LLM picks 5 relevant tools; the full agent uses those (a sketch follows this list).
- MCP servers (model context protocol): structured way to attach external tools, with tool filtering.
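A minimal tool-routing sketch, assuming a small, fast model that returns a JSON list of tool names; `cheap_client` and the prompt wording are illustrative:

```python
import json

def route_tools(task: str, all_tools: list[dict], k: int = 5) -> list[dict]:
    catalog = "\n".join(f"- {t['name']}: {t['description']}" for t in all_tools)
    resp = cheap_client.messages.create(   # hypothetical client for a small model
        model="claude-haiku-4-5",
        max_tokens=200,
        messages=[{"role": "user", "content":
            f"Task: {task}\n\nTools:\n{catalog}\n\n"
            f"Return a JSON list of the {k} most relevant tool names."}],
    )
    names = set(json.loads(resp.content[0].text))
    return [t for t in all_tools if t["name"] in names]
```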
Dynamic tool sets
Different parts of a task may need different tools. Some patterns:
- Different agents with different tool sets in a multi-agent system.
- A meta-tool `enable_tools(category)` that exposes more tools (sketched below).
- Tools that “unlock” others by their results (tool A returns tool B’s URL).
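One way `enable_tools` could work, assuming a registry keyed by category; all of the names here are illustrative:

```python
TOOLS_BY_CATEGORY = {
    "jira": [search_jira_tickets_tool, get_jira_ticket_tool],   # hypothetical definitions
    "email": [send_email_tool],
}

active_tools = [enable_tools_tool]   # the agent starts with just the meta-tool

def enable_tools(category: str) -> str:
    extra = TOOLS_BY_CATEGORY.get(category)
    if extra is None:
        return f"Unknown category '{category}'. Available: {sorted(TOOLS_BY_CATEGORY)}"
    active_tools.extend(extra)   # the next model call is made with the enlarged list
    return f"Enabled {len(extra)} tools in category '{category}'."
```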
Code execution as a tool
`execute_code` is wildly powerful — the model writes Python (or any language) to solve sub-problems instead of having a tool for everything.
Pros:
- Generalizes to anything Python can do (math, parsing, transforms).
- Reduces tool sprawl.
- Often cheaper than a dozen specialized tools.
Cons:
- Sandboxing is critical. Use Docker, Firecracker microVMs, gVisor, or a hosted code interpreter (E2B, Modal, Daytona).
- Output size: long output can blow context.
- Errors require the model to debug.
Anthropic’s Claude Code, OpenAI’s Code Interpreter, Cursor — all heavily use code execution.
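For the tool surface itself, a deliberately naive sketch: a child process with a timeout and an output cap. This is not a sandbox on its own; isolation has to come from the runtime around it (Docker, a microVM, or a hosted interpreter as above).

```python
import subprocess

def execute_code(code: str, timeout_s: int = 10, max_output_chars: int = 4000) -> str:
    """Run Python in a child process. NOT a sandbox on its own."""
    try:
        proc = subprocess.run(
            ["python", "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return f"Execution timed out after {timeout_s}s."
    output = (proc.stdout + proc.stderr)[:max_output_chars]
    return output or f"(no output, exit code {proc.returncode})"
```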
Tool-call validation
Before executing:
```python
import jsonschema

def execute(tool_call):
    if tool_call.name not in handlers:
        return f"Unknown tool: {tool_call.name}"
    schema = tools_by_name[tool_call.name].input_schema
    try:
        jsonschema.validate(tool_call.input, schema)   # raises on mismatch
    except jsonschema.ValidationError as e:
        return f"Invalid input: {e.message}"
    try:
        return handlers[tool_call.name](**tool_call.input)
    except Exception as e:
        log.exception(...)
        return f"Tool error: {e}"
```
For provider-strict tools (Anthropic, OpenAI strict mode), schema validation is enforced — but defense in depth never hurts.
Parallel tool calls
Modern APIs let the model call multiple tools in parallel:
```python
# Response contains multiple tool_use blocks; execute concurrently
results = await asyncio.gather(*[call_tool(tc) for tc in tool_calls])
```
Saves latency dramatically when tools are independent. Important for performant agents.
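A slightly fuller sketch, assuming the Anthropic response shape and a hypothetical async `call_tool` wrapper around the `execute` function above:

```python
import asyncio

async def run_tools(response) -> list[dict]:
    tool_calls = [b for b in response.content if b.type == "tool_use"]
    outputs = await asyncio.gather(*[call_tool(tc) for tc in tool_calls])
    # One tool_result per tool_use, matched by id, all returned in a single user turn
    return [
        {"type": "tool_result", "tool_use_id": tc.id, "content": str(out)}
        for tc, out in zip(tool_calls, outputs)
    ]
```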
MCP (Model Context Protocol)
A 2024 protocol from Anthropic for standardized tool servers. An MCP server exposes tools, prompts, and resources over a JSON-RPC interface; clients (Claude, IDE plugins, etc.) connect to it.
Benefits:
- Decouple tool implementation from agent code.
- Reuse tool servers across multiple agents/products.
- Marketplace of community-built servers (search, databases, dev tools).
Adoption growing fast in late 2025/2026.
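A minimal server, assuming the official MCP Python SDK's `FastMCP` helper (a sketch; check the SDK docs for the current API, and `fetch_weather` is a hypothetical backend call):

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("weather")

@mcp.tool()
def get_weather(location: str, unit: str = "celsius") -> str:
    """Get the current weather in a given location."""
    return fetch_weather(location, unit)   # hypothetical backend call

if __name__ == "__main__":
    mcp.run()   # serves the tool over JSON-RPC (stdio by default) to any MCP client
```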
Auth and secrets
Tools that need credentials:
- Per-user: pass tokens through context, not as tool inputs (prevents the model from leaking them).
- Service tokens: keep them server-side; the agent calls a wrapper that injects the token.
- OAuth flows: agent triggers a flow; user authenticates out-of-band; agent resumes.
Never let the model see API keys in the prompt or as tool arguments.
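A sketch of the wrapper pattern: the model supplies only business arguments, and the server-side handler injects the credential (the endpoint and environment variable are illustrative):

```python
import os
import requests

JIRA_TOKEN = os.environ["JIRA_API_TOKEN"]   # lives on the server, never in the prompt

def search_jira_tickets(query: str, limit: int = 10) -> dict:
    # The model passes only `query` and `limit`; the token is injected here.
    resp = requests.get(
        "https://example.atlassian.net/rest/api/3/search",   # illustrative endpoint
        headers={"Authorization": f"Bearer {JIRA_TOKEN}"},
        params={"jql": query, "maxResults": limit},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```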
Handling long tool outputs
A single tool call can return megabytes (a long document, a big query result). Strategies:
- Pagination: tools return chunks; the model can ask for more.
- Summarization: tool result is summarized before being added to context.
- External storage: tool returns an ID; further tools read details by ID.
- Truncation with notice: “[Output truncated to 1000 tokens. Call get_more for the rest.]”
The agent’s working memory is precious; don’t waste it on unfiltered tool output.
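A sketch of truncation with notice, stashing the full output under an ID that a hypothetical `get_more(output_id)` tool can page through later:

```python
import uuid

FULL_OUTPUTS: dict[str, str] = {}   # in practice: redis, a file store, a database

def truncate_with_notice(output: str, limit_chars: int = 4000) -> str:
    if len(output) <= limit_chars:
        return output
    output_id = uuid.uuid4().hex[:8]
    FULL_OUTPUTS[output_id] = output
    return (output[:limit_chars]
            + f"\n[Output truncated to {limit_chars} characters. "
              f"Call get_more(output_id='{output_id}') for the rest.]")
```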
Tool use evaluation
How do you know your tools work?
- Functional tests: each tool has unit tests against its handler.
- Selection tests: given an instruction, does the model pick the right tool?
- End-to-end traces: log every tool call in production; review weekly.
- Invalid call rate: % of tool calls that fail schema validation.
- Recovery rate: when a tool errors, does the agent recover?
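Selection tests can be surprisingly cheap to write. A minimal sketch with pytest, reusing the client and tool list from earlier; the instruction/tool pairs are illustrative:

```python
import pytest

CASES = [
    ("What's the weather in Paris?", "get_weather"),
    ("Find open bugs in the PLAT project", "search_jira_tickets"),
]

@pytest.mark.parametrize("instruction,expected_tool", CASES)
def test_tool_selection(instruction, expected_tool):
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        tools=tools,
        messages=[{"role": "user", "content": instruction}],
    )
    chosen = [b.name for b in response.content if b.type == "tool_use"]
    assert expected_tool in chosen
```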
Pitfalls
- Vague descriptions → the model picks the wrong tool.
- Overlapping tools → analysis paralysis; the model dithers.
- No error feedback → the model retries the same broken call.
- Side effects without warning → the model accidentally sends destructive actions.
- Unbounded tool output → context blows up, model gets lost.
- Expensive tools called freely → cost explosion.
Watch it interactively
- Tool Use Builder — pick a scenario, watch the model emit a JSON tool call, see the runtime execute it, see the result feed back. The whole protocol made explicit.
- Structured Outputs — the schema-as-contract foundation tool calling rests on. Edit the response live; watch a JSON-schema validator catch every kind of violation.
- Agent Trace Viewer — full trace with the failure-injection toggle showing how tool errors get handled (transient retry vs permanent give-up).
Build it in code
- `/ship/09` — tools and function calling — `Tool`/`ToolRegistry`, JSON schema auto-generated from Python type hints + Google-style docstrings, OSS-model adapters, the propose-then-act pattern. ~180 lines.
- `/case-studies/02` — code-review agent — propose-then-act tools in a real product (the queue → wrapper-commits pattern that prevents rogue agent actions).