LLM Tool Use and Function Calling in Production: From Basic Integration to Advanced Orchestration

A practical guide to building production-grade LLM tool use — from Claude and OpenAI function calling basics through parallel execution, tool search, error handling, and security hardening with working code examples.

Large language models are remarkably good at reasoning, planning, and generating text. But here's the thing — intelligence without the ability to act is just commentary. The moment you need an LLM to check a database, call an API, send an email, or look up today's weather, you cross the threshold from text generation into tool use. And that's where production AI systems get genuinely powerful and, honestly, genuinely difficult to build well.

Tool use (also called function calling) is the mechanism that lets an LLM declare its intent to invoke an external function, receive the result, and fold that result back into its ongoing reasoning. It's the bridge between the model's intelligence and the real world. Without it, your chatbot can discuss the weather philosophically; with it, your chatbot can tell you it's currently 72 degrees and sunny in Austin.

The landscape has evolved fast. Early function calling was brittle — a single JSON schema, a single call, hope for the best. Modern implementations support parallel tool execution, programmatic orchestration, tool catalogs with thousands of entries, and sophisticated error handling.

This article walks through the full spectrum: from fundamental tool-use mechanics and working code examples for both Claude and OpenAI, through advanced patterns like tool search and programmatic tool calling, to the production hardening that separates demos from deployed systems. If you've read our articles on multi-agent AI systems and agentic RAG pipelines, consider this the deep dive into the execution layer that makes those architectures actually work.

Understanding Tool Use Fundamentals

At its core, tool use follows a structured request-response loop. You define a set of tools (functions with names, descriptions, and parameter schemas) and send them to the model alongside the user's message. The model reasons about whether and which tools to call, then responds — not with a final answer, but with a structured tool invocation request. Your application executes the tool, returns the result, and the model incorporates that into its final response.

Simple enough in theory. The devil, as always, is in the details.

The Tool Use Loop

Every modern LLM that supports tool use follows the same fundamental cycle:

  1. Define tools — You describe available tools using JSON schemas that specify each tool's name, description, and expected parameters.
  2. Send request — You send the user's message plus tool definitions to the model.
  3. Model decides — The model determines whether to call a tool, which tool to call, and with what arguments.
  4. Execute tool — Your application receives the structured tool call, executes the actual function, and gathers the result.
  5. Return result — You send the tool result back to the model in a follow-up message.
  6. Model responds — The model uses the tool result to formulate its final response to the user, or decides to call additional tools.

This loop can repeat multiple times within a single conversation turn. A model might call a search tool, analyze the results, then call a database query tool, then synthesize everything into a final answer. Each iteration adds a round trip between your application and the API, which is why understanding and optimizing this loop matters so much in production.
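The six steps above can be sketched end-to-end with a stubbed model. This is purely illustrative: `fake_model` stands in for a real API call, and `get_time` is a made-up tool.

```python
import json


def fake_model(messages: list, tools: list) -> dict:
    """Stand-in for a real LLM API: request one tool call, then answer."""
    if not any(m["role"] == "tool" for m in messages):
        # Step 3: the model decides to call a tool
        return {"tool_call": {"name": "get_time", "arguments": {"zone": "UTC"}}}
    tool_output = next(m for m in messages if m["role"] == "tool")["content"]
    # Step 6: the model folds the result into a final answer
    return {"text": f"The current time is {tool_output}."}


def execute_tool(name: str, arguments: dict) -> str:
    """Step 4: the application runs the actual function."""
    if name == "get_time":
        return "12:00 UTC"  # a real implementation would consult a clock API
    return json.dumps({"error": f"unknown tool: {name}"})


def tool_loop(user_message: str, tools: list) -> str:
    """Steps 2-6, repeated until the model produces a final answer."""
    messages = [{"role": "user", "content": user_message}]
    while True:
        reply = fake_model(messages, tools)
        if "tool_call" in reply:
            call = reply["tool_call"]
            result = execute_tool(call["name"], call["arguments"])
            messages.append({"role": "tool", "content": result})  # step 5
        else:
            return reply["text"]


print(tool_loop("What time is it?", tools=[{"name": "get_time"}]))
```

The real SDK examples later in this article follow exactly this shape; only the message formats and response types differ.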

Tool Definitions and Schemas

Tool definitions are the contract between your application and the model. A well-written tool definition includes a clear name, a detailed description that tells the model when and why to use the tool, and an input schema that specifies required and optional parameters with their types and constraints.

The quality of your tool definitions directly impacts how reliably the model selects and parameterizes tool calls. Treat them like API documentation for an audience that reads extremely carefully.

Best Practice: Write tool descriptions as if you're explaining the tool to a new team member. Include what the tool does, when to use it (and when not to), what the parameters mean, and what the return value looks like. Vague descriptions produce vague tool calls.

Implementing Tool Use with Claude and OpenAI

Let's build working implementations with both major providers. These examples use realistic tools — a weather API and a database lookup — that demonstrate the patterns you'll encounter in production.

Claude API Tool Use (Anthropic SDK)

Claude's tool use implementation centers on content blocks. When Claude decides to use a tool, it returns a tool_use content block with the tool name and input. You execute the tool and return a tool_result content block. Here's a complete working example:

import anthropic
import json

client = anthropic.Anthropic()

# Define tools with detailed schemas
tools = [
    {
        "name": "get_weather",
        "description": (
            "Get the current weather for a given location. Use this when "
            "the user asks about current weather conditions, temperature, "
            "or forecast for a specific city or region. Returns temperature "
            "in the requested unit, conditions, and humidity."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City and state/country, e.g. 'San Francisco, CA'"
                },
                "unit": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "Temperature unit. Defaults to fahrenheit."
                }
            },
            "required": ["location"]
        }
    },
    {
        "name": "lookup_order",
        "description": (
            "Look up an order by order ID. Use this when a user asks about "
            "the status, details, or tracking information for a specific order. "
            "Returns order status, items, shipping info, and estimated delivery."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": "The unique order identifier, e.g. 'ORD-12345'"
                }
            },
            "required": ["order_id"]
        }
    }
]


def execute_tool(tool_name: str, tool_input: dict) -> str:
    """Execute a tool and return the result as a string."""
    if tool_name == "get_weather":
        # In production, this would call a real weather API
        return json.dumps({
            "location": tool_input["location"],
            "temperature": 72,
            "unit": tool_input.get("unit", "fahrenheit"),
            "conditions": "Partly cloudy",
            "humidity": 45
        })
    elif tool_name == "lookup_order":
        return json.dumps({
            "order_id": tool_input["order_id"],
            "status": "shipped",
            "items": ["Widget Pro", "Widget Stand"],
            "tracking_number": "1Z999AA10123456784",
            "estimated_delivery": "2026-02-18"
        })
    else:
        return json.dumps({"error": f"Unknown tool: {tool_name}"})


def chat_with_tools(user_message: str) -> str:
    """Run a complete tool-use conversation loop."""
    messages = [{"role": "user", "content": user_message}]

    while True:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            tools=tools,
            tool_choice={"type": "auto"},  # Let Claude decide
            messages=messages
        )

        # Check if Claude wants to use tools
        if response.stop_reason == "tool_use":
            # Collect all tool use blocks from the response
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = execute_tool(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result
                    })

            # Add assistant response and tool results to messages
            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})

        else:
            # No more tool calls — extract final text
            final_text = ""
            for block in response.content:
                if hasattr(block, "text"):
                    final_text += block.text
            return final_text


# Usage
answer = chat_with_tools("What's the weather in Denver and where is my order ORD-7891?")
print(answer)

Notice the tool_choice parameter. Claude supports three main modes:

  • {"type": "auto"} (default) — Claude decides whether to use a tool or respond directly. This is the right choice for most conversational applications.
  • {"type": "any"} — Claude must use at least one tool. Useful when you know the user's request requires tool execution and want to prevent the model from just guessing an answer.
  • {"type": "tool", "name": "get_weather"} — Force Claude to use a specific tool. Useful in structured pipelines where you know exactly which tool should run next.

OpenAI Function Calling (Chat Completions API)

OpenAI's Chat Completions API uses a similar pattern with some structural differences. Each function definition is nested inside a tool object, and OpenAI offers a strict mode that guarantees the model's output conforms exactly to your JSON schema — which is honestly a pretty big deal for production reliability:

from openai import OpenAI
import json

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": (
                "Get current weather for a location. Returns temperature, "
                "conditions, and humidity."
            ),
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City and state, e.g. 'Boston, MA'"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"]
                    }
                },
                "required": ["location"],
                "additionalProperties": False
            },
            "strict": True  # Enforce exact schema conformance
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_knowledge_base",
            "description": (
                "Search the internal knowledge base for information. Use for "
                "product questions, policy lookups, or documentation queries."
            ),
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "The search query"
                    },
                    "max_results": {
                        "type": "integer",
                        "description": "Maximum results to return (1-10)"
                    }
                },
                "required": ["query", "max_results"],
                "additionalProperties": False
            },
            "strict": True
        }
    }
]


def execute_function(name: str, arguments: dict) -> str:
    """Execute a function and return the result."""
    if name == "get_weather":
        return json.dumps({
            "temperature": 55,
            "unit": arguments.get("unit", "fahrenheit"),
            "conditions": "Rainy",
            "location": arguments["location"]
        })
    elif name == "search_knowledge_base":
        return json.dumps({
            "results": [
                {"title": "Return Policy", "snippet": "30-day return window..."},
                {"title": "Shipping FAQ", "snippet": "Free shipping over $50..."}
            ],
            "total": 2
        })
    return json.dumps({"error": f"Unknown function: {name}"})


def chat_with_functions(user_message: str) -> str:
    """Run a complete function-calling conversation with OpenAI."""
    messages = [{"role": "user", "content": user_message}]

    while True:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools
        )

        message = response.choices[0].message

        if message.tool_calls:
            # Add the assistant's message (with tool calls) to history
            messages.append(message)

            # Execute each tool call and add results
            for tool_call in message.tool_calls:
                arguments = json.loads(tool_call.function.arguments)
                result = execute_function(
                    tool_call.function.name, arguments
                )
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": result
                })
        else:
            return message.content


# Usage
answer = chat_with_functions("What's your return policy and is it raining in Seattle?")
print(answer)

Tip: OpenAI's strict: true mode is a significant production advantage. When enabled, the model is guaranteed to produce function arguments that match your JSON schema exactly — no missing required fields, no wrong types, no extra properties. This eliminates an entire class of runtime errors. The tradeoff is a slight increase in latency on the first call (for schema compilation) and some schema restrictions (no default values, and every field must be listed in required, with optional fields expressed as nullable types).
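Because strict mode requires every property to appear in required, optional fields are usually expressed as nullable instead. The helper below sketches one common way to apply that convention to an existing schema; `strictify` is a hypothetical utility, and enums may additionally need a null entry in their value list.

```python
import copy


def strictify(schema: dict) -> dict:
    """Make an object schema strict-mode friendly: list every property in
    'required', let previously-optional properties accept null, and
    forbid extra keys."""
    out = copy.deepcopy(schema)
    originally_required = set(out.get("required", []))
    for prop_name, prop in out.get("properties", {}).items():
        if prop_name not in originally_required and "type" in prop:
            prop_type = prop["type"]
            if isinstance(prop_type, str):
                prop["type"] = [prop_type, "null"]  # optional -> nullable
    out["required"] = list(out.get("properties", {}).keys())
    out["additionalProperties"] = False
    return out


weather_params = {
    "type": "object",
    "properties": {
        "location": {"type": "string"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["location"],
}
strict_params = strictify(weather_params)
```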

Parallel Tool Execution

Modern LLMs can figure out when multiple tools need to be called at once. When a user asks "What's the weather in Denver and look up order ORD-123?", the model recognizes these are independent operations and requests both tool calls in a single response, rather than making them sequentially.

This is one of those features that sounds minor but makes a huge difference in practice.

How Parallel Tool Calls Work

In Claude, parallel tool use is enabled by default. When Claude determines that multiple tools can be called independently, it returns multiple tool_use content blocks in a single response. Your application should execute them — ideally concurrently — and return all results at once.

import asyncio
import anthropic
import json

client = anthropic.Anthropic()


async def execute_tool_async(tool_name: str, tool_input: dict) -> dict:
    """Execute a tool asynchronously."""
    # Simulate async API calls
    await asyncio.sleep(0.1)
    if tool_name == "get_weather":
        return {"temperature": 68, "conditions": "Sunny"}
    elif tool_name == "lookup_order":
        return {"status": "delivered", "delivered_date": "2026-02-12"}
    return {"error": "Unknown tool"}


async def handle_parallel_tools(response) -> list:
    """Execute all tool calls from a response in parallel."""
    tool_use_blocks = [
        block for block in response.content if block.type == "tool_use"
    ]

    # Run all tool calls concurrently
    tasks = [
        execute_tool_async(block.name, block.input)
        for block in tool_use_blocks
    ]
    results = await asyncio.gather(*tasks)

    # Build tool_result blocks
    tool_results = []
    for block, result in zip(tool_use_blocks, results):
        tool_results.append({
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": json.dumps(result)
        })

    return tool_results

When to Disable Parallel Tool Use

Parallel execution isn't always desirable. When tool calls have dependencies — the output of one tool feeds into the input of another — you need sequential execution. Claude provides a parameter to control this:

# Disable parallel tool use when tools have dependencies
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    tools=tools,
    tool_choice={"type": "auto", "disable_parallel_tool_use": True},
    messages=messages
)

Common scenarios where you should disable parallel tool use:

  • Data dependencies — Tool B needs the result of Tool A (e.g., search for a user, then look up their orders).
  • Stateful operations — Tools that modify shared state (e.g., adding items to a cart, then calculating the total).
  • Rate-limited APIs — When your downstream services can't handle concurrent calls from the same session.
  • Transaction ordering — Financial or order-processing workflows where operation sequence matters.

In OpenAI, you can set parallel_tool_calls: false in the request to achieve the same effect. The default is true for models that support it.

Advanced Pattern: Tool Search for Large Tool Catalogs

What happens when your application has not 5 tools, but 500 — or 5,000? Sending all tool definitions with every request is impractical: it eats up massive context window space, drives up costs, and actually degrades the model's tool selection accuracy. More choices means more confusion. Anthropic's Tool Search feature, available in beta, solves this problem elegantly.

How Tool Search Works

Tool Search is a built-in Anthropic tool that lets Claude search through a large catalog of tools to find relevant ones, instead of receiving all tool definitions upfront. You define your tools with defer_loading: true, which tells the API to withhold the tool schema from the model's context until Claude explicitly searches for it.

import anthropic

client = anthropic.Anthropic()

# Define a large catalog of tools with deferred loading
tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "defer_loading": True,  # Schema not loaded until searched
        "input_schema": {
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
        }
    },
    {
        "name": "search_flights",
        "description": "Search for available flights between airports",
        "defer_loading": True,
        "input_schema": {
            "type": "object",
            "properties": {
                "origin": {"type": "string"},
                "destination": {"type": "string"},
                "date": {"type": "string", "format": "date"}
            },
            "required": ["origin", "destination", "date"]
        }
    },
    # ... hundreds or thousands more tools defined here ...
    {
        "name": "book_hotel",
        "description": "Book a hotel room for specified dates",
        "defer_loading": True,
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "check_in": {"type": "string", "format": "date"},
                "check_out": {"type": "string", "format": "date"},
                "guests": {"type": "integer"}
            },
            "required": ["city", "check_in", "check_out"]
        }
    }
]

# Enable Tool Search as a server-managed tool
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    tools=tools + [
        {
            "type": "tool_search",
            "name": "tool_search",
            "max_results": 5  # Return top 3-5 matching tools
        }
    ],
    messages=[{"role": "user", "content": "I need to fly from JFK to LAX next Friday"}]
)

Context Window Savings

The efficiency gains here are dramatic. With 500 deferred tools, only the tool names and descriptions are included in the initial context — the full schemas (which are the bulk of the token cost) are loaded only for the 3-5 tools that Claude searches for and identifies as relevant. Anthropic reports up to 85% reduction in context usage compared to sending all tool definitions.

Tool Search supports up to 10,000 tools in a single request and uses BM25 and regex-based search under the hood to find the most relevant matches. The search is handled server-side by Anthropic, so you don't need to implement any retrieval logic yourself.

When to use Tool Search: If your application has more than 20-30 tools, Tool Search starts becoming advantageous. Below that threshold, the overhead of the search step isn't worth the context savings. Above 100 tools, it becomes essentially mandatory — the model's ability to select the right tool degrades significantly when it's presented with too many options at once.
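Below that threshold, or on providers without a server-side search tool, a rough client-side pre-filter over tool names and descriptions can serve a similar purpose. The sketch below is a simplified word-overlap ranking, not the BM25 search Anthropic runs server-side:

```python
def rank_tools(query: str, catalog: list[dict], top_k: int = 5) -> list[dict]:
    """Rank tools by word overlap between the query and each tool's
    name + description; keep the top_k tools that match at all."""
    query_words = set(query.lower().split())

    def score(tool: dict) -> int:
        text = f"{tool['name']} {tool['description']}".lower()
        return len(query_words & set(text.split()))

    ranked = sorted(catalog, key=score, reverse=True)
    return [tool for tool in ranked[:top_k] if score(tool) > 0]


catalog = [
    {"name": "get_weather", "description": "Get current weather for a location"},
    {"name": "search_flights", "description": "Search for available flights between airports"},
    {"name": "book_hotel", "description": "Book a hotel room for specified dates"},
]
matches = rank_tools("find flights from JFK to LAX", catalog, top_k=2)
print([t["name"] for t in matches])  # → ['search_flights']
```

Only the filtered subset is then sent as the request's tools list, trading a little recall for a much smaller context.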

Advanced Pattern: Programmatic Tool Calling (PTC)

Standard tool use requires a round trip between Claude and your application for every single tool call. If Claude needs to call five tools to answer a question, that's five API round trips — each adding latency, cost, and complexity. Anthropic's Programmatic Tool Calling (PTC) beta introduces a fundamentally different approach: Claude writes Python code that orchestrates multiple tool calls programmatically.

I'll be honest — when I first heard about this, it sounded like overkill. Then I saw the token savings on a real workflow with 8+ tool calls per turn, and it clicked immediately.

How PTC Works

With PTC enabled, instead of returning individual tool_use blocks, Claude generates a Python script that calls your tools, processes intermediate results, and produces a final output — all in a single turn. The script runs in a sandboxed code execution environment, and the tool calls within the script are intercepted and routed to your actual tool implementations.

import anthropic

client = anthropic.Anthropic()

# Enable PTC with the required beta header
response = client.beta.messages.create(
    model="claude-sonnet-4-20250514",
    betas=["advanced-tool-use-2025-11-20"],
    max_tokens=4096,
    tools=[
        {
            "name": "get_stock_price",
            "description": "Get current stock price by ticker symbol",
            "input_schema": {
                "type": "object",
                "properties": {
                    "ticker": {
                        "type": "string",
                        "description": "Stock ticker symbol, e.g. AAPL"
                    }
                },
                "required": ["ticker"]
            }
        },
        {
            "name": "get_exchange_rate",
            "description": "Get exchange rate between two currencies",
            "input_schema": {
                "type": "object",
                "properties": {
                    "from_currency": {"type": "string"},
                    "to_currency": {"type": "string"}
                },
                "required": ["from_currency", "to_currency"]
            }
        }
    ],
    messages=[{
        "role": "user",
        "content": (
            "Get the prices of AAPL, GOOGL, and MSFT, "
            "and convert each to EUR."
        )
    }]
)

Without PTC, this request would require multiple round trips (three stock price lookups plus at least one exchange rate lookup), likely spread across multiple turns. With PTC, Claude generates a Python script that orchestrates all the calls in a single turn, processes the results, performs the conversions, and formats the output — all within one API call.

Token Savings and Performance

Anthropic reports approximately 37% token savings with PTC compared to standard multi-turn tool use, because the intermediate reasoning and tool results don't need to be re-sent in the conversation context with each round trip. Latency improvements are similarly significant since you're eliminating multiple network round trips.

When to Use PTC vs Standard Tool Use

  • Use PTC when — You have tasks requiring multiple tool calls with data transformations between them, batch operations across many items, or complex conditional logic (if stock price is above X, do Y, otherwise do Z).
  • Use standard tool use when — You need fine-grained control over each tool execution, your tools have side effects that require human approval, or you need to maintain conversation state between tool calls for user interaction.

Important: PTC runs in a sandboxed execution environment. Your tool implementations are still called through the normal tool execution pathway — PTC doesn't execute arbitrary code against your infrastructure. The Python sandbox handles the orchestration logic; actual tool calls are intercepted and routed safely.

Production Error Handling and Resilience

In production, tools fail. APIs time out, databases go down, rate limits get hit, inputs arrive malformed. It's not a question of if but when.

How your system handles these failures determines whether you've built a demo or a deployed product. Robust error handling in tool-use systems requires thinking about failures at multiple levels.

Informative Error Messages

When a tool fails, the error message you return to the model matters enormously. The model uses that information to decide whether to retry, try a different approach, or inform the user. Return structured, actionable error messages — not stack traces.

import json
from enum import Enum

# RateLimitError, ValidationError, and ExternalServiceError are assumed to be
# defined by your API client / validation layer; _call_tool_implementation is
# a placeholder for your actual dispatch logic.


class ErrorType(Enum):
    RETRIABLE = "retriable"
    NON_RETRIABLE = "non_retriable"
    USER_FACING = "user_facing"


def execute_tool_with_error_handling(tool_name: str, tool_input: dict) -> dict:
    """Execute a tool with structured error handling."""
    try:
        result = _call_tool_implementation(tool_name, tool_input)
        return {
            "type": "tool_result",
            "content": json.dumps(result)
        }
    except RateLimitError:
        return {
            "type": "tool_result",
            "content": json.dumps({
                "error": True,
                "error_type": ErrorType.RETRIABLE.value,
                "message": (
                    "Rate limit exceeded for this API. "
                    "The tool can be retried after a brief wait."
                ),
                "retry_after_seconds": 5
            }),
            "is_error": True  # Claude-specific: signals this is an error
        }
    except ValidationError as e:
        return {
            "type": "tool_result",
            "content": json.dumps({
                "error": True,
                "error_type": ErrorType.NON_RETRIABLE.value,
                "message": (
                    f"Invalid input: {str(e)}. "
                    "Please check the parameters and try with corrected values."
                )
            }),
            "is_error": True
        }
    except ExternalServiceError:
        return {
            "type": "tool_result",
            "content": json.dumps({
                "error": True,
                "error_type": ErrorType.USER_FACING.value,
                "message": (
                    "The external service is temporarily unavailable. "
                    "Please inform the user that this feature is "
                    "currently experiencing issues."
                )
            }),
            "is_error": True
        }

Key distinction: Separate model-facing errors (which help the model decide what to do next) from user-facing errors (which should be passed through to the human). A "rate limit exceeded, retry in 5 seconds" message is for the model. A "this service is currently down" message is for the user. Including this distinction in your error structure helps the model make much better decisions about how to handle failures.
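On the application side, the same taxonomy can drive a small decision function. This is a hedged sketch whose action names (continue, retry, surface, abort) are illustrative; the payload fields match the error structure above.

```python
import json


def decide_next_action(tool_result_content: str) -> str:
    """Map a structured tool result payload to a follow-up action."""
    payload = json.loads(tool_result_content)
    if not payload.get("error"):
        return "continue"  # success: feed the result back to the model
    error_type = payload.get("error_type")
    if error_type == "retriable":
        return "retry"  # back off, then re-execute the same tool call
    if error_type == "user_facing":
        return "surface"  # have the model explain the outage to the user
    return "abort"  # non-retriable: the model should change approach
```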

Retry Logic with Exponential Backoff and Jitter

For retriable errors, implement exponential backoff with jitter. The jitter part is critical — without it, if multiple requests hit a rate limit simultaneously, they'll all retry at exactly the same time, causing a "thundering herd" that triggers another rate limit. I've seen this happen in production and it's not fun to debug at 2 AM.

import random
import time
import logging

# RetriableError is assumed to be an exception type raised by your tool layer.

logger = logging.getLogger(__name__)


def retry_with_backoff(
    func,
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    jitter: bool = True
):
    """Retry a function with exponential backoff and optional jitter."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except RetriableError as e:
            if attempt == max_retries:
                logger.error(
                    f"All {max_retries} retries exhausted: {e}"
                )
                raise

            delay = min(base_delay * (2 ** attempt), max_delay)
            if jitter:
                delay = delay * (0.5 + random.random())

            logger.warning(
                f"Attempt {attempt + 1} failed: {e}. "
                f"Retrying in {delay:.2f}s"
            )
            time.sleep(delay)

Idempotent Tool Design

The model may call the same tool multiple times with the same arguments — due to retries, conversation loops, or simply because it re-evaluates its approach. State-changing tools (creating records, sending emails, processing payments) must be idempotent: calling them twice with the same inputs should produce the same result, not duplicate the action.

This is one of those things that's easy to overlook during development and painful to fix after you've sent a customer the same email three times.

import hashlib
import json
import logging

logger = logging.getLogger(__name__)

# In-memory idempotency store (use Redis/database in production)
_idempotency_store: dict = {}


def idempotent_tool_call(
    tool_name: str,
    tool_input: dict,
    execute_fn
) -> dict:
    """Ensure a tool call is executed at most once for given inputs."""
    # Generate a deterministic key from tool name + sorted input
    key_data = json.dumps(
        {"tool": tool_name, "input": tool_input},
        sort_keys=True
    )
    idempotency_key = hashlib.sha256(key_data.encode()).hexdigest()

    if idempotency_key in _idempotency_store:
        logger.info(f"Returning cached result for {tool_name}")
        return _idempotency_store[idempotency_key]

    result = execute_fn(tool_name, tool_input)
    _idempotency_store[idempotency_key] = result
    return result

Preventing Infinite Tool-Call Loops

Here's a subtle but dangerous failure mode: the model enters a loop where it keeps calling tools without making progress. This can happen when a tool returns ambiguous results, when the model misinterprets an error as a signal to retry indefinitely, or when circular dependencies exist between tools.

Always implement a maximum tool-call limit:

MAX_TOOL_CALLS = 15  # Hard limit per conversation turn


def chat_with_tool_limit(user_message: str) -> str:
    """Run tool-use loop with a safety limit on total tool calls."""
    messages = [{"role": "user", "content": user_message}]
    tool_call_count = 0
    limit_notified = False

    while True:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            tools=tools,
            messages=messages
        )

        if response.stop_reason == "tool_use":
            # Count tool calls in this response
            new_calls = sum(
                1 for block in response.content
                if block.type == "tool_use"
            )
            tool_call_count += new_calls

            if tool_call_count > MAX_TOOL_CALLS:
                if limit_notified:
                    # The model kept calling tools after being asked to
                    # wrap up; bail out rather than loop forever
                    return (
                        "I couldn't complete this request within the "
                        "allowed number of tool calls."
                    )
                limit_notified = True
                # Force a text response by returning errors for the
                # pending tool calls instead of executing them
                messages.append({
                    "role": "assistant",
                    "content": response.content
                })
                messages.append({
                    "role": "user",
                    "content": [{
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": (
                            "Tool call limit reached. Please provide "
                            "your best answer with the information "
                            "gathered so far."
                        ),
                        "is_error": True
                    } for block in response.content
                      if block.type == "tool_use"]
                })
                continue

            # Normal tool execution
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = execute_tool(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result
                    })

            messages.append({
                "role": "assistant",
                "content": response.content
            })
            messages.append({"role": "user", "content": tool_results})
        else:
            final_text = ""
            for block in response.content:
                if hasattr(block, "text"):
                    final_text += block.text
            return final_text
Warning: Set your max tool call limit based on your application's expected behavior. A simple Q&A bot might need at most 3-5 tool calls per turn. A complex research agent might legitimately need 15-20. Set the limit too low and you hobble the model; set it too high and you risk runaway costs and latency. Monitor actual tool call counts in production to calibrate — the data will surprise you.
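To calibrate empirically, you can pull per-turn tool-call counts from your logs and size the limit off a high percentile plus some headroom. A minimal sketch — the helper name and the observed counts are hypothetical:

```python
def suggest_tool_call_limit(counts: list[int], percentile: float = 0.95,
                            headroom: int = 2) -> int:
    """Pick a limit covering the given percentile of observed usage,
    plus headroom for legitimately complex requests."""
    ranked = sorted(counts)
    idx = min(len(ranked) - 1, int(percentile * len(ranked)))
    return ranked[idx] + headroom


# Hypothetical per-turn tool-call counts pulled from production logs
observed_counts = [1, 1, 2, 2, 2, 3, 3, 4, 5, 8]
print(suggest_tool_call_limit(observed_counts))  # 8 (p95) + 2 headroom = 10
```

Re-run this periodically: tool catalogs and prompts drift, and so does legitimate usage.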

Tool Orchestration Patterns

Individual tool calls are building blocks. Production systems need orchestration — patterns for chaining tools together, routing requests to the right tools, and managing state across multi-step workflows. These patterns show up repeatedly across different domains and are worth understanding as reusable architectural components.

Sequential Tool Chains

The simplest orchestration pattern: output of one tool feeds into the next. Search for a user, look up their orders, then check inventory for their most recent order. This requires disabling parallel tool calls and designing your prompts so the model understands the dependency chain.

def run_sequential_pipeline(user_query: str) -> str:
    """Run a sequential tool chain: search -> enrich -> respond."""
    system_prompt = """You are a customer service agent. To answer questions:
1. First search for the customer using their name or email.
2. Then look up their recent orders using the customer ID from step 1.
3. Then check order details for the relevant order.
Always complete each step before proceeding to the next."""

    messages = [{"role": "user", "content": user_query}]

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        system=system_prompt,
        tools=customer_service_tools,
        tool_choice={"type": "auto", "disable_parallel_tool_use": True},
        messages=messages
    )

    # Continue the tool-use loop (same pattern as before)
    return process_tool_loop(response, messages)
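The `process_tool_loop` helper isn't shown above; here's a minimal sketch of what it might look like, reusing the loop pattern from earlier. It assumes `client`, `execute_tool`, and `customer_service_tools` from the surrounding examples:

```python
def process_tool_loop(response, messages: list) -> str:
    """Drive the tool-use loop to completion and return the final text.
    Sketch only -- assumes `client`, `execute_tool`, and
    `customer_service_tools` from the surrounding examples."""
    while response.stop_reason == "tool_use":
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = execute_tool(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result
                })
        # Record the assistant turn and feed results back
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            tools=customer_service_tools,
            messages=messages
        )
    # No more tool calls: collect the text blocks into the final answer
    return "".join(
        block.text for block in response.content
        if hasattr(block, "text")
    )
```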

Router Pattern

The router pattern uses an initial LLM call (or a lightweight classifier) to determine which tool or tool set is appropriate, then delegates to specialized handlers. This is especially useful when you have tool categories that shouldn't be mixed — financial tools, customer service tools, and content management tools, for instance.

from dataclasses import dataclass


@dataclass
class ToolGroup:
    name: str
    tools: list
    system_prompt: str


TOOL_GROUPS = {
    "billing": ToolGroup(
        name="billing",
        tools=[lookup_invoice, process_refund, check_payment_status],
        system_prompt="You are a billing specialist. Help with invoices and payments."
    ),
    "technical": ToolGroup(
        name="technical",
        tools=[search_docs, check_service_status, create_ticket],
        system_prompt="You are a technical support agent. Help with product issues."
    ),
    "general": ToolGroup(
        name="general",
        tools=[search_faq, escalate_to_human],
        system_prompt="You are a general support agent."
    )
}


def route_and_execute(user_message: str) -> str:
    """Route user query to appropriate tool group, then execute."""
    # Step 1: Classify the query
    routing_response = client.messages.create(
        model="claude-3-5-haiku-20241022",  # Fast, cheap model for routing
        max_tokens=50,
        system=(
            "Classify the user query into one category: "
            "billing, technical, or general. "
            "Respond with only the category name."
        ),
        messages=[{"role": "user", "content": user_message}]
    )
    category = routing_response.content[0].text.strip().lower()
    group = TOOL_GROUPS.get(category, TOOL_GROUPS["general"])

    # Step 2: Execute with the appropriate tool group
    messages = [{"role": "user", "content": user_message}]
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        system=group.system_prompt,
        tools=group.tools,
        messages=messages
    )

    return process_tool_loop(response, messages)

Orchestrator-Worker Pattern

For complex tasks, an orchestrator model breaks the work into subtasks, delegates each to a worker (which may itself use tools), and synthesizes the results. This mirrors the multi-agent architectures discussed in our article on multi-agent AI systems, with tool use providing the execution mechanism.

import json


def orchestrate_research(query: str) -> str:
    """Orchestrator-worker pattern: plan, delegate, synthesize."""
    # Step 1: Orchestrator creates a plan
    plan_response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=(
            "You are a research planner. Given a question, produce a JSON "
            "array of 2-4 sub-questions that, when answered, will fully "
            "address the main question. Return only valid JSON."
        ),
        messages=[{"role": "user", "content": query}]
    )
    sub_questions = json.loads(plan_response.content[0].text)

    # Step 2: Workers execute each sub-question with tools
    worker_results = []
    for sub_q in sub_questions:
        worker_response = chat_with_tools(sub_q)  # Uses tool-calling loop
        worker_results.append({
            "question": sub_q,
            "answer": worker_response
        })

    # Step 3: Orchestrator synthesizes results
    synthesis_prompt = f"""Original question: {query}

Research results:
{json.dumps(worker_results, indent=2)}

Synthesize these results into a comprehensive answer to the original question."""

    final_response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        messages=[{"role": "user", "content": synthesis_prompt}]
    )
    return final_response.content[0].text
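One fragile spot in the planner step: `json.loads` assumes the model returns bare JSON, but models sometimes wrap output in markdown code fences or add surrounding prose. A tolerant extraction helper (the function name is our own) hardens that step:

```python
import json
import re


def parse_json_array(text: str) -> list:
    """Extract a JSON array from model output, tolerating code fences
    and surrounding prose."""
    # Strip markdown fences like ```json ... ``` if present
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", text.strip())
    # Locate the outermost array brackets
    start, end = cleaned.find("["), cleaned.rfind("]")
    if start == -1 or end == -1:
        raise ValueError("No JSON array found in model output")
    return json.loads(cleaned[start:end + 1])
```

With this in place, `parse_json_array(plan_response.content[0].text)` survives fenced or chatty planner output where a bare `json.loads` would raise.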

State Management Between Tool Calls

Multi-step tool workflows often need to maintain state — shopping cart contents, accumulated search results, running totals, conversation context. While the LLM maintains implicit state through the conversation history, explicit state management prevents context window bloat and ensures consistency.

from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolSession:
    """Maintain state across tool calls within a session."""
    session_id: str
    state: dict = field(default_factory=dict)
    tool_call_history: list = field(default_factory=list)
    total_tool_calls: int = 0
    total_tokens_used: int = 0

    def record_tool_call(
        self, tool_name: str, inputs: dict, output: Any, latency_ms: float
    ):
        self.tool_call_history.append({
            "tool": tool_name,
            "inputs": inputs,
            "output_summary": str(output)[:200],
            "latency_ms": latency_ms
        })
        self.total_tool_calls += 1

    def get_state(self, key: str, default=None):
        return self.state.get(key, default)

    def set_state(self, key: str, value: Any):
        self.state[key] = value

    def get_summary_for_model(self) -> str:
        """Generate a concise state summary to inject into the prompt."""
        return json.dumps({
            "session_id": self.session_id,
            "tools_called": self.total_tool_calls,
            "current_state": self.state
        }, indent=2)

LangGraph Integration

For more complex orchestration workflows, LangGraph provides a graph-based framework for defining stateful, multi-step agent workflows. Here's a minimal example of integrating tool use with LangGraph:

from langgraph.graph import StateGraph, END
from typing import TypedDict, Annotated
import operator


class AgentState(TypedDict):
    messages: Annotated[list, operator.add]
    tool_results: Annotated[list, operator.add]
    iteration_count: int


def should_continue(state: AgentState) -> str:
    """Decide whether to continue tool use or finish."""
    last_message = state["messages"][-1]
    if state["iteration_count"] > 5:
        return "end"
    # Continue only if the assistant's last turn contains tool_use blocks
    if any(
        getattr(block, "type", None) == "tool_use"
        for block in last_message["content"]
    ):
        return "execute_tools"
    return "end"


def call_model(state: AgentState) -> dict:
    """Call the LLM with current messages and tools."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        tools=tools,
        messages=state["messages"]
    )
    return {
        # Store a plain message dict, not the raw SDK response object,
        # so the list remains valid input for the next API call
        "messages": [{"role": "assistant", "content": response.content}],
        "iteration_count": state["iteration_count"] + 1
    }


def execute_tools(state: AgentState) -> dict:
    """Execute tool calls from the model's response."""
    last_message = state["messages"][-1]
    results = []
    for block in last_message["content"]:
        if block.type == "tool_use":
            result = execute_tool(block.name, block.input)
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": result
            })
    return {
        "messages": [{"role": "user", "content": results}],
        "tool_results": results
    }


# Build the graph
workflow = StateGraph(AgentState)
workflow.add_node("agent", call_model)
workflow.add_node("execute_tools", execute_tools)
workflow.set_entry_point("agent")
workflow.add_conditional_edges("agent", should_continue, {
    "execute_tools": "execute_tools",
    "end": END
})
workflow.add_edge("execute_tools", "agent")

app = workflow.compile()

Observability and Monitoring

You can't improve what you can't measure. Tool-use systems have a larger surface area for failure than standard LLM calls, and observability is correspondingly more important. Every tool call is a potential point of failure, latency, and cost — and you need visibility into all three.

What to Track

At minimum, instrument every tool call with these metrics:

  • Tool call frequency — Which tools are called most often? Are any tools never called (probably a sign of a bad description or an unnecessary tool)?
  • Latency per tool — How long does each tool take? Are certain tools creating bottlenecks?
  • Success/failure rates — What percentage of tool calls succeed? Which tools fail most often and why?
  • Tool calls per conversation turn — Is the model calling an unexpectedly high number of tools? That can indicate confused tool selection or looping.
  • Token usage per turn — Tool results can be large. Are certain tools returning more data than necessary, inflating token costs?
  • Argument quality — Are tool arguments well-formed? Track malformed argument rates to identify tools that need better descriptions.
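A lightweight starting point, before reaching for a full observability platform, is to aggregate these metrics from whatever records you already log per tool call. A sketch, where the record shape is an assumption:

```python
from collections import defaultdict


def summarize_tool_metrics(records: list[dict]) -> dict:
    """Aggregate per-call log records into per-tool stats.
    Assumes records shaped like:
    {"tool": "get_weather", "latency_ms": 120.0, "success": True}
    """
    by_tool = defaultdict(list)
    for rec in records:
        by_tool[rec["tool"]].append(rec)

    summary = {}
    for tool, recs in by_tool.items():
        latencies = sorted(r["latency_ms"] for r in recs)
        # Nearest-rank p95; fine for dashboards, not for SLO math
        idx = min(len(latencies) - 1, int(0.95 * len(latencies)))
        summary[tool] = {
            "calls": len(recs),
            "success_rate": sum(r["success"] for r in recs) / len(recs),
            "p95_latency_ms": latencies[idx],
        }
    return summary
```

Run this over a day of logs and the never-called tools and slow outliers jump out immediately.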

Integration with Langfuse

Langfuse provides purpose-built observability for LLM applications, with first-class support for tracing tool calls. Here's how to instrument your tool-use loop with Langfuse:

from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context
import time

langfuse = Langfuse()


@observe(as_type="generation")
def call_llm_with_tools(messages: list, tools: list) -> dict:
    """Traced LLM call with tools."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        tools=tools,
        messages=messages
    )
    langfuse_context.update_current_observation(
        metadata={
            "stop_reason": response.stop_reason,
            "tool_calls": sum(
                1 for b in response.content if b.type == "tool_use"
            )
        }
    )
    return response


@observe(as_type="tool")
def traced_tool_execution(tool_name: str, tool_input: dict) -> str:
    """Traced tool execution with timing."""
    start = time.time()
    try:
        result = execute_tool(tool_name, tool_input)
        latency = (time.time() - start) * 1000
        langfuse_context.update_current_observation(
            metadata={
                "latency_ms": latency,
                "success": True,
                "input_keys": list(tool_input.keys())
            }
        )
        return result
    except Exception as e:
        latency = (time.time() - start) * 1000
        langfuse_context.update_current_observation(
            metadata={
                "latency_ms": latency,
                "success": False,
                "error": str(e)
            }
        )
        raise

OpenTelemetry Integration

For teams already using OpenTelemetry, tool calls map naturally to spans within a trace. Each conversation turn is a parent span, with child spans for the LLM call and each tool execution:

from opentelemetry import trace
from opentelemetry.trace import StatusCode

tracer = trace.get_tracer("tool-use-agent")


def execute_tool_with_tracing(tool_name: str, tool_input: dict) -> str:
    """Execute a tool with OpenTelemetry tracing."""
    with tracer.start_as_current_span(
        f"tool.{tool_name}",
        attributes={
            "tool.name": tool_name,
            "tool.input_keys": str(list(tool_input.keys())),
        }
    ) as span:
        try:
            result = execute_tool(tool_name, tool_input)
            span.set_attribute("tool.success", True)
            span.set_attribute(
                "tool.result_length", len(result)
            )
            return result
        except Exception as e:
            span.set_status(StatusCode.ERROR, str(e))
            span.set_attribute("tool.success", False)
            span.record_exception(e)
            raise
Best Practice: Log tool inputs and outputs at the DEBUG level, and tool names, latencies, and success/failure at the INFO level. This lets you tune verbosity without touching code. In production, sample detailed logs (e.g., 10% of requests) to control storage costs while still maintaining diagnostic capability when things go sideways.
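The sampling advice above can be implemented directly with the standard `logging` module, using a filter that passes DEBUG records for only a fraction of requests. The 10% rate and logger name here are illustrative:

```python
import logging
import random


class DebugSampleFilter(logging.Filter):
    """Pass all INFO+ records, but only a sampled fraction of DEBUG."""

    def __init__(self, rate: float = 0.1):
        super().__init__()
        self.rate = rate

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno > logging.DEBUG:
            return True  # INFO and above are always logged
        return random.random() < self.rate


handler = logging.StreamHandler()
handler.addFilter(DebugSampleFilter(rate=0.1))
logging.getLogger("tool-use-agent").addHandler(handler)
```

Note this samples per record, not per request; to sample whole requests consistently, key the decision off a request ID instead of `random.random()`.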

Security Considerations

Tool use expands your LLM application's attack surface significantly. The model is now generating structured inputs that your application executes — which means prompt injection attacks can trigger real-world actions, not just produce misleading text.

Security must be designed in from the start. Not bolted on after deployment. This is non-negotiable.

Input Validation

Never trust tool inputs from the model without validation. The model generates tool arguments based on the conversation, and a malicious user can craft inputs designed to manipulate those arguments:

from pydantic import BaseModel, Field, field_validator
from typing import Optional


class WeatherInput(BaseModel):
    """Validated input for the weather tool."""
    location: str = Field(
        ...,
        min_length=2,
        max_length=100,
        description="City and state/country"
    )
    unit: Optional[str] = Field(
        default="fahrenheit",
        pattern="^(celsius|fahrenheit)$"
    )

    @field_validator("location")
    @classmethod
    def sanitize_location(cls, v):
        # Prevent injection via location field
        forbidden_chars = [";", "&", "|", "`", "$", "(", ")"]
        for char in forbidden_chars:
            if char in v:
                raise ValueError(
                    f"Invalid character in location: {char}"
                )
        return v.strip()


class DatabaseQueryInput(BaseModel):
    """Validated input for database lookup tools."""
    order_id: str = Field(
        ...,
        pattern="^ORD-[0-9]{5,10}$",
        description="Order ID in format ORD-XXXXX"
    )


def validate_and_execute(
    tool_name: str, raw_input: dict
) -> str:
    """Validate tool inputs before execution."""
    validators = {
        "get_weather": WeatherInput,
        "lookup_order": DatabaseQueryInput,
    }

    validator_class = validators.get(tool_name)
    if validator_class:
        try:
            validated = validator_class(**raw_input)
            return execute_tool(tool_name, validated.model_dump())
        except Exception as e:
            return json.dumps({
                "error": True,
                "message": f"Invalid input: {str(e)}"
            })

    return execute_tool(tool_name, raw_input)

Preventing Prompt Injection via Tool Results

Tool results get inserted into the conversation context, which means external data returned by tools can contain prompt injection attacks. An attacker who controls data in your database or API could embed instructions that manipulate the model's behavior. Mitigations include:

  • Sanitize tool results — Strip or escape any content that looks like prompt instructions before returning tool results to the model.
  • Use structured results — Return JSON with clear field names rather than free-text results. The model is less likely to treat structured data as instructions.
  • Limit result size — Truncate tool results to a reasonable length. Large results increase the surface area for injection and waste tokens.
  • Post-processing guards — After the model produces its final response, validate that it doesn't contain leaked tool data or unexpected behavioral changes. Integrate this with the guardrails patterns described in our article on LLM guardrails and safety.

import re

MAX_TOOL_RESULT_LENGTH = 5000  # Characters


def sanitize_tool_result(result: str) -> str:
    """Sanitize tool result before returning to the model."""
    # Truncate oversized results
    if len(result) > MAX_TOOL_RESULT_LENGTH:
        result = result[:MAX_TOOL_RESULT_LENGTH] + "... [truncated]"

    # Remove common prompt injection patterns
    injection_patterns = [
        r"ignore (all |your )?previous instructions",
        r"you are now",
        r"new system prompt",
        r"<system>",
        r"IMPORTANT:.*override",
    ]
    for pattern in injection_patterns:
        result = re.sub(pattern, "[filtered]", result, flags=re.IGNORECASE)

    return result

Principle of Least Privilege

Design your tools with the minimum permissions necessary for their function. A tool that reads order status shouldn't have write access to the orders database. A search tool shouldn't be able to modify the search index. This principle is familiar from traditional software security, but it's especially important when the "caller" is a probabilistic model rather than a deterministic code path.

  • Separate read and write tools — Instead of a single "manage_orders" tool, create "get_order_status" (read-only) and "update_order" (write, with additional confirmation requirements).
  • Scope API keys — Use tool-specific API keys with minimal permissions, not your application's master key.
  • Require confirmation for destructive actions — Tools that delete data, send communications, or process payments should require human-in-the-loop confirmation before execution.
  • Audit trail — Log every tool call with the full context (who triggered it, what arguments were used, what result was returned) for post-incident analysis.
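The confirmation and audit-trail points can be combined into a single gate in front of your executor. A sketch — the `DESTRUCTIVE_TOOLS` membership and the `guarded_execute` name are our own:

```python
import json

# Hypothetical set of tools that require human approval
DESTRUCTIVE_TOOLS = {"update_order", "process_refund", "delete_record"}
AUDIT_LOG: list[dict] = []


def guarded_execute(tool_name: str, tool_input: dict,
                    execute_fn, confirmed: bool = False) -> str:
    """Require explicit confirmation for destructive tools and record
    an audit entry for every executed call."""
    if tool_name in DESTRUCTIVE_TOOLS and not confirmed:
        # Surface a structured refusal the model can relay to the user
        return json.dumps({
            "status": "confirmation_required",
            "tool": tool_name
        })
    result = execute_fn(tool_name, tool_input)
    AUDIT_LOG.append({
        "tool": tool_name,
        "input": tool_input,
        "result_summary": str(result)[:200]
    })
    return result
```

In production the audit log would go to durable storage rather than an in-memory list, and `confirmed` would be set by your human-in-the-loop UI, not the model.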
Security principle: Treat every tool call as if it could be triggered by an adversary. Because, through prompt injection, it can be. Your validation, authorization, and logging layers are the defense — not the model's "judgment" about whether a request is legitimate. The model can't reliably distinguish between legitimate and malicious requests, especially when the malicious content is embedded in data rather than user input.

Putting It All Together: A Production Tool-Use Architecture

So let's assemble the patterns we've covered into a cohesive architecture. A production-grade tool-use system combines validated tool definitions, structured error handling, observability, security controls, and loop protection into a single, maintainable framework:

import anthropic
import json
import time
import logging
from dataclasses import dataclass, field
from typing import Callable, Optional

logger = logging.getLogger(__name__)


@dataclass
class ToolConfig:
    """Configuration for a registered tool."""
    name: str
    description: str
    input_schema: dict
    handler: Callable
    validator: Optional[type] = None  # Pydantic model
    requires_confirmation: bool = False
    max_result_length: int = 5000


class ProductionToolAgent:
    """Production-ready tool-use agent with error handling,
    observability, and security controls."""

    def __init__(
        self,
        model: str = "claude-sonnet-4-20250514",
        max_tool_calls: int = 15,
        max_retries: int = 2
    ):
        self.client = anthropic.Anthropic()
        self.model = model
        self.max_tool_calls = max_tool_calls
        self.max_retries = max_retries
        self.tools: dict[str, ToolConfig] = {}

    def register_tool(self, config: ToolConfig):
        """Register a tool with the agent."""
        self.tools[config.name] = config

    def _get_tool_definitions(self) -> list:
        """Generate API-compatible tool definitions."""
        return [
            {
                "name": cfg.name,
                "description": cfg.description,
                "input_schema": cfg.input_schema
            }
            for cfg in self.tools.values()
        ]

    def _execute_tool(
        self, tool_name: str, tool_input: dict
    ) -> str:
        """Execute a tool with validation and error handling."""
        config = self.tools.get(tool_name)
        if not config:
            return json.dumps({"error": f"Unknown tool: {tool_name}"})

        # Validate input
        if config.validator:
            try:
                validated = config.validator(**tool_input)
                tool_input = validated.model_dump()
            except Exception as e:
                return json.dumps({
                    "error": True,
                    "message": f"Validation failed: {str(e)}"
                })

        # Execute with retry for retriable errors
        start_time = time.time()
        try:
            result = config.handler(tool_input)
            result_str = json.dumps(result) if not isinstance(result, str) else result

            # Truncate oversized results
            if len(result_str) > config.max_result_length:
                result_str = result_str[:config.max_result_length] + "..."

            latency = (time.time() - start_time) * 1000
            logger.info(
                f"Tool {tool_name} succeeded in {latency:.0f}ms"
            )
            return result_str

        except Exception as e:
            latency = (time.time() - start_time) * 1000
            logger.error(
                f"Tool {tool_name} failed in {latency:.0f}ms: {e}"
            )
            return json.dumps({
                "error": True,
                "message": str(e)
            })

    def run(
        self, user_message: str, system_prompt: str = ""
    ) -> str:
        """Run a complete tool-use conversation."""
        messages = [{"role": "user", "content": user_message}]
        tool_call_count = 0

        while True:
            response = self.client.messages.create(
                model=self.model,
                max_tokens=4096,
                system=system_prompt,
                tools=self._get_tool_definitions(),
                messages=messages
            )

            if response.stop_reason == "tool_use":
                new_calls = sum(
                    1 for b in response.content
                    if b.type == "tool_use"
                )
                tool_call_count += new_calls

                if tool_call_count > self.max_tool_calls:
                    logger.warning(
                        f"Tool call limit reached: {tool_call_count}"
                    )
                    return (
                        "I was unable to complete this request within "
                        "the allowed number of tool calls. Please try "
                        "a more specific question."
                    )

                tool_results = []
                for block in response.content:
                    if block.type == "tool_use":
                        result = self._execute_tool(
                            block.name, block.input
                        )
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": result
                        })

                messages.append({
                    "role": "assistant",
                    "content": response.content
                })
                messages.append({
                    "role": "user",
                    "content": tool_results
                })
            else:
                final_text = ""
                for block in response.content:
                    if hasattr(block, "text"):
                        final_text += block.text
                return final_text

This framework gives you a solid foundation that handles the most common production concerns: tool registration, input validation, error handling, result sanitization, loop protection, and basic logging. Extend it with your observability platform of choice (Langfuse, OpenTelemetry, Datadog) and the security controls appropriate to your domain.

Conclusion

Tool use is what transforms an LLM from a knowledgeable conversationalist into an effective agent. The fundamentals are straightforward — define tools, handle the request-response loop, execute functions, return results. But production reliability demands a lot more: parallel execution strategies, large-catalog management with tool search, programmatic orchestration for complex workflows, robust error handling with retries and idempotency, comprehensive observability, and rigorous security controls.

The patterns in this article aren't theoretical. They're the solutions that production teams arrive at after encountering the same failure modes over and over again. Infinite tool loops, prompt injection through tool results, thundering herd retries, context window exhaustion from too many tool definitions — these are predictable problems with known solutions. Building them in from the start saves you from learning them through production incidents (which is never fun).

As the ecosystem matures, expect tool use to become even more powerful. Programmatic tool calling reduces round trips and costs. Tool search makes thousand-tool catalogs practical. Model improvements make tool selection more reliable. The direction is clear: LLM tool use is moving from a feature to a foundational capability, and teams that build robust tool-use infrastructure now will have a real advantage as deployed AI systems keep growing in complexity.

If you're building LLM-powered agents, start with the fundamentals in this article, instrument everything from day one, and design for failure. Your tools are the hands of your AI system — make sure they're steady, well-calibrated, and unable to cause harm even when directed by an adversary. That's the standard for production.

About the Author

Editorial Team — our team of expert writers and editors.