Structured Outputs vs Function Calling: 2026 Decision Guide
When to pick structured outputs vs function calling across OpenAI, Anthropic, and Gemini in 2026: schema enforcement, latency, and runnable code for each pattern.
Use structured outputs when you need a JSON object that conforms to a schema; use function calling when the model needs to decide whether to do something and pass arguments to one of several named tools. In practice the two features overlap. Structured outputs are a constrained-decoding mode that guarantees schema compliance, while function calling is a routing mechanism that happens to emit a schema-shaped payload. Picking the wrong one costs you retries, latency, and tool-loop complexity you didn't need. So, let's walk through the differences across OpenAI, Anthropic, and Google in 2026, with runnable code.
Structured outputs guarantee a JSON payload that matches a JSON Schema. They replace JSON mode, which is now legacy across all three major vendors.
Function calling is a routing decision plus a schema; use it when the model must pick from N tools or skip them entirely.
OpenAI's strict: true and Gemini's response_schema enforce schemas via constrained decoding. Anthropic enforces tool input schemas at the sampling layer with similar guarantees.
If you only need one shape back, structured outputs beat function calling on latency (no tool-loop) and on token cost (no tool_use wrapper).
You can combine both: force structured output inside a tool's input schema when you want the model to call a tool with a guaranteed argument shape.
Evaluations should track schema-conformance rate, field-accuracy rate, and tool-selection accuracy as three separate metrics. They fail in different ways.
What is the difference between structured outputs and function calling?
The cleanest mental model I use when onboarding teams: function calling answers "should I do something, and if so, which thing?" while structured outputs answer "give me this exact shape back." The two features grew out of the same constrained-decoding work, which is why they look similar on the wire (both ultimately produce a JSON object), but the surrounding semantics are different.
With function calling, you register a list of named tools, each with a JSON Schema describing its arguments. The model either emits a tool_use / tool_calls block selecting one (or several) of those tools, or it emits a plain text response declining to call any. Your application then executes the tool and feeds the result back in a follow-up turn. The schema constrains the arguments. The routing decision is unconstrained.
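The application side of that loop can be sketched without any vendor SDK. This is a minimal illustration, not a production dispatcher: the tool functions, the TOOL_REGISTRY name, and the {"name": ..., "arguments": ...} shape of a tool call are all assumptions made for the example, and each vendor wraps the same information in its own response type.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical tool implementations; real ones would hit a database or an API.
def search_orders(order_id: str) -> Dict[str, Any]:
    return {"order_id": order_id, "status": "delivered"}

def issue_refund(order_id: str, amount_cents: int) -> Dict[str, Any]:
    return {"order_id": order_id, "refunded_cents": amount_cents}

# Maps the tool names registered with the model to local callables.
TOOL_REGISTRY: Dict[str, Callable[..., Dict[str, Any]]] = {
    "search_orders": search_orders,
    "issue_refund": issue_refund,
}

def dispatch(tool_call: Optional[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Run the tool the model selected, or return None if it declined to call one."""
    if tool_call is None:
        return None  # the model answered in plain text instead
    name = tool_call["name"]
    if name not in TOOL_REGISTRY:
        return {"error": f"unknown tool: {name}"}
    try:
        return TOOL_REGISTRY[name](**tool_call["arguments"])
    except TypeError as exc:
        # Missing or unexpected arguments surface here as a TypeError.
        return {"error": f"bad arguments for {name}: {exc}"}

print(dispatch({"name": "search_orders", "arguments": {"order_id": "ORD-481923"}}))
```

The result of dispatch is what you would serialize and send back to the model in the follow-up turn; returning errors as data rather than raising lets the model see the failure and try again.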
With structured outputs, you supply a single schema and the model is forced to emit a JSON object conforming to it on the very next response. There's no tool-loop, no second round trip, and no decision about whether to respond, only how to shape what it says. Under the hood, OpenAI implements this with token-level constrained decoding via a finite-state machine derived from the schema, and as of late 2025 Google's Gemini API uses a similar approach for response_schema. Anthropic's tool input schemas enforce types but allow more flexibility around optional fields.
The reason this distinction matters: if you only ever expect one shape back, wrapping the call in function-calling boilerplate buys you nothing. It costs you a round trip plus the input tokens for the tool_use echo on the next turn.
2026 vendor matrix: OpenAI, Anthropic, Gemini
Here's how the three major API surfaces compare today. I'll note the vendor differences without flame-warring; each has trade-offs.
A few things to flag. OpenAI's strict mode requires additionalProperties: false and every property listed in required. This trips up teams porting Pydantic models that mark fields as Optional (I hit this exact bug shipping a billing extractor last quarter). Anthropic doesn't have a dedicated structured-output flag because their answer is "define a tool, force its use," covered in the Anthropic tool use documentation. Gemini's schema dialect is the most restrictive of the three (OpenAPI 3.0 subset means no oneOf at the root and limited anyOf support).
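The strict-mode requirement above can be handled mechanically. The sketch below is one possible approach, assuming a plain JSON Schema dict as input: it closes every object with additionalProperties: false, lists every property in required, and turns previously optional properties into nullable ones. The function names are hypothetical, and it only covers the common object/properties case, not every schema keyword.

```python
import copy
from typing import Any, Dict

def to_strict_schema(schema: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy where every object is closed and all properties are required.

    Properties that were optional in the input become nullable instead,
    which is a common way to express "may be absent" under strict mode.
    """
    result = copy.deepcopy(schema)  # never mutate the caller's schema
    _strictify(result)
    return result

def _strictify(node: Any) -> None:
    if isinstance(node, list):
        for item in node:
            _strictify(item)
        return
    if not isinstance(node, dict):
        return
    if node.get("type") == "object" and "properties" in node:
        props = node["properties"]
        already_required = set(node.get("required", []))
        for name, prop in props.items():
            if name not in already_required:
                # Optional field: keep it, but allow an explicit null.
                props[name] = {"anyOf": [prop, {"type": "null"}]}
        node["required"] = list(props)
        node["additionalProperties"] = False
    # Recurse into nested schemas (properties, items, anyOf branches, ...).
    for value in node.values():
        _strictify(value)

loose = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "po_number": {"type": "string"},  # optional in the loose schema
    },
    "required": ["invoice_number"],
}
print(to_strict_schema(loose))
```

The trade-off is that "field omitted" becomes "field present and null", so downstream code has to treat null as the absent case.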
When to use structured outputs
Reach for structured outputs when all three of these are true: (1) you know in advance the shape of the response, (2) the model isn't being asked to choose between actions, and (3) you want the result in one round trip. This is the dominant pattern for data extraction, classification, and any "fill out this form" task.
Here's a runnable example using OpenAI's strict mode to extract structured invoice data:
from openai import OpenAI
from pydantic import BaseModel, Field
from typing import Literal

client = OpenAI()  # reads OPENAI_API_KEY from the environment

invoice_text = "..."  # placeholder: the raw invoice text to extract from

class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price_cents: int

class Invoice(BaseModel):
    invoice_number: str = Field(description="Vendor's invoice ID")
    issue_date: str = Field(description="ISO 8601 date")
    currency: Literal["USD", "EUR", "GBP", "JPY"]
    line_items: list[LineItem]
    total_cents: int

# Passing the Pydantic model as response_format turns on schema enforcement.
resp = client.chat.completions.parse(
    model="gpt-5",
    messages=[
        {"role": "system", "content": "Extract invoice fields exactly. No commentary."},
        {"role": "user", "content": invoice_text},
    ],
    response_format=Invoice,
)
invoice: Invoice = resp.choices[0].message.parsed
# Schema-conformant by construction. No try/except json.JSONDecodeError needed.
The equivalent on Gemini 2.5 Pro is a near-identical shape, but you pass response_schema directly:
from google import genai
from google.genai import types

client = genai.Client()
resp = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=invoice_text,
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=Invoice,  # Pydantic model accepted directly
    ),
)
invoice = Invoice.model_validate_json(resp.text)
When to use function calling
Use function calling when the model is choosing among actions: querying a database, calling an external API, escalating to a human, or doing nothing. The schema describes arguments to those actions; the routing decision is the genuine model output. This is also the right primitive for agentic loops where the conversation continues across multiple turns of tool execution.
Here's a minimal Claude example showing tool selection plus argument extraction:
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

tools = [
    {
        "name": "search_orders",
        "description": (
            "Look up customer orders by email or order ID. "
            "Provide exactly one of email or order_id."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "email": {"type": "string", "format": "email"},
                "order_id": {"type": "string", "pattern": "^ORD-[0-9]{6}$"},
            },
            # The exactly-one-of rule lives in the description and in app-side
            # validation: top-level oneOf/anyOf/allOf is rejected in input_schema.
        },
    },
    {
        "name": "issue_refund",
        "description": "Refund an order. Requires manager approval for amounts over $200.",
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string"},
                "amount_cents": {"type": "integer", "minimum": 1},
                "reason": {"type": "string", "enum": ["damaged", "wrong_item", "late", "other"]},
            },
            "required": ["order_id", "amount_cents", "reason"],
        },
    },
]

resp = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "My order ORD-481923 arrived broken, please refund."}],
)

# A response can mix text blocks and tool_use blocks; act only on the latter.
for block in resp.content:
    if block.type == "tool_use":
        print(block.name, block.input)

Three things to notice. The model picks which tool from the description alone, so that text matters as much as the schema. Write it like API documentation. The rule that exactly one of email or order ID is provided is stated in the search_orders description rather than as a oneOf, because the Messages API does not accept oneOf, anyOf, or allOf at the top level of input_schema; check that rule in your own code before executing the tool. And the enum on reason keeps the string values to a fixed set, so post-validation there reduces to a simple membership check.
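Whatever the schema expresses, cross-field rules are cheap to re-check in application code before a tool runs. The sketch below assumes the search_orders arguments arrive as a plain dict (as in block.input) and checks the exactly-one-of rule plus the order ID pattern; validate_search_orders is a hypothetical helper name, not part of any SDK.

```python
import re
from typing import Any, Dict, List

# Same pattern as the order_id property in the search_orders schema.
ORDER_ID_RE = re.compile(r"ORD-[0-9]{6}")

def validate_search_orders(args: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the arguments are usable."""
    problems: List[str] = []
    has_email = bool(args.get("email"))
    has_order_id = bool(args.get("order_id"))
    # Exactly one of the two lookups must be present: neither and both are errors.
    if has_email == has_order_id:
        problems.append("provide exactly one of email or order_id")
    if has_order_id and not ORDER_ID_RE.fullmatch(str(args["order_id"])):
        problems.append("order_id must look like ORD-123456")
    return problems

print(validate_search_orders({"order_id": "ORD-481923"}))
print(validate_search_orders({"email": "a@example.com", "order_id": "ORD-481923"}))
```

Returning the problems as a list, rather than raising, makes it easy to send them back to the model as a tool result so it can correct the call.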
For broader patterns around tool design, parallel execution, and tool-result loops, I'd point you at the production function-calling deep dive already on this site.
Is JSON mode deprecated in 2026?
Effectively yes, even where it still works. OpenAI's old response_format: {type: "json_object"} mode is still accepted by the API, but the official guidance is to migrate to json_schema mode. JSON mode gives no schema guarantees beyond "valid JSON," which means you still have to validate every field and retry on mismatch. Google's Gemini API deprecated the equivalent legacy mode in favor of response_schema during the 2.5 release cycle, and Anthropic never shipped a JSON mode at all because their tool-use mechanism already covers the use case.
The practical implication: if you're still using JSON mode in any production pipeline, you're paying the latency of retries on schema mismatches that strict mode would have eliminated. Honestly, I migrate clients off it whenever I see it in a code review. The migration is usually a 5-line change. Describe the shape of your existing dict as a Pydantic BaseModel, pass it as response_format, and delete the json.loads + try/except that follows.
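For contrast, this is roughly the scaffolding JSON mode forces on you. It is a sketch with hypothetical names (EXPECTED_TYPES, validate_reply, call_with_retries), and generate stands in for whatever function makes the model call and returns its raw text; the point is that every failed validation costs another full request.

```python
import json
from typing import Any, Callable, Dict, Optional

# Hypothetical minimal field spec; a real pipeline would use a schema validator.
EXPECTED_TYPES = {"invoice_number": str, "currency": str, "total_cents": int}

def validate_reply(raw: str) -> Dict[str, Any]:
    """Parse a JSON-mode reply and check field presence and types by hand."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected_type in EXPECTED_TYPES.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(f"field {field!r} missing or not {expected_type.__name__}")
    return data

def call_with_retries(generate: Callable[[], str], max_attempts: int = 3) -> Dict[str, Any]:
    """Call generate() until its output validates; each retry is a full model call."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            return validate_reply(generate())
        except ValueError as exc:
            last_error = exc
    raise RuntimeError(f"no valid reply after {max_attempts} attempts") from last_error

# Dry run with canned replies: the first is valid JSON but misses two fields.
canned = iter([
    '{"invoice_number": "INV-1"}',
    '{"invoice_number": "INV-1", "currency": "USD", "total_cents": 1200}',
])
print(call_with_retries(lambda: next(canned)))
```

With schema enforcement on, the validation and the retry loop both go away, which is where the latency saving comes from.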
Combining structured outputs with tools
The two features compose. The most common pattern in production agents I've shipped: define a tool, and give that tool's input schema the same strictness you'd give a structured output. The model still chooses whether to call the tool, but if it does, the arguments are schema-conformant by construction.
A subtler pattern is using a tool to force a structured output on Anthropic. Because Claude has no dedicated structured-output mode, you create a single tool whose schema describes your desired response shape, then pass tool_choice: {"type": "tool", "name": "..."} to force the model to call it. The tool_use block's input field is your structured output. This works reliably and is the idiom Anthropic recommends.
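A minimal sketch of the forced-tool idiom follows. The tool name record_invoice, the trimmed-down schema, and the two helper functions are assumptions for illustration. The request builder returns keyword arguments for client.messages.create, and the extractor is exercised against a stand-in object with the same attribute shape as an SDK response, so nothing here calls the network.

```python
from types import SimpleNamespace
from typing import Any, Dict

# Hypothetical tool name and schema, used only for illustration.
TOOL_NAME = "record_invoice"
INVOICE_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_cents": {"type": "integer"},
    },
    "required": ["invoice_number", "total_cents"],
}

def build_forced_tool_request(model: str, user_text: str) -> Dict[str, Any]:
    """Keyword arguments for messages.create that force a call to TOOL_NAME."""
    return {
        "model": model,
        "max_tokens": 1024,
        "tools": [{
            "name": TOOL_NAME,
            "description": "Record the fields extracted from an invoice.",
            "input_schema": INVOICE_SCHEMA,
        }],
        # Forcing the tool removes the routing decision: the model must call it.
        "tool_choice": {"type": "tool", "name": TOOL_NAME},
        "messages": [{"role": "user", "content": user_text}],
    }

def extract_structured_output(response: Any) -> Dict[str, Any]:
    """Return the input of the forced tool_use block, which is the payload."""
    for block in response.content:
        if block.type == "tool_use" and block.name == TOOL_NAME:
            return block.input
    raise ValueError(f"no tool_use block for {TOOL_NAME} in response")

# With the real SDK this would look like (not executed here):
#   import anthropic
#   client = anthropic.Anthropic()
#   resp = client.messages.create(**build_forced_tool_request("claude-opus-4-8", invoice_text))
#   invoice = extract_structured_output(resp)

# Stand-in response object for a dry run.
fake_response = SimpleNamespace(content=[
    SimpleNamespace(type="tool_use", name=TOOL_NAME,
                    input={"invoice_number": "INV-1042", "total_cents": 129900}),
])
print(extract_structured_output(fake_response))
```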
The downside of this idiom is one extra tool_use wrapper in the response, which costs a handful of output tokens. Not a real concern at most volumes.
How to evaluate both before you ship
Schema-first means evals-before-deploy. I'll repeat that until people stop shipping function-calling code with no evaluation harness. Function calling and structured outputs fail in different ways, and your eval suite has to distinguish them. I track three orthogonal metrics:
Schema-conformance rate. The percentage of responses that validate against the declared schema. With strict mode on OpenAI/Gemini this should be 100%. If it isn't, your schema has an issue (likely an unsupported keyword, see the OpenAI structured outputs guide). With Anthropic forced tools this should also be near 100% modulo edge cases like extremely long generations hitting max_tokens.
Field-accuracy rate. For each field, the percentage of responses where the value matches a labeled ground truth. This is where the model's actual comprehension lives. Schemas guarantee shape, not correctness. Track per-field accuracy and you'll spot the one field that drags your aggregate down.
Tool-selection accuracy. Only for function calling: the percentage where the model picked the right tool (or correctly chose to call no tool). I separate this from arg-accuracy because the failure modes are different. A model that picks the wrong tool with perfect args still ships the wrong action.
A 30-row golden set per use case is the bare minimum I'd ship with; 200+ rows gets you meaningful confidence intervals on regressions. Run it on every prompt or schema change. For the broader eval architecture (CI integration, regression alerts, dataset versioning), I wrote up the patterns in our LLM evaluation pipelines guide.
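The three metrics can be computed from one pass over a golden set. This sketch assumes each row already records what was expected, what the model produced, and whether the payload validated against the schema; the EvalRow structure and function names are hypothetical, not taken from any eval framework.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

@dataclass
class EvalRow:
    expected_tool: Optional[str]     # None means "no tool should be called"
    actual_tool: Optional[str]       # None means the model called no tool
    expected_fields: Dict[str, Any]  # labeled ground truth
    actual_fields: Dict[str, Any]    # what the model produced
    schema_valid: bool               # result of validating the payload

def schema_conformance_rate(rows: List[EvalRow]) -> float:
    return sum(r.schema_valid for r in rows) / len(rows)

def tool_selection_accuracy(rows: List[EvalRow]) -> float:
    return sum(r.expected_tool == r.actual_tool for r in rows) / len(rows)

def field_accuracy(rows: List[EvalRow]) -> Dict[str, float]:
    """Per-field accuracy over the rows that have a label for that field."""
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for row in rows:
        for name, expected in row.expected_fields.items():
            total[name] = total.get(name, 0) + 1
            if row.actual_fields.get(name) == expected:
                correct[name] = correct.get(name, 0) + 1
    return {name: correct.get(name, 0) / count for name, count in total.items()}

golden = [
    EvalRow("issue_refund", "issue_refund",
            {"order_id": "ORD-1", "reason": "damaged"},
            {"order_id": "ORD-1", "reason": "damaged"}, True),
    EvalRow("issue_refund", "issue_refund",
            {"order_id": "ORD-2", "reason": "late"},
            {"order_id": "ORD-2", "reason": "other"}, True),
    EvalRow(None, "search_orders", {}, {"email": "a@example.com"}, True),
    EvalRow("search_orders", "search_orders", {"order_id": "ORD-4"}, {}, False),
]
print(schema_conformance_rate(golden), tool_selection_accuracy(golden), field_accuracy(golden))
```

Keeping the three numbers separate is the point: in this toy set the reason field is the weak spot even though tool selection and schema conformance look acceptable.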
Common pitfalls and how to avoid them
A short list of things I see in code review more often than I'd like:
Using function calling for a single-shape extraction. If you only ever forced one tool, just use structured outputs. The tool-use wrapper is dead weight.
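Putting everything in one giant schema. OpenAI's strict mode caps both the total number of properties and the nesting depth of a schema, and the documented limits have changed over time, so check the current structured outputs guide rather than relying on a remembered number. Split into smaller calls or use referenced sub-schemas.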
Forgetting the description fields. The model uses description as part of its decoding signal, especially for ambiguous values like dates and currencies. Empty descriptions are leaving accuracy on the table.
Not handling refusals. OpenAI's strict mode can return a refusal field instead of a parsed object on safety-flagged inputs. Branch on it. Anthropic returns a text block instead of a tool_use block when the model declines, so check that resp.stop_reason == "tool_use" before assuming a tool was called.
Confusing parallel tool calls with batching. Parallel tool calls means the model emits multiple tool_use blocks in one response that you can execute concurrently; it does not batch your inputs. If you have 100 documents to extract, that's 100 separate API calls.
Skipping evals because the schema is "constrained." Schema constraints don't catch semantic errors. Always evaluate field accuracy independently.
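The refusal pitfall in the list above comes down to two small branches. The sketch below is duck-typed: it assumes an OpenAI parsed message exposes refusal and parsed attributes and an Anthropic response exposes stop_reason and content blocks with type and text, and it uses stand-in objects rather than real SDK responses. The helper names are hypothetical.

```python
from types import SimpleNamespace
from typing import Any, List, Optional, Tuple

def parsed_or_refusal(message: Any) -> Tuple[Optional[Any], Optional[str]]:
    """OpenAI-style message: return (parsed, None) or (None, refusal_text)."""
    refusal = getattr(message, "refusal", None)
    if refusal:
        return None, refusal
    return message.parsed, None

def tool_calls_or_text(response: Any) -> Tuple[List[Any], str]:
    """Anthropic-style response: return (tool_use blocks, "") or ([], text)."""
    if response.stop_reason != "tool_use":
        # No tool was called; collect whatever the model said instead.
        text = "".join(b.text for b in response.content if b.type == "text")
        return [], text
    return [b for b in response.content if b.type == "tool_use"], ""

# Stand-in objects with the same attribute shape, for a dry run.
refused = SimpleNamespace(refusal="I can't help with that.", parsed=None)
declined = SimpleNamespace(
    stop_reason="end_turn",
    content=[SimpleNamespace(type="text", text="Which order do you mean?")],
)
print(parsed_or_refusal(refused))
print(tool_calls_or_text(declined))
```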
One final note on portability: Pydantic models work as a lingua franca across all three providers, but each translates them slightly differently. The OpenAI SDK calls model_json_schema() with strict-mode tweaks; Anthropic accepts the raw output of model_json_schema(); Gemini accepts Pydantic models directly but rejects unsupported keywords with a 400. Build a thin adapter once per vendor and you can swap models with a config change. In my last project this paid off the first time we needed to A/B a cheaper model.
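One way to picture the thin adapter: a function per vendor that turns a name and a JSON Schema into the request fragment that vendor expects. This is a sketch under assumptions, not any SDK's API. The fragments are plain dicts that mirror the request shapes described earlier in this guide, the function names are hypothetical, and vendor-specific schema rewriting (strict-mode tweaks, unsupported keywords) is left out.

```python
from typing import Any, Callable, Dict

Schema = Dict[str, Any]

def openai_params(name: str, schema: Schema) -> Dict[str, Any]:
    # json_schema response format with strict enforcement turned on.
    return {"response_format": {
        "type": "json_schema",
        "json_schema": {"name": name, "schema": schema, "strict": True},
    }}

def anthropic_params(name: str, schema: Schema) -> Dict[str, Any]:
    # Single tool plus forced tool_choice; the tool input is the payload.
    return {
        "tools": [{"name": name, "description": f"Return a {name} object.",
                   "input_schema": schema}],
        "tool_choice": {"type": "tool", "name": name},
    }

def gemini_params(name: str, schema: Schema) -> Dict[str, Any]:
    # Gemini takes the schema in the generation config; the name is unused here.
    return {"config": {"response_mime_type": "application/json",
                       "response_schema": schema}}

ADAPTERS: Dict[str, Callable[[str, Schema], Dict[str, Any]]] = {
    "openai": openai_params,
    "anthropic": anthropic_params,
    "gemini": gemini_params,
}

def structured_output_params(vendor: str, name: str, schema: Schema) -> Dict[str, Any]:
    """Look up the vendor adapter and build its request fragment."""
    if vendor not in ADAPTERS:
        raise ValueError(f"unsupported vendor: {vendor}")
    return ADAPTERS[vendor](name, schema)

schema = {"type": "object", "properties": {"total_cents": {"type": "integer"}},
          "required": ["total_cents"], "additionalProperties": False}
for vendor in ADAPTERS:
    print(vendor, sorted(structured_output_params(vendor, "invoice", schema)))
```

With this in place the vendor becomes a config value, and the per-vendor quirks stay in one file instead of leaking into every call site.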
Frequently Asked Questions
Does Claude support structured outputs natively?
Not as a dedicated parameter, but the practical equivalent is forcing a single tool via tool_choice: {"type": "tool", "name": "..."}. The tool's input_schema is enforced and the resulting tool_use.input is your structured output. Reliability matches OpenAI strict mode in my evals.
Can you use structured outputs with tools at the same time?
It depends on the vendor and the model version. OpenAI accepts tools and response_format: json_schema in the same request. On the other APIs, check the current documentation, because some model versions reject a response schema when tools are also declared. A portable workaround is to define a single "final_answer" tool with your output schema and use that as the last step of the agent loop.
Are structured outputs more reliable than function calling?
For schema conformance, they're functionally identical when strict schema enforcement is turned on for both, since both then use constrained decoding. For task accuracy, structured outputs slightly edge out function calling on single-shape tasks because there's no routing decision to get wrong. Function calling wins on multi-action tasks where the routing decision is the point.
Does strict mode slow down responses?
The first call with a new schema pays a one-time processing cost while OpenAI compiles the schema into its decoding finite-state machine. OpenAI's documentation describes this as typically under 10 seconds, and longer for complex schemas. Subsequent calls reuse the cached FSM and add negligible per-token overhead. In production this is rarely noticeable, since the schema is fixed across calls.
Should I migrate off JSON mode?
Yes, on OpenAI and Gemini. JSON mode guarantees valid JSON but not schema conformance, so you still need post-validation and retry logic. Strict mode / response_schema eliminates both. The migration is usually under 10 lines of code per call site.