Why Multi-Agent AI Systems Matter in 2026
The era of the single, monolithic AI agent is winding down — and honestly, it's about time. In 2026, the market for AI agent systems is projected to hit $8.5 billion, with forecasts pointing toward $35 billion by 2030. These aren't speculative numbers from some fringe analyst report: 57% of companies now deploy AI agents in production, and Gartner estimates that 40% of enterprise applications will feature task-specific agents by year's end, up from a mere 5% in 2025.
The reason is architectural, not hype-driven.
A single agent, no matter how capable the underlying model, hits a ceiling when faced with complex, multi-domain tasks. It has to hold the entire context, decide on the next action, execute it, evaluate the result, and repeat — all within a single reasoning loop. Multi-agent systems apply the same principle that's governed effective software engineering for decades: divide and conquer. By breaking complex workflows into specialized sub-tasks, each handled by a purpose-built agent, you gain modularity, fault isolation, and the ability to scale individual components independently.
This article is a practitioner's guide. We'll cover core architecture patterns for multi-agent systems, dive deep into the Model Context Protocol (MCP) and Agent-to-Agent Protocol (A2A), compare the leading frameworks, walk through production best practices, and build a working research assistant using a supervisor pattern. If you're building agent systems today — or planning to — this is the reference you need.
Core Architecture Patterns
Before selecting a framework or protocol, you need to understand the fundamental coordination patterns available to you. Each one makes different tradeoffs between simplicity, flexibility, and fault tolerance. So, let's break them down.
Sequential Pipeline
The simplest multi-agent pattern is the sequential pipeline: Agent A completes its work, passes the result to Agent B, which passes its result to Agent C, and so on. This is the right choice when your task has strict linear dependencies — where each step genuinely requires the full output of the previous step before it can begin.
A classic example is a content generation pipeline: a Research Agent gathers source material, a Drafting Agent writes the initial text, an Editing Agent refines it, and a Fact-Checking Agent validates claims. Each stage depends entirely on the output of the previous one.
The advantages are simplicity and debuggability. The disadvantages? Latency (total time is the sum of all stages) and fragility (a failure in any stage halts the entire pipeline). Use this pattern when the dependency chain is real and unavoidable.
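The control flow reduces to a few lines. The stage functions below are toy stand-ins (in a real pipeline each would wrap an LLM call), but the shape of the loop, and the fragility it implies, is exactly this:

```python
from typing import Callable

# Toy stage functions; in a real pipeline each would wrap an LLM call.
def research(topic: str) -> str:
    return f"notes on {topic}"

def draft(notes: str) -> str:
    return f"draft based on {notes}"

def edit(text: str) -> str:
    return text.replace("draft", "polished draft")

def run_pipeline(topic: str, stages: list[Callable[[str], str]]) -> str:
    """Run stages strictly in order; any exception halts the whole pipeline."""
    result = topic
    for stage in stages:
        result = stage(result)  # each stage consumes the full previous output
    return result

report = run_pipeline("multi-agent systems", [research, draft, edit])
```

Note that total latency is necessarily the sum of all stage calls, and an exception in any stage kills the run — the tradeoffs described above, made visible.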
Supervisor / Coordinator Pattern
The supervisor pattern introduces a central orchestrator agent that decomposes tasks, delegates to specialist agents, evaluates their outputs, and synthesizes a final result. This is the workhorse pattern for most production multi-agent systems — and for good reason.
A critical design rule: exactly one agent must be designated as the orchestrator to prevent coordination conflicts. If two agents both believe they're coordinating, you get duplicated work, contradictory instructions, and race conditions that are extremely difficult to debug. I've seen this happen on real projects, and untangling it is never fun.
from langgraph.graph import StateGraph, START, END
from langgraph.prebuilt import create_react_agent
from langchain_anthropic import ChatAnthropic
# Define specialist agents
research_agent = create_react_agent(
model=ChatAnthropic(model="claude-sonnet-4-20250514"),
tools=[web_search, arxiv_search],
name="researcher",
prompt="You are a research specialist. Find relevant, recent sources."
)
analysis_agent = create_react_agent(
model=ChatAnthropic(model="claude-sonnet-4-20250514"),
tools=[data_analyzer, chart_generator],
name="analyst",
prompt="You are a data analyst. Process and interpret research findings."
)
writer_agent = create_react_agent(
model=ChatAnthropic(model="claude-sonnet-4-20250514"),
tools=[text_formatter, citation_manager],
name="writer",
prompt="You are a technical writer. Produce clear, well-structured content."
)
# Supervisor decides which agent to invoke next
supervisor = create_react_agent(
model=ChatAnthropic(model="claude-sonnet-4-20250514"),
tools=[],
name="supervisor",
prompt="""You are the orchestrator. Given the user's request, decide which
specialist to delegate to next: 'researcher', 'analyst', or 'writer'.
When all work is complete, respond with FINISH."""
)
# Build the graph
workflow = StateGraph(state_schema=AgentState)
workflow.add_node("supervisor", supervisor)
workflow.add_node("researcher", research_agent)
workflow.add_node("analyst", analysis_agent)
workflow.add_node("writer", writer_agent)
# Supervisor is the entry point and the routing hub
workflow.add_edge(START, "supervisor")
workflow.add_conditional_edges(
"supervisor",
route_to_agent, # function that parses supervisor output
{
"researcher": "researcher",
"analyst": "analyst",
"writer": "writer",
"FINISH": END,
}
)
# All specialists report back to supervisor
for agent_name in ["researcher", "analyst", "writer"]:
workflow.add_edge(agent_name, "supervisor")
graph = workflow.compile()
The supervisor examines each specialist's output, decides if another specialist needs to act, and only terminates when it judges the task complete. What makes this pattern really shine is iterative refinement — the supervisor can send work back to the researcher if the analyst identifies gaps in the data.
Router Pattern
The router pattern classifies incoming requests and fans them out to the appropriate specialist, then synthesizes the results. Unlike the supervisor, the router doesn't engage in multi-turn orchestration. It makes a single routing decision (or a small number of parallel routing decisions) and then aggregates.
This pattern excels in multi-vertical knowledge bases. Imagine a support system that handles billing, technical, and account inquiries. A router agent classifies the incoming question, dispatches it to the relevant domain agent, and returns the response. If the question spans multiple domains, the router fans out to multiple specialists in parallel and merges their answers.
The router pattern is simpler and faster than a full supervisor, but it lacks the ability to do iterative, multi-step reasoning across agents. Choose it when your problem is primarily about classification and dispatch, not complex multi-step collaboration.
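A minimal sketch of the routing logic follows. A keyword classifier stands in for what would normally be an LLM classification call, and the domain names and handlers are illustrative:

```python
# Keyword classifier standing in for an LLM routing call; domains and
# handlers are illustrative.
DOMAIN_KEYWORDS = {
    "billing": ["invoice", "charge", "refund", "payment"],
    "technical": ["error", "crash", "bug", "install"],
    "account": ["password", "email", "login", "profile"],
}

def classify(question: str) -> list[str]:
    """Return every domain the question touches (enables multi-domain fan-out)."""
    q = question.lower()
    matched = [d for d, kws in DOMAIN_KEYWORDS.items() if any(k in q for k in kws)]
    return matched or ["technical"]  # fall back to a default domain

def route(question: str, handlers: dict) -> str:
    """Single routing decision: dispatch to each matched domain, then merge."""
    answers = [handlers[d](question) for d in classify(question)]
    return "\n".join(answers)

# Stand-in handlers; each would be a real domain agent in production
handlers = {d: (lambda q, d=d: f"[{d}] handled: {q}") for d in DOMAIN_KEYWORDS}
```

The key point is that `route` makes one classification decision and aggregates — there is no loop back through the router, which is exactly what distinguishes it from a supervisor.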
Handoff Pattern
In the handoff pattern, the active agent changes dynamically based on the conversation context. Rather than a central orchestrator deciding who acts, each agent can transfer control to another agent when it determines the conversation has moved outside its expertise.
This pattern is ideal for customer support flows and multi-stage conversational experiences. Think of it this way: a user starts talking to a general triage agent, which hands off to a billing specialist when money comes up, which hands off to a retention specialist when the user threatens to cancel. It feels natural because it mirrors how real support teams work.
from agents import Agent, handoff

# Specialists are defined first; mutually referential handoffs are wired up
# after construction. tech_agent and account_agent are defined similarly
# (omitted for brevity).
billing_agent = Agent(
    name="Billing",
    instructions="You handle billing inquiries. You can issue refunds and explain charges.",
    tools=[lookup_invoice, process_refund, explain_charges],
)
retention_agent = Agent(
    name="Retention",
    instructions="You handle cancellation requests. Offer retention deals before processing.",
    tools=[get_retention_offers, process_cancellation, apply_discount],
)
triage_agent = Agent(
    name="Triage",
    instructions="""You are the first point of contact. Determine the
customer's need and hand off to the appropriate specialist.
- Billing issues: hand off to the Billing agent
- Technical problems: hand off to the Tech agent
- Account changes: hand off to the Account agent""",
    handoffs=[
        handoff(billing_agent,
                tool_description_override="Customer has billing or payment issues"),
        handoff(tech_agent,
                tool_description_override="Customer has technical problems"),
        handoff(account_agent,
                tool_description_override="Customer wants account changes"),
    ],
)
billing_agent.handoffs = [
    handoff(triage_agent,
            tool_description_override="Issue is not billing-related"),
    handoff(retention_agent,
            tool_description_override="Customer wants to cancel service"),
]
retention_agent.handoffs = [
    handoff(billing_agent,
            tool_description_override="Customer accepted a deal, needs billing adjustment"),
]
The key advantage here is that it produces a natural conversational flow. The user doesn't experience jarring context switches because each agent receives the full conversation history. The key risk? Poorly designed handoff conditions can create loops or dead ends — so test your handoff logic thoroughly.
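One cheap mitigation for handoff loops is a guard that caps how often any agent can be re-entered within a single conversation. This is a hypothetical sketch, not part of any SDK:

```python
from collections import Counter

class HandoffGuard:
    """Hypothetical guard: cap how often any agent can be re-entered within
    one conversation, to break triage <-> specialist ping-pong loops."""
    def __init__(self, max_visits: int = 2):
        self.max_visits = max_visits
        self.visits = Counter()

    def allow(self, target_agent: str) -> bool:
        if self.visits[target_agent] >= self.max_visits:
            return False  # refuse the transfer; escalate to a human instead
        self.visits[target_agent] += 1
        return True

guard = HandoffGuard(max_visits=2)
```

Run the guard once per proposed transfer; when `allow` returns False, route the conversation to a terminal path (such as human escalation) rather than letting agents bounce it between themselves.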
Swarm Pattern
The swarm pattern is a fully decentralized variant of the handoff pattern. There's no designated orchestrator. Agents dynamically hand off to each other based on their specializations, forming emergent coordination patterns. OpenAI's Swarm framework popularized this approach.
Swarms work best when the problem space is well-partitioned and each agent has a clear, non-overlapping specialization. They struggle when tasks require global coordination or when the problem decomposition is ambiguous. In practice, most production systems use a hybrid: swarm-like handoffs within a domain, supervised by a higher-level orchestrator across domains.
Model Context Protocol (MCP): The Agent-to-Tool Standard
If multi-agent architecture is the skeleton, the Model Context Protocol is the nervous system. Announced by Anthropic in November 2024 and now governed by the Linux Foundation's Agentic AI Foundation, MCP has become the de facto standard for agent-to-tool communication. The analogy that stuck (and it's a good one) is "USB-C for AI": a single, standardized interface that lets any agent connect to any tool.
The Three Primitives
MCP defines three core primitives, each designed for a different interaction pattern:
- Tools are action-oriented, analogous to POST requests. They perform operations: searching a database, sending an email, creating a record. Tools are model-controlled — the AI decides when and how to invoke them.
- Resources are data-oriented, analogous to GET requests. They expose read-only data: configuration files, database schemas, documentation. Resources are application-controlled, meaning the host application decides which resources to surface to the model.
- Prompts are reusable templates that encode common interaction patterns. They're user-controlled — the user or application selects which prompt template to activate for a given interaction.
This three-way split is deliberate. It maps cleanly to the three control planes in any agent system: what the model decides to do (tools), what the application provides as context (resources), and what the user selects as the interaction mode (prompts).
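To make the split concrete, here is a plain-Python model of the three registries — deliberately not using the MCP SDK, just illustrating who controls what (all names here are illustrative):

```python
# Plain-Python model of the three MCP primitives and who controls each.
# Illustrative only -- this does not use the MCP SDK.
registry = {
    "tools": {},      # model-controlled: the AI decides when to invoke these
    "resources": {},  # application-controlled: the host decides what to expose
    "prompts": {},    # user-controlled: the user selects the template
}

def tool(fn):
    """Register an action-oriented capability (analogous to a POST)."""
    registry["tools"][fn.__name__] = fn
    return fn

@tool
def send_email(to: str, body: str) -> str:
    return f"sent to {to}"

# Read-only data the application chooses to surface (analogous to a GET)
registry["resources"]["config://settings"] = {"max_results": 10}

# A reusable interaction template the user can select
registry["prompts"]["summarize"] = "Summarize the following text: {text}"
```

The real protocol adds discovery, typing, and transport on top, but the division of control is the same: the model picks tools, the host picks resources, the user picks prompts.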
Building an MCP Server
The Python SDK makes it pretty straightforward to expose your services as MCP-compatible tools and resources:
import json
from contextlib import asynccontextmanager
from dataclasses import dataclass
from mcp.server.fastmcp import FastMCP, Context

# Lifespan management for shared resources (DatabaseConnection and RedisCache
# are application-specific wrappers, assumed to be defined elsewhere)
@dataclass
class AppContext:
    db: DatabaseConnection
    cache: RedisCache

@asynccontextmanager
async def app_lifespan(server: FastMCP):
    """Manage application lifecycle - setup and teardown."""
    db = await DatabaseConnection.connect("postgresql://localhost/agents")
    cache = await RedisCache.connect("redis://localhost:6379")
    try:
        yield AppContext(db=db, cache=cache)
    finally:
        await db.disconnect()
        await cache.disconnect()

# Initialize the MCP server with lifespan management
mcp = FastMCP("Research Assistant Tools", lifespan=app_lifespan)

@mcp.tool()
async def search_database(query: str, ctx: Context) -> str:
    """Search the knowledge base for relevant documents.

    Args:
        query: Natural language search query
    """
    app = ctx.request_context.lifespan_context
    # Report progress for long-running operations
    await ctx.report_progress(0, 100, "Starting search...")
    results = await app.db.semantic_search(query, limit=10)
    await ctx.report_progress(50, 100, "Processing results...")
    formatted = format_search_results(results)
    await ctx.report_progress(100, 100, "Search complete")
    return formatted

@mcp.tool()
async def analyze_dataset(dataset_id: str, analysis_type: str, ctx: Context) -> str:
    """Run statistical analysis on a dataset.

    Args:
        dataset_id: Unique identifier for the dataset
        analysis_type: Type of analysis (summary, correlation, regression)
    """
    app = ctx.request_context.lifespan_context
    dataset = await app.db.get_dataset(dataset_id)
    if not dataset:
        return json.dumps({"error": f"Dataset {dataset_id} not found"})
    result = await run_analysis(dataset, analysis_type)
    # Cache the result for future reference
    cache_key = f"analysis:{dataset_id}:{analysis_type}"
    await app.cache.set(cache_key, json.dumps(result), ttl=3600)
    return json.dumps(result)

@mcp.resource("config://settings")
def get_settings() -> str:
    """Expose application settings as a readable resource."""
    return json.dumps({
        "max_search_results": 10,
        "supported_analyses": ["summary", "correlation", "regression"],
        "model_version": "2026.1"
    })

@mcp.resource("schema://database")
def get_schema() -> str:
    """Expose database schema for the model to understand data structure."""
    return json.dumps({
        "tables": {
            "documents": ["id", "title", "content", "embedding", "created_at"],
            "datasets": ["id", "name", "columns", "row_count", "source"],
            "analyses": ["id", "dataset_id", "type", "result", "timestamp"]
        }
    })

if __name__ == "__main__":
    mcp.run(transport="streamable-http")
A few things worth noting here. The lifespan pattern uses an async context manager to ensure database connections and cache clients are properly initialized at startup and cleaned up at shutdown — which is something you'll definitely want in production. The progress reporting API lets the client display meaningful status updates for long-running operations. And the resource definitions give the model structural context (like the database schema) without requiring a tool call.
2026 Enhancements
The MCP specification continues to evolve rapidly under open governance. Key enhancements in 2026 include multimodal support, allowing tools and resources to handle images, video, and audio natively. A tool can now return an annotated image or accept an audio clip as input, with the protocol handling serialization and streaming. The move to the Linux Foundation has also accelerated adoption, with major cloud providers now offering managed MCP server hosting.
Agent-to-Agent Protocol (A2A)
While MCP standardizes how agents talk to tools, the Agent-to-Agent Protocol (A2A) standardizes how agents talk to each other. Launched by Google with more than 50 partners (including Atlassian, Salesforce, and SAP), A2A is now also governed under the Linux Foundation. The two protocols are complementary: MCP handles the vertical integration (agent to tool), A2A handles the horizontal integration (agent to agent).
Core Concepts
A2A is built on three foundational ideas:
- Agent Cards: JSON metadata documents that describe an agent's capabilities, accepted input formats, authentication requirements, and endpoint URLs. Agent Cards enable discovery — a coordinator agent can query a registry of Agent Cards to find the right specialist for a given task, without any hardcoded knowledge of available agents.
- JSON-RPC 2.0 over HTTP(S): The wire protocol is deliberately simple and familiar. Every A2A interaction is a standard JSON-RPC call, making it easy to implement, debug, and integrate with existing infrastructure.
- Flexible communication modes: A2A supports synchronous request-response, streaming via Server-Sent Events (SSE), and asynchronous push notifications. This means you can use A2A for quick question-answer exchanges, long-running streaming analyses, and fire-and-forget background tasks.
Version 0.3 of the specification added gRPC support, enabling high-performance binary communication for latency-sensitive, high-throughput inter-agent communication — think real-time trading systems or autonomous vehicle coordination.
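An Agent Card itself is just a JSON document. The sketch below shows the idea — the field names follow the spirit of A2A, but the exact schema is governed by the A2A specification, so treat this as illustrative rather than schema-exact:

```python
import json

# Illustrative Agent Card; the exact schema comes from the A2A specification.
agent_card = {
    "name": "financial-analysis-agent",
    "description": "Runs financial analyses over uploaded statements",
    "url": "https://agents.example.com/finance",
    "capabilities": {"streaming": True, "pushNotifications": False},
    "defaultInputModes": ["text/plain", "application/json"],
    "authentication": {"schemes": ["bearer"]},
}

def find_agents(cards: list[dict], needs_streaming: bool) -> list[str]:
    """Discovery sketch: filter a registry of Agent Cards by capability."""
    return [
        card["name"]
        for card in cards
        if card["capabilities"].get("streaming", False) == needs_streaming
    ]

card_json = json.dumps(agent_card)  # what would actually travel over the wire
```

This is what makes discovery work without hardcoded knowledge: an orchestrator queries a registry of cards, filters by capability and input mode, and invokes whichever agent matches.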
A2A in Practice
Consider a scenario where your organization has independently developed agents across different teams: a legal compliance agent, a financial analysis agent, and a customer insights agent. Without A2A, integrating these requires custom point-to-point integrations (and trust me, that gets messy fast). With A2A, each agent publishes an Agent Card, and any other agent or orchestrator can discover and invoke it through the standardized protocol.
The combination of MCP and A2A creates a powerful interoperability layer. An orchestrator agent uses A2A to delegate a sub-task to a specialist agent. That specialist agent uses MCP to interact with its tools. The result flows back through A2A to the orchestrator, which synthesizes the final output. Neither protocol alone is sufficient; together, they form the complete communication infrastructure for multi-agent systems.
Framework Comparison: Choosing the Right Tool
The multi-agent framework landscape has matured considerably. Here's a practical comparison of the four leading options, based on benchmarks and real-world usage patterns.
LangGraph
LangGraph, built on top of LangChain, models agent workflows as directed graphs with explicit state management. Benchmark results show it's the fastest framework — approximately 2.2 times faster than CrewAI on equivalent tasks. It's also the most token-efficient, using roughly 2,589 tokens on standard benchmark tasks compared to 5,339 for CrewAI.
LangGraph's strength is fine-grained control. You explicitly define nodes (agents or functions), edges (transitions), and conditional routing logic. This makes complex workflows transparent and debuggable. The tradeoff? A steeper learning curve and more boilerplate code compared to higher-level abstractions.
Choose LangGraph when: you need maximum performance and control, your workflow has complex branching logic, or you need to optimize token usage and cost.
CrewAI
CrewAI is built around a role-based team metaphor. You define agents as team members with specific roles, goals, and backstories. Tasks are assigned to agents, and the framework handles coordination. CrewAI uses the most context per task (approximately 5,339 tokens), which means agents have richer context but at higher cost. Latency on complex tasks is around 9 seconds.
CrewAI's real strength is developer experience. Defining a team of agents and a set of tasks is intuitive and requires minimal boilerplate. If you've ever organized a team meeting, the mental model transfers pretty well.
Choose CrewAI when: you want rapid prototyping, your team is less experienced with agent architectures, or comprehensive context passing between agents matters more than raw speed.
AutoGen (Microsoft)
AutoGen models multi-agent interactions as conversations. Agents talk to each other in a chat-like format, with configurable conversation patterns (two-agent chat, group chat, nested chat). Token usage is balanced at approximately 3,316 tokens per task.
AutoGen's strength is flexibility in defining conversational collaboration patterns, especially for tasks that benefit from debate, critique, and iterative refinement. It particularly excels at code generation tasks where one agent writes code and another reviews it.
Choose AutoGen when: your task benefits from agent-to-agent deliberation, you need collaborative code generation, or you want conversational patterns that feel more natural than graph-based workflows.
OpenAI Agents SDK
The OpenAI Agents SDK is the newest entrant, providing native MCP support and built-in handoff primitives. It's designed to be minimal and opinionated: agents, handoffs, and guardrails are the core abstractions. The SDK integrates tightly with OpenAI's model ecosystem but can work with other providers.
Choose OpenAI Agents SDK when: you're building primarily on OpenAI models, you need built-in handoff and guardrail support, or you want the simplest possible API for common agent patterns.
Comparison Summary
No single framework is universally best. The decision matrix looks like this:
- Performance-critical, cost-sensitive: LangGraph
- Rapid prototyping, team-based metaphor: CrewAI
- Conversational collaboration, code review: AutoGen
- Handoff-heavy, OpenAI-native: OpenAI Agents SDK
In practice, many production systems use combinations. A LangGraph supervisor orchestrating CrewAI sub-teams is a perfectly valid — and increasingly common — architecture.
Production Best Practices
Building multi-agent systems that work in demos is straightforward. Building ones that work in production? That's a different discipline entirely. Here are the patterns and practices that separate toy systems from reliable ones.
Reliability Lives and Dies in the Handoffs
This is probably the single most important insight for multi-agent production systems: most agent failures are orchestration and context-transfer issues, not model failures. The underlying model usually produces good outputs. The system breaks when Agent A's output isn't properly formatted for Agent B's input, when context gets lost during handoffs, or when the orchestrator makes a poor routing decision.
The solution is structured handoff schemas. Don't pass free-form text between agents. Use explicit, typed structures:
from dataclasses import dataclass, field

@dataclass
class AgentHandoff:
    """Structured schema for agent-to-agent handoffs."""
    # What was accomplished
    summary: str
    # Sources and evidence
    citations: list[str] = field(default_factory=list)
    evidence_map: dict[str, list[str]] = field(default_factory=dict)
    # What remains unclear
    open_questions: list[str] = field(default_factory=list)
    # Confidence and reliability signals
    confidence: float = 0.0  # 0.0 to 1.0
    # State for downstream agents
    tool_state: dict = field(default_factory=dict)
    # Metadata
    source_agent: str = ""
    timestamp: str = ""
    token_usage: int = 0

    def to_context_prompt(self) -> str:
        """Convert the handoff to a structured prompt for the next agent."""
        sections = [
            f"## Previous Agent Summary\n{self.summary}",
            f"## Confidence: {self.confidence:.0%}",
        ]
        if self.citations:
            sections.append(
                "## Sources\n" + "\n".join(f"- {c}" for c in self.citations)
            )
        if self.open_questions:
            sections.append(
                "## Open Questions\n" + "\n".join(f"- {q}" for q in self.open_questions)
            )
        return "\n\n".join(sections)
This structure forces each agent to be explicit about what it found, how confident it is, what evidence supports its conclusions, and what questions remain. The downstream agent receives structured context rather than an ambiguous blob of text. It's a small investment that pays off enormously in debugging time.
Memory as Infrastructure
Agent memory can't be an afterthought. In production, you need tiered memory storage with different persistence and access patterns:
- Working memory: The current conversation context. Fast, ephemeral, held in the agent's context window.
- Short-term memory: Session-level state. Persisted across turns within a session but not across sessions. Typically stored in Redis or a similar in-memory store.
- Long-term memory: Cross-session knowledge. User preferences, past interactions, learned patterns. Stored in a vector database or persistent store.
- Semantic cache: Cached responses for semantically similar queries. This alone can deliver up to 90% cost reduction and 15x faster response times for repeated or similar queries.
The semantic cache layer deserves special attention. Unlike exact-match caches, a semantic cache uses embedding similarity to find previous responses that are close enough to reuse. If a user asks "What were Q3 revenue numbers?" and someone previously asked "Show me third quarter revenue," the semantic cache can serve the previous result without invoking any agents at all. Pretty clever, right?
Stateful vs. Stateless Tradeoffs
Stateless agent designs are easier to scale horizontally but require reconstructing context on every invocation. Stateful designs maintain continuity but create challenges for load balancing, failover, and horizontal scaling.
Most production systems use a stateless-with-external-state pattern: the agents themselves are stateless, but they read from and write to an external state store (Redis, DynamoDB, or a dedicated state management service). This gives you the scalability of stateless with the continuity of stateful. It's the best of both worlds, assuming you can tolerate the extra latency of external state lookups.
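The pattern is easy to sketch. A plain dict stands in for the external store (swap in a Redis or DynamoDB client in production), and `handle_turn` is a hypothetical agent entry point:

```python
import json

# A dict stands in for the external store (Redis, DynamoDB, ...); in
# production, swap in a real client with the same get/set shape.
store: dict[str, str] = {}

def load_state(session_id: str) -> dict:
    raw = store.get(f"session:{session_id}")
    return json.loads(raw) if raw else {"history": []}

def save_state(session_id: str, state: dict) -> None:
    store[f"session:{session_id}"] = json.dumps(state)

def handle_turn(session_id: str, user_msg: str) -> str:
    """The agent itself holds no state: reconstruct, act, persist, return."""
    state = load_state(session_id)
    state["history"].append(user_msg)
    reply = f"turn {len(state['history'])}: ack '{user_msg}'"
    save_state(session_id, state)
    return reply
```

Because every invocation reconstructs its context from the store, any replica can serve any session — which is precisely what makes horizontal scaling and failover straightforward.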
Observability
You can't debug what you can't observe. Every production multi-agent system needs:
- Audit trails for all tool calls: what tool was called, with what parameters, what it returned, and how long it took.
- Decision traces: why the orchestrator chose Agent B instead of Agent C, including the reasoning that led to the routing decision.
- Token usage tracking: per-agent, per-turn, and per-session — both for cost management and for detecting runaway loops.
- Latency breakdowns: where time is actually being spent, whether in model inference, tool execution, or network overhead.
OpenTelemetry has emerged as the standard instrumentation layer for agent systems, with several agent-specific extensions available for tracing multi-step workflows.
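The audit-trail requirement can be met with a simple decorator around every tool function. In production you would emit these records as OpenTelemetry spans rather than appending to a list; `web_search` here is a hypothetical tool:

```python
import functools
import time

audit_log: list[dict] = []

def audited(tool_fn):
    """Record tool name, arguments, latency, and result size for every call.
    A production system would emit these as OpenTelemetry spans instead."""
    @functools.wraps(tool_fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = tool_fn(*args, **kwargs)
        audit_log.append({
            "tool": tool_fn.__name__,
            "args": args,
            "kwargs": kwargs,
            "duration_ms": (time.perf_counter() - start) * 1000,
            "result_chars": len(str(result)),
        })
        return result
    return wrapper

@audited
def web_search(query: str) -> str:  # hypothetical tool for illustration
    return f"results for {query}"
```

The same wrapper is also a natural place to hang token counting and per-tool rate checks, since every call already flows through it.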
Security
Multi-agent systems expand the attack surface significantly. Each tool an agent can access is a potential vector. Essential security measures include:
- Role-Based Access Control (RBAC): different agents should have different permission levels. A research agent shouldn't have access to the payment processing tool. (This seems obvious, but you'd be surprised how often it's overlooked.)
- OAuth 2.0 integration: MCP supports OAuth for tool authentication. Use it. Don't pass API keys through agent context windows.
- Rate limiting: both per-agent and per-tool. A malfunctioning agent in a loop can rack up enormous costs and overload external services.
- Input sanitization: agents can be prompt-injected. Validate and sanitize all inputs before passing them to tools, especially tools that execute code or modify data.
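Per-agent rate limiting is commonly implemented as a token bucket. A minimal sketch (the agent names and budgets are illustrative):

```python
import time

class TokenBucket:
    """Per-agent token bucket: calls are refused once the budget is spent."""
    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_sec = refill_per_sec
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# Illustrative budgets: the researcher may call tools more often than the writer
limits = {"researcher": TokenBucket(5, 1.0), "writer": TokenBucket(2, 0.5)}

def call_tool(agent: str, tool_name: str) -> str:
    if not limits[agent].allow():
        return f"rate limit exceeded for {agent}"
    return f"{agent} called {tool_name}"
```

A looping agent exhausts its bucket within seconds and gets refused instead of hammering the tool, which bounds both cost and load on external services.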
Testing Multi-Agent Systems
Traditional unit testing is necessary but not sufficient. Multi-agent systems require three additional testing layers:
- Simulation testing: run the entire multi-agent system against simulated inputs and validate the end-to-end output. Use deterministic model outputs (mocked responses) to make tests reproducible.
- Integration testing: test the actual agent-to-tool and agent-to-agent communication paths. Verify that handoff schemas are correctly serialized and deserialized, that tools return expected formats, and that error conditions are handled gracefully.
- Chaos testing: deliberately inject failures. What happens when a tool times out? When an agent returns malformed output? When the orchestrator's context window fills up? Chaos testing reveals the failure modes that simulations and integration tests miss.
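The simulation-testing idea looks like this in miniature: a scripted fake model replaces real inference so the run is fully deterministic. `run_system` is a toy stand-in for your orchestrator loop:

```python
# A scripted fake model replaces real inference, making the end-to-end run
# deterministic and therefore testable.
class ScriptedModel:
    def __init__(self, responses: list[str]):
        self._responses = iter(responses)

    def invoke(self, _messages: list) -> str:
        return next(self._responses)

def run_system(model) -> list[str]:
    """Toy orchestrator loop: ask the model for a route until it says FINISH."""
    trace = []
    while True:
        decision = model.invoke([])
        if decision == "FINISH":
            return trace
        trace.append(decision)

def test_happy_path():
    model = ScriptedModel(["researcher", "analyst", "writer", "FINISH"])
    assert run_system(model) == ["researcher", "analyst", "writer"]

test_happy_path()
```

The same scaffolding supports chaos-style scripts too: feed the fake model a malformed decision or an unexpected repeat and assert that the orchestrator degrades gracefully instead of looping.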
Practical Example: Building a Research Assistant
Let's tie everything together with a complete, working example: a multi-agent research assistant using the supervisor pattern, LangGraph for orchestration, and MCP for tool integration. The system has three specialist agents coordinated by a supervisor.
System Architecture
The system comprises four agents:
- Supervisor: decomposes the research question, delegates to specialists, evaluates progress, and synthesizes the final report.
- Research Agent: performs web searches and retrieves source documents.
- Analysis Agent: processes and analyzes the retrieved data, identifies patterns, and generates insights.
- Writer Agent: transforms research findings and analysis into a coherent, well-structured report.
MCP Tool Server
First, we define the tools that our agents will use, exposed as an MCP server:
import json
import httpx
import aiofiles
from mcp.server.fastmcp import FastMCP

# SEARCH_API_KEY, extract_text, slugify, and the analysis helpers are assumed
# to be defined elsewhere in the project.
mcp = FastMCP("Research Tools")

@mcp.tool()
async def web_search(query: str, max_results: int = 5) -> str:
    """Search the web for information on a topic.

    Args:
        query: Search query string
        max_results: Maximum number of results to return
    """
    async with httpx.AsyncClient() as client:
        response = await client.get(
            "https://api.search-provider.com/search",
            params={"q": query, "count": max_results},
            headers={"Authorization": f"Bearer {SEARCH_API_KEY}"}
        )
    results = response.json()
    formatted = []
    for r in results.get("results", []):
        formatted.append({
            "title": r["title"],
            "url": r["url"],
            "snippet": r["snippet"],
            "published": r.get("datePublished", "unknown")
        })
    return json.dumps(formatted, indent=2)

@mcp.tool()
async def fetch_document(url: str) -> str:
    """Fetch and extract text content from a URL.

    Args:
        url: The URL to fetch content from
    """
    async with httpx.AsyncClient(follow_redirects=True) as client:
        response = await client.get(url, timeout=30.0)
    text = extract_text(response.text)
    return text[:10000]  # Truncate to manage context window

@mcp.tool()
async def analyze_data(data: str, analysis_type: str) -> str:
    """Perform statistical or qualitative analysis on data.

    Args:
        data: JSON string of data to analyze
        analysis_type: One of 'summarize', 'compare', 'trend', 'sentiment'
    """
    parsed = json.loads(data)
    if analysis_type == "summarize":
        result = generate_summary(parsed)
    elif analysis_type == "compare":
        result = generate_comparison(parsed)
    elif analysis_type == "trend":
        result = identify_trends(parsed)
    elif analysis_type == "sentiment":
        result = analyze_sentiment(parsed)
    else:
        result = {"error": f"Unknown analysis type: {analysis_type}"}
    return json.dumps(result, indent=2)

@mcp.tool()
async def save_report(title: str, content: str, format: str = "markdown") -> str:
    """Save a completed research report.

    Args:
        title: Report title
        content: Report content
        format: Output format (markdown or html)
    """
    filename = f"reports/{slugify(title)}.{format}"
    async with aiofiles.open(filename, "w") as f:
        await f.write(content)
    return json.dumps({"status": "saved", "path": filename})
Agent Definitions and Orchestration Graph
Now we build the LangGraph supervisor that coordinates our three specialists:
```python
from typing import TypedDict, Annotated, Literal

from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langgraph.prebuilt import create_react_agent
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, SystemMessage

# Shared state across all agents
class ResearchState(TypedDict):
    messages: Annotated[list, add_messages]
    research_data: dict
    analysis_results: dict
    report_draft: str
    iteration_count: int
    current_phase: str

# Initialize the model
model = ChatAnthropic(model="claude-sonnet-4-20250514", temperature=0)

# --- Define Specialist Agents ---
research_agent = create_react_agent(
    model=model,
    tools=[web_search, fetch_document],
    name="researcher",
    prompt="""You are an expert research agent. Your job is to:
1. Search the web for relevant, authoritative sources
2. Fetch and read key documents
3. Extract factual information with proper citations
Always cite your sources.""",
)

analysis_agent = create_react_agent(
    model=model,
    tools=[analyze_data],
    name="analyst",
    prompt="""You are a data analysis specialist. Your job is to:
1. Process research findings to identify key patterns
2. Compare information across sources for consistency
3. Flag contradictions or gaps in the research
4. Provide quantitative summaries where possible
Be rigorous. Distinguish between well-supported claims and speculation.""",
)

writer_agent = create_react_agent(
    model=model,
    tools=[save_report],
    name="writer",
    prompt="""You are a technical writer. Your job is to:
1. Synthesize research and analysis into a clear report
2. Structure the report with sections and subsections
3. Include citations for all factual claims
Produce publication-ready content.""",
)

# --- Supervisor Logic ---
SUPERVISOR_PROMPT = """You are the research supervisor coordinating a team of
three specialists: researcher, analyst, and writer.
Given the current state, decide what to do next:
- "researcher": if more information is needed
- "analyst": if gathered data needs processing
- "writer": if analysis is complete and a report should be drafted
- "FINISH": if the final report is complete and satisfactory
You MUST complete the task within 10 iterations."""

def parse_routing_decision(text: str) -> str:
    """Extract the supervisor's routing choice from its free-text reply."""
    if "FINISH" in text:
        return "FINISH"
    for option in ("researcher", "analyst", "writer"):
        if option in text:
            return option
    return "FINISH"  # Fail safe: stop rather than loop on an unparseable reply

def supervisor_node(state: ResearchState) -> dict:
    """Supervisor decides the next agent to invoke."""
    if state.get("iteration_count", 0) >= 10:
        return {
            "messages": [SystemMessage(content="Maximum iterations reached.")],
            "current_phase": "FINISH",
        }
    context = f"""Current phase: {state.get('current_phase', 'start')}
Research data collected: {'yes' if state.get('research_data') else 'no'}
Analysis completed: {'yes' if state.get('analysis_results') else 'no'}
Report drafted: {'yes' if state.get('report_draft') else 'no'}
Iteration: {state.get('iteration_count', 0)}/10"""
    messages = [
        SystemMessage(content=SUPERVISOR_PROMPT),
        HumanMessage(content=context),
    ] + state["messages"][-5:]
    response = model.invoke(messages)
    next_agent = parse_routing_decision(response.content)
    return {
        "messages": [response],
        "current_phase": next_agent,
        "iteration_count": state.get("iteration_count", 0) + 1,
    }

def route_supervisor(state: ResearchState) -> Literal[
    "researcher", "analyst", "writer", "finish"
]:
    """Route to the next agent based on the supervisor's decision."""
    phase = state.get("current_phase", "researcher")
    if phase == "FINISH":
        return "finish"
    return phase

# --- Build the Graph ---
workflow = StateGraph(ResearchState)
workflow.add_node("supervisor", supervisor_node)
workflow.add_node("researcher", research_agent)
workflow.add_node("analyst", analysis_agent)
workflow.add_node("writer", writer_agent)

workflow.add_edge(START, "supervisor")
workflow.add_conditional_edges(
    "supervisor",
    route_supervisor,
    {
        "researcher": "researcher",
        "analyst": "analyst",
        "writer": "writer",
        "finish": END,
    },
)
# Every specialist hands control back to the supervisor
for name in ["researcher", "analyst", "writer"]:
    workflow.add_edge(name, "supervisor")

research_assistant = workflow.compile()

# --- Run the System ---
async def run_research(question: str) -> str:
    """Execute a research task through the multi-agent system."""
    initial_state = {
        "messages": [HumanMessage(content=question)],
        "research_data": {},
        "analysis_results": {},
        "report_draft": "",
        "iteration_count": 0,
        "current_phase": "start",
    }
    final_state = await research_assistant.ainvoke(initial_state)
    # Fall back to the final message if no structured draft was written
    if final_state.get("report_draft"):
        return final_state["report_draft"]
    return final_state["messages"][-1].content
```
What This Example Demonstrates
This research assistant illustrates several key principles we've discussed throughout this article:
- Single orchestrator: the supervisor is the only decision-maker, preventing coordination conflicts.
- Typed state: `ResearchState` provides a structured contract between agents rather than free-form text passing.
- Loop protection: the `iteration_count` guard prevents runaway costs from infinite delegation loops.
- MCP tool integration: tools are defined as MCP-compatible functions, ready to be served over the standard protocol.
- Separation of concerns: each agent has a focused responsibility and a specialized prompt that constrains its behavior.
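The typed-state contract is worth a closer look. Here is a minimal, framework-free sketch of the underlying idea (the `append_messages` reducer and `merge` helper are illustrative, not LangGraph internals): annotating a field with a reducer lets updates accumulate, while plain fields are simply overwritten.

```python
from typing import Annotated, TypedDict, get_type_hints

def append_messages(existing: list, new: list) -> list:
    """Reducer: merge an update into the existing value instead of replacing it."""
    return existing + new

class State(TypedDict):
    messages: Annotated[list, append_messages]  # merged via the reducer
    report_draft: str                           # plain field: overwritten

def merge(state: dict, update: dict) -> dict:
    """Apply an agent's partial update, honoring each field's reducer if any."""
    hints = get_type_hints(State, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        reducers = getattr(hints[key], "__metadata__", ())
        merged[key] = reducers[0](state[key], value) if reducers else value
    return merged
```

LangGraph's `add_messages` annotation works on the same principle, which is why each agent can return only the keys it changed without clobbering the shared conversation history.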
In a production deployment, you'd add observability (OpenTelemetry spans around each node), persistent state (checkpointing intermediate results to a database), authentication (OAuth tokens for tool access), and comprehensive error handling (retries, fallbacks, and graceful degradation). But the core architecture shown here is the foundation everything else builds on.
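As one illustration of that hardening, the retry-and-fallback layer can start as a small plain-Python wrapper around any agent or tool call. The function name and parameters below are illustrative; in production you would likely reach for a dedicated library such as tenacity.

```python
import random
import time

def call_with_retries(fn, *, attempts=3, base_delay=0.5, fallback=None):
    """Call fn, retrying transient failures with exponential backoff.

    If all attempts fail, invoke the fallback (graceful degradation)
    or re-raise the final exception.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                if fallback is not None:
                    return fallback()
                raise
            # Exponential backoff with jitter to avoid synchronized retries
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

Wrapping each graph node's model and tool invocations this way keeps transient API errors from killing an entire multi-step run.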
Where Multi-Agent Systems Are Headed
Several trends are converging to shape the next phase of multi-agent AI.
Human-on-the-Loop Orchestration
The most significant shift is from human-in-the-loop (human approves every action) to human-on-the-loop (human sets policies, monitors outcomes, and intervenes only when needed). This is driven by practical necessity: as agent systems handle more tasks in parallel, per-action human approval becomes a bottleneck. Instead, organizations are defining guardrails, budgets, and escalation criteria that let agent systems operate autonomously within defined boundaries.
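To make the idea concrete, a guardrail layer can start as a simple policy object that tracks spend and step counts and signals when a human should step in. Everything below (`RunPolicy`, `should_escalate`, the dollar figures) is a hypothetical sketch, not part of any framework.

```python
from dataclasses import dataclass

@dataclass
class RunPolicy:
    """Hypothetical human-on-the-loop guardrails for one agent run."""
    max_cost_usd: float = 5.00
    max_iterations: int = 10
    spent_usd: float = 0.0
    iterations: int = 0

    def charge(self, cost_usd: float) -> None:
        """Record the cost of one agent step."""
        self.spent_usd += cost_usd
        self.iterations += 1

    def should_escalate(self) -> bool:
        """True once any boundary is crossed: hand control back to a human."""
        return (self.spent_usd >= self.max_cost_usd
                or self.iterations >= self.max_iterations)
```

The orchestrator checks `should_escalate()` between steps; within those boundaries the agents run autonomously, and the human only sees runs that cross a line.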
From Agent Users to Agent Bosses
The role of the human operator is evolving from someone who uses an agent to someone who manages a team of agents. This requires new skills: defining clear objectives, setting appropriate autonomy levels, designing escalation paths, and interpreting agent decision traces. It's management, not prompting — and honestly, that shift catches a lot of people off guard.
Protocol Convergence
The MCP and A2A ecosystems are rapidly converging under the Linux Foundation's governance. The emerging architecture is clear: MCP for agent-to-tool communication, A2A for agent-to-agent communication, and a shared discovery and authentication layer that spans both. Organizations that adopt these standards now will have interoperable agent systems. Those that build on proprietary protocols will face costly migrations later.
Enterprise Readiness
Multi-agent systems are crossing the enterprise readiness threshold in 2026. The combination of standardized protocols, mature frameworks, established best practices for observability and security, and growing organizational experience with agent operations means the question is no longer whether to build multi-agent systems but how to build them well.
The organizations that will lead are those investing in three things: strong architectural foundations (the patterns described in this article), protocol adoption (MCP and A2A, not custom integrations), and operational excellence (observability, testing, and security as first-class concerns, not afterthoughts). The multi-agent future isn't coming — it's here. The question is whether you're building it on solid ground.