AI Workflow Orchestration in Production: Durable Agent Pipelines with LangGraph and Temporal

A hands-on guide to production-grade AI workflow orchestration — covering LangGraph checkpointing, Temporal durable execution with retry policies and Saga compensation, and the two-layer architecture for mission-critical agent pipelines.

Every week, another team posts a demo of an AI agent doing something impressive — researching a topic, writing code, booking travel, analyzing documents. The demo works flawlessly. Then they try to run it in production, and everything falls apart.

The agent crashes halfway through a ten-minute workflow. The LLM times out and the entire pipeline restarts from scratch, burning another five dollars in tokens. A server restarts at 3 AM and the customer's partially completed task just... vanishes. No state was saved. No recovery was possible. The work is simply gone.

This isn't a hypothetical scenario. It's the defining failure mode of production AI systems right now. The gap between a demo agent and a production agent isn't the model, the prompt, or the tools — it's the orchestration layer. That's the execution infrastructure that makes agent workflows durable, fault-tolerant, and recoverable. Without it, you're building on sand.

This article is a deep dive into the two technologies that have emerged as the production standard for AI workflow orchestration: LangGraph, the graph-based workflow engine from the LangChain ecosystem, and Temporal, the battle-tested durable execution platform. We'll look at each independently, then explore the increasingly popular two-layer architecture that combines them for mission-critical agent systems. By the end, you'll have a clear framework for choosing the right orchestration strategy — along with working code examples you can adapt right away.

Why AI Workflows Need Orchestration

Let's be precise about the problem. LLM calls have properties that make them uniquely difficult to orchestrate compared to traditional API calls:

  • They're expensive. A single GPT-4o call with a large context window can cost $0.10 or more. A multi-step agent workflow might make dozens of calls. Losing progress and restarting from scratch has a real dollar cost.
  • They're non-deterministic. The same input can produce different outputs. This makes naive retry logic dangerous — you can't simply replay a sequence of calls and expect the same intermediate states.
  • They're slow. A complex LLM call with tool use can take 30 seconds to over 2 minutes. Standard HTTP timeout defaults will kill these requests prematurely.
  • They fail in novel ways. Rate limits, content filters, malformed JSON outputs, hallucinated tool calls — the failure modes go well beyond "connection refused."

Now consider a multi-step agent workflow: a research agent that searches the web, synthesizes findings, generates a report, gets human approval, then publishes the result. That's five discrete steps, each involving LLM calls, each capable of failing independently. Without orchestration, you're left with a fragile chain of API calls wrapped in try/except blocks, with no state persistence, no crash recovery, and no ability to pause for human input.

Production workflows need fundamentally different infrastructure:

  • State persistence: Save intermediate results so that a crash at step four doesn't lose the work from steps one through three.
  • Configurable retry logic: Retry failed LLM calls with exponential backoff, but respect rate limits and budget constraints.
  • Compensation (rollback): If step four fails and you can't continue, undo the side effects of steps one through three.
  • Human-in-the-loop pauses: Pause execution for minutes, hours, or days while a human reviews and approves an intermediate result, then resume exactly where you left off.
  • Observability: Trace every step, every LLM call, every state transition so you can debug failures in production.

That's the difference between "chain a few API calls" and "production workflow." The rest of this article covers the tools that bridge that gap.

LangGraph: Graph-Based Agent Workflows

LangGraph hit version 1.0 in October 2025, and it's become the go-to workflow engine in the LangChain ecosystem for building stateful, multi-step agent applications. Its core abstraction is the StateGraph: a directed graph where nodes represent computation steps, edges define transitions between steps, and a typed state object flows through the entire graph.

StateGraph Architecture: Nodes, Edges, and Conditional Routing

A LangGraph workflow is defined as a graph with three components. Nodes are Python functions that receive the current state and return updates to it. Edges connect nodes and define the flow of execution. Conditional edges are functions that examine the state and route execution to different nodes based on runtime conditions — this is what gives agents their decision-making capability.

from langgraph.graph import StateGraph, START, END
from typing import TypedDict, Annotated
from operator import add

class ResearchState(TypedDict):
    query: str
    sources: Annotated[list[str], add]
    synthesis: str
    approval_status: str
    final_report: str

def search_node(state: ResearchState) -> dict:
    """Search for relevant sources based on the query."""
    results = web_search(state["query"])
    return {"sources": [r.url for r in results]}

def synthesize_node(state: ResearchState) -> dict:
    """Use an LLM to synthesize findings from sources."""
    response = llm.invoke(
        f"Synthesize these sources into a research brief:\n"
        f"Query: {state['query']}\n"
        f"Sources: {state['sources']}"
    )
    return {"synthesis": response.content}

def route_after_approval(state: ResearchState) -> str:
    """Route based on human approval status."""
    if state["approval_status"] == "approved":
        return "publish"
    return "revise"

# human_review_node is defined in a later section; publish_node and
# revise_node are omitted for brevity
builder = StateGraph(ResearchState)
builder.add_node("search", search_node)
builder.add_node("synthesize", synthesize_node)
builder.add_node("human_review", human_review_node)
builder.add_node("publish", publish_node)
builder.add_node("revise", revise_node)

builder.add_edge(START, "search")
builder.add_edge("search", "synthesize")
builder.add_edge("synthesize", "human_review")
builder.add_conditional_edges("human_review", route_after_approval)
builder.add_edge("publish", END)
builder.add_edge("revise", "synthesize")

graph = builder.compile()

The Annotated[list[str], add] syntax is worth calling out. LangGraph uses reducer functions to define how state updates are merged. Here, add means new sources get appended to the existing list rather than replacing it. This is critical for accumulating results across multiple iterations — miss this detail and you'll spend an hour wondering why your list keeps resetting.
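To see the reducer in isolation: LangGraph calls the reducer with the channel's existing value and the node's returned update, and writes back the result. A quick stand-alone illustration (plain Python, no LangGraph required; the URLs are made up):

```python
from operator import add

# Existing channel value and a node's returned update for "sources"
existing_sources = ["https://a.example/report"]
node_update = ["https://b.example/study"]

# With Annotated[list[str], add], LangGraph merges via the reducer,
# so both URLs survive across iterations
merged = add(existing_sources, node_update)
print(merged)

# Without a reducer, a node's update simply replaces the old value
replaced = node_update
print(replaced)
```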

Checkpointing and Durable Execution

Checkpointing is what transforms a LangGraph workflow from a fragile in-memory process into a durable, recoverable execution. When you compile a graph with a checkpointer, LangGraph saves a complete snapshot of the state at every superstep — basically every time the graph transitions between nodes. If the process crashes, you can resume from the last checkpoint rather than starting over.

LangGraph's ecosystem offers several checkpointer backends; three come up most often:

  • InMemorySaver: Stores checkpoints in RAM. Useful for local development and testing only. Data is lost when the process stops. Don't use this in production — seriously.
  • PostgresSaver / AsyncPostgresSaver: Stores checkpoints in PostgreSQL. This is the recommended production backend, supporting both sync and async usage.
  • DynamoDBSaver: A community-maintained checkpointer that stores checkpoints in AWS DynamoDB. A solid choice for serverless deployments on AWS.

Here's a complete example of setting up a production-grade checkpointed workflow:

from langgraph.checkpoint.postgres import PostgresSaver
from psycopg.rows import dict_row
from psycopg_pool import ConnectionPool

DB_URI = "postgresql://agent_user:secure_pass@localhost:5432/agent_db"

# Create a connection pool for production use
with ConnectionPool(
    conninfo=DB_URI,
    min_size=2,
    max_size=10,
    kwargs={"autocommit": True, "row_factory": dict_row},
) as pool:
    checkpointer = PostgresSaver(pool)

    # Run setup() once during deployment to create schema
    # In production, run this in your CI/CD migration step
    checkpointer.setup()

    # Compile the graph with the checkpointer
    graph = builder.compile(checkpointer=checkpointer)

    # Every invocation needs a thread_id for state isolation
    config = {"configurable": {"thread_id": "research-task-42"}}

    # First invocation: starts from scratch
    result = graph.invoke(
        {"query": "Impact of EU AI Act on LLM deployment"},
        config=config,
    )

    # If the process crashes and restarts, invoking with the
    # same thread_id resumes from the last checkpoint
    result = graph.invoke(None, config=config)

Two critical production details worth highlighting: first, always pass autocommit=True and row_factory=dict_row when creating Postgres connections for the checkpointer. Second, handle the .setup() call in your CI/CD migration pipeline, not in your application's hot path. I've seen teams accidentally call setup() on every request — it works, but it's wasteful.

Human-in-the-Loop with Interrupts

LangGraph provides two mechanisms for pausing execution and waiting for human input. Static interrupts use interrupt_before or interrupt_after to pause at node boundaries. Dynamic interrupts use the interrupt() function to pause inside a node based on runtime conditions.

from langgraph.types import interrupt, Command

def human_review_node(state: ResearchState) -> dict:
    """Pause for human review of the synthesized report."""
    # This pauses execution and stores the message in the checkpoint
    decision = interrupt(
        {
            "message": "Please review the synthesis below and approve or reject.",
            "synthesis": state["synthesis"],
            "options": ["approved", "rejected", "needs_revision"],
        }
    )
    return {"approval_status": decision}

# Compile with the checkpointer — required for interrupts to work.
# interrupt_before=["human_review"] is the static alternative to the
# dynamic interrupt() inside the node; use one or the other, not both.
graph = builder.compile(checkpointer=checkpointer)

# Start the workflow — it pauses at human_review
config = {"configurable": {"thread_id": "review-task-7"}}
graph.invoke({"query": "Quarterly market analysis"}, config=config)

# Hours or days later, a human reviews and resumes
graph.invoke(Command(resume="approved"), config=config)

Here's a gotcha that trips people up: when a graph resumes from an interrupt(), the entire node re-executes from the beginning. That means any logic before the interrupt() call must be idempotent. If you make an API call before pausing, it will execute again on resume. Wrap those operations in idempotency checks or (better yet) move them to a separate preceding node.
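One way to guard a pre-interrupt side effect is an external idempotency check keyed by thread ID. Here's a hedged sketch — the in-memory set stands in for Redis or a database table, and the function name and key format are illustrative, not LangGraph APIs:

```python
# Stand-in for Redis / a DB table recording which notifications went out
sent_notifications: set[str] = set()

def notify_reviewer_once(thread_id: str, synthesis: str) -> None:
    """Fire the review notification at most once per thread."""
    key = f"review-notify:{thread_id}"
    if key in sent_notifications:
        return  # a previous execution of this node already notified
    sent_notifications.add(key)
    # real implementation: send the Slack/email notification here

# First execution of the node sends the notification...
notify_reviewer_once("review-task-7", "draft synthesis")
# ...re-execution after resume is a no-op, so no duplicate side effect
notify_reviewer_once("review-task-7", "draft synthesis")
```

The same check works whether the side effect lives before an interrupt() or in a node that can be retried.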

Fan-Out and Fan-In with the Send API

The Send API enables dynamic parallel dispatch — a critical pattern for agent workflows that need to process multiple items concurrently. Each Send creates an independent execution path that runs as its own concurrent task within a single superstep.

from langgraph.types import Send

def dispatch_research(state: ResearchState) -> list[Send]:
    """Fan out: dispatch parallel research tasks for each source."""
    return [
        Send("analyze_source", {"source_url": url, "query": state["query"]})
        for url in state["sources"]
    ]

def analyze_source(state: dict) -> dict:
    """Analyze a single source — runs in parallel for each Send."""
    analysis = llm.invoke(
        f"Analyze this source for relevance to: {state['query']}\n"
        f"Source: {state['source_url']}"
    )
    # "analyses" must be declared with a reducer in ResearchState
    # (e.g. analyses: Annotated[list[str], add]) so parallel results append
    return {"analyses": [analysis.content]}

def aggregate_results(state: ResearchState) -> dict:
    """Fan in: combine all parallel analysis results."""
    combined = "\n\n".join(state["analyses"])
    summary = llm.invoke(f"Synthesize these analyses:\n{combined}")
    return {"synthesis": summary.content}

builder.add_node("analyze_source", analyze_source)
builder.add_node("aggregate", aggregate_results)

# Fan out with a conditional edge that returns Send objects;
# dispatch_research is an edge function here, not a node
builder.add_conditional_edges("search", dispatch_research, ["analyze_source"])
builder.add_edge("analyze_source", "aggregate")

All nodes within a superstep must complete before execution proceeds. If one parallel branch fails, the entire superstep fails atomically — but with checkpointing enabled, only the failing branches need to retry. You can control concurrency with graph.invoke(inputs, config={"max_concurrency": 10}) to avoid overwhelming downstream APIs.

Temporal: Durable Execution for Mission-Critical AI Workflows

Temporal is an open-source durable execution platform that's been battle-tested across thousands of production deployments at companies like Netflix, Stripe, and Snap. It takes a fundamentally different approach from LangGraph: rather than providing agent-specific abstractions, Temporal gives you general-purpose durable execution that guarantees your code will run to completion regardless of infrastructure failures.

Honestly, that guarantee is hard to overstate. When you've been woken up at 3 AM because a half-finished workflow corrupted your data, "guaranteed completion" starts to sound pretty appealing.

Core Concepts: Workflows vs Activities

Temporal's architecture cleanly separates two types of code:

  • Workflows are deterministic functions that define the orchestration logic — the sequence, conditions, and error handling of your pipeline. Workflows must not perform I/O directly: no network calls, no file reads, no random number generation. This determinism constraint is what enables Temporal's replay-based recovery.
  • Activities are non-deterministic functions that do the actual work: making LLM calls, querying databases, calling external APIs. Activities can fail and be retried independently without affecting the workflow's logical state.

The Temporal Cluster (either self-hosted or Temporal Cloud) maintains the complete event history of every workflow execution. Workers are stateless processes that poll the cluster for tasks, execute a workflow step or activity, and report the result back. Workers can be scaled horizontally, crashed, and restarted without affecting workflow integrity — the cluster is the source of truth.

Retry Policies for LLM Calls

Temporal's retry policies are far more sophisticated than simple try/except loops. For LLM calls specifically, you need to account for their unique failure characteristics:

import asyncio
from datetime import timedelta
from temporalio import activity, workflow
from temporalio.common import RetryPolicy

# Activity: wraps an LLM call with proper timeout and retry config
@activity.defn
async def call_llm_activity(prompt: str) -> str:
    """Make an LLM call as a Temporal activity."""
    response = await openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        timeout=120,  # LLM-specific timeout
    )
    return response.choices[0].message.content

@activity.defn
async def search_web_activity(query: str) -> list[str]:
    """Search the web and return source URLs."""
    results = await search_client.search(query, max_results=10)
    return [r.url for r in results]

@activity.defn
async def publish_report_activity(report: str, target: str) -> str:
    """Publish the final report to the target destination."""
    doc_id = await publish_client.create_document(
        content=report,
        destination=target,
    )
    return doc_id

# Workflow: orchestrates the activities
@workflow.defn
class ResearchWorkflow:
    @workflow.run
    async def run(self, query: str, target: str) -> str:
        # Step 1: Search for sources
        sources = await workflow.execute_activity(
            search_web_activity,
            query,
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(
                initial_interval=timedelta(seconds=1),
                maximum_interval=timedelta(seconds=10),
                maximum_attempts=3,
            ),
        )

        # Step 2: Synthesize with LLM — note the longer timeout
        synthesis = await workflow.execute_activity(
            call_llm_activity,
            f"Synthesize research on: {query}\nSources: {sources}",
            start_to_close_timeout=timedelta(minutes=3),
            retry_policy=RetryPolicy(
                initial_interval=timedelta(seconds=5),
                maximum_interval=timedelta(seconds=60),
                maximum_attempts=5,
                non_retryable_error_types=["ContentFilterError"],
            ),
        )

        # Step 3: Publish the report (multiple arguments go via args=[...])
        doc_id = await workflow.execute_activity(
            publish_report_activity,
            args=[synthesis, target],
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(maximum_attempts=3),
        )

        return doc_id

A few key configuration choices for LLM activities: set start_to_close_timeout to at least 2–3 minutes (LLM calls with large contexts and tool use are slow). Use non_retryable_error_types to immediately fail on errors that retries won't fix, like content filter violations or invalid API keys. And set generous maximum_interval values to respect rate limits during backoff.
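To make the backoff concrete, here's the wait schedule the synthesis step's policy produces, assuming Temporal's default backoff_coefficient of 2.0 — a plain-Python sketch of the arithmetic, not a Temporal API:

```python
def retry_intervals(initial: float, maximum: float,
                    coefficient: float, max_attempts: int) -> list[float]:
    """Wait time before each retry; there is no wait after the final attempt."""
    intervals, current = [], initial
    for _ in range(max_attempts - 1):
        intervals.append(min(current, maximum))  # maximum_interval caps the wait
        current *= coefficient
    return intervals

# The LLM activity above: initial 5s, max 60s, 5 attempts
print(retry_intervals(5, 60, 2.0, 5))  # [5, 10, 20, 40]
```

Note how maximum_interval flattens the curve: with more attempts, later retries all wait the capped 60 seconds, which is exactly what you want when backing off from a rate limit.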

The Saga Pattern for Compensation

The Saga pattern addresses a critical production concern: what happens when step four of a five-step workflow fails, and steps one through three have already produced side effects? You need to undo those side effects.

In Temporal, this is implemented by maintaining a list of compensation actions and executing them in reverse order on failure.

@workflow.defn
class DocumentProcessingWorkflow:
    @workflow.run
    async def run(self, document_id: str) -> str:
        compensations: list[tuple[str, str]] = []

        try:
            # Step 1: Extract text from document
            text = await workflow.execute_activity(
                extract_text_activity,
                document_id,
                start_to_close_timeout=timedelta(minutes=2),
            )
            compensations.append(("delete_extraction", document_id))

            # Step 2: Generate embeddings and store in vector DB
            embedding_id = await workflow.execute_activity(
                generate_embeddings_activity,
                text,
                start_to_close_timeout=timedelta(minutes=3),
            )
            compensations.append(("delete_embeddings", embedding_id))

            # Step 3: Generate summary with LLM
            summary = await workflow.execute_activity(
                generate_summary_activity,
                text,
                start_to_close_timeout=timedelta(minutes=3),
                retry_policy=RetryPolicy(maximum_attempts=3),
            )
            compensations.append(("delete_summary", document_id))

            # Step 4: Update search index (multiple arguments via args=[...])
            await workflow.execute_activity(
                update_search_index_activity,
                args=[document_id, summary, embedding_id],
                start_to_close_timeout=timedelta(seconds=30),
            )

            return summary

        except Exception as e:
            # Execute compensations in reverse order
            for comp_action, comp_arg in reversed(compensations):
                try:
                    await workflow.execute_activity(
                        comp_action,
                        comp_arg,
                        start_to_close_timeout=timedelta(seconds=30),
                        retry_policy=RetryPolicy(maximum_attempts=3),
                    )
                except Exception:
                    workflow.logger.error(
                        f"Compensation {comp_action} failed for {comp_arg}"
                    )
            raise

This pattern ensures that a failure in the search index update triggers the deletion of the summary, then the embeddings, then the extracted text — leaving the system in a consistent state rather than a partially processed mess. It's one of those patterns that feels like overkill until the first time it saves you from a data corruption incident.

Workflow History Considerations

Temporal stores the complete event history of every workflow execution, and this creates a specific concern for AI workflows: LLM payloads are large. A single LLM call might include a 10,000-token context and a 2,000-token response. Across dozens of activity calls, the workflow history can grow to megabytes, impacting performance and storage costs.

The solution is payload codecs. Temporal supports custom data converters that compress and decompress payloads transparently. For AI workflows, you'll want a codec that compresses large text payloads with zlib or zstd before they enter the event history.

Production teams handling LLM workflows should implement payload compression early. A simple zstd codec can reduce history size by 60–80% for text-heavy AI payloads, significantly cutting Temporal Cluster storage costs and improving replay performance.
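The codec logic itself is simple. Here's a sketch using plain dicts as stand-ins for Temporal's Payload protos — a real implementation would subclass temporalio.converter.PayloadCodec, and zstd (via the zstandard package) usually beats stdlib zlib on both speed and ratio:

```python
import zlib

def encode_payload(payload: dict) -> dict:
    # Compress the raw bytes and tag them so decode() knows how to reverse it
    return {"metadata": {"encoding": b"binary/zlib"},
            "data": zlib.compress(payload["data"])}

def decode_payload(payload: dict) -> dict:
    if payload["metadata"].get("encoding") != b"binary/zlib":
        return payload  # pass through anything we didn't compress
    return {"metadata": {"encoding": b"json/plain"},
            "data": zlib.decompress(payload["data"])}

# A text-heavy LLM payload compresses dramatically (illustrative data)
llm_payload = {"metadata": {"encoding": b"json/plain"},
               "data": b'{"response": "' + b"key finding... " * 1000 + b'"}'}
encoded = encode_payload(llm_payload)
assert decode_payload(encoded)["data"] == llm_payload["data"]  # lossless
print(f"{len(llm_payload['data'])} -> {len(encoded['data'])} bytes")
```

The encoding tag in metadata is what lets old, uncompressed histories replay correctly alongside new compressed ones.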

The Two-Layer Architecture: LangGraph + Temporal

Increasingly, production teams are converging on a two-layer architecture that combines the strengths of both systems. The insight is straightforward: LangGraph and Temporal solve different problems, and they compose naturally.

  • Temporal is the outer orchestration layer. It handles durable execution, fault tolerance, distributed parallelism, long-running lifecycle management, and cross-service coordination.
  • LangGraph is the inner agent logic layer. Each Temporal activity can contain a complete LangGraph StateGraph that handles the nuanced, conditional, graph-based reasoning of an individual agent.

Think of it like this: Temporal makes sure the train stays on the tracks and arrives at its destination, while LangGraph handles what happens inside each car.

When to Use LangGraph Alone vs Adding Temporal

LangGraph alone is sufficient when your agent workflow runs within a single service, moderate reliability is acceptable, and the workflow completes in minutes rather than hours. LangGraph's built-in checkpointing with PostgresSaver provides crash recovery, and its interrupt mechanism handles human-in-the-loop requirements.

Add Temporal when you need: workflows that span multiple services or microservices, execution durations measured in hours or days, strict exactly-once guarantees, the Saga compensation pattern, horizontal scaling of worker pools, or enterprise-grade audit trails. Temporal has been battle-tested at companies that simply cannot afford failures — that track record matters when your CEO asks "will this work at scale?"

Composing the Two Layers

So, let's look at a concrete example. Here's a Temporal workflow that coordinates multiple LangGraph agents as activities:

import asyncio
from datetime import timedelta
from typing import TypedDict

from langgraph.graph import StateGraph, START, END
from temporalio import activity, workflow
from temporalio.common import RetryPolicy

# --- LangGraph Agent Definitions ---

class AnalysisState(TypedDict):
    document: str
    key_findings: list[str]
    risk_assessment: str
    confidence: float

def extract_findings_node(state: AnalysisState) -> dict:
    findings = llm.invoke(
        f"Extract key findings from:\n{state['document']}"
    )
    return {"key_findings": findings.content.split("\n")}

def assess_risk_node(state: AnalysisState) -> dict:
    assessment = llm.invoke(
        f"Assess risks based on findings:\n{state['key_findings']}"
    )
    return {
        "risk_assessment": assessment.content,
        "confidence": extract_confidence(assessment.content),
    }

def build_analysis_graph() -> StateGraph:
    builder = StateGraph(AnalysisState)
    builder.add_node("extract", extract_findings_node)
    builder.add_node("assess", assess_risk_node)
    builder.add_edge(START, "extract")
    builder.add_edge("extract", "assess")
    builder.add_edge("assess", END)
    return builder

# --- Temporal Activities wrapping LangGraph Agents ---

@activity.defn
async def analyze_document_activity(document: str) -> dict:
    """Run the LangGraph analysis agent as a Temporal activity."""
    graph = build_analysis_graph().compile()
    result = graph.invoke({"document": document})
    return {
        "findings": result["key_findings"],
        "risk": result["risk_assessment"],
        "confidence": result["confidence"],
    }

@activity.defn
async def compile_portfolio_report_activity(
    analyses: list[dict],
) -> str:
    """Use an LLM to compile individual analyses into a report."""
    analyses_text = "\n\n".join(
        f"Document: {a['findings']}\nRisk: {a['risk']}"
        for a in analyses
    )
    response = await openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Compile a portfolio risk report:\n{analyses_text}",
        }],
    )
    return response.choices[0].message.content

# --- Temporal Workflow: Outer Orchestration ---

@workflow.defn
class PortfolioAnalysisWorkflow:
    """Temporal workflow coordinating multiple LangGraph agents."""

    @workflow.run
    async def run(self, documents: list[str]) -> str:
        # Fan out: analyze each document in parallel
        # Each activity runs a complete LangGraph agent
        analysis_tasks = []
        for doc in documents:
            task = workflow.execute_activity(
                analyze_document_activity,
                doc,
                start_to_close_timeout=timedelta(minutes=5),
                retry_policy=RetryPolicy(
                    initial_interval=timedelta(seconds=5),
                    maximum_attempts=3,
                ),
            )
            analysis_tasks.append(task)

        # Fan in: wait for all analyses to complete
        analyses = await asyncio.gather(*analysis_tasks)

        # Compile the final portfolio report
        report = await workflow.execute_activity(
            compile_portfolio_report_activity,
            list(analyses),
            start_to_close_timeout=timedelta(minutes=3),
            retry_policy=RetryPolicy(maximum_attempts=3),
        )

        return report

In this architecture, the Temporal workflow handles the macro-level orchestration: distributing documents across workers, managing parallelism, retrying failed analyses, and ensuring the entire pipeline completes. Each analyze_document_activity contains a LangGraph StateGraph that handles the micro-level agent logic — extracting findings, assessing risk, and routing based on confidence scores.

Real-World Adoption

This pattern isn't theoretical. Rakuten, GitLab, and Elastic are running LangGraph in production for their agent workflows, leveraging LangSmith for observability. Grid Dynamics has documented their migration to Temporal for agent lifecycle management, citing the need for cross-service coordination and strict durability guarantees that LangGraph's built-in checkpointing alone couldn't provide. The Temporal team has also published reference architectures for AI agent workflows, including a multi-turn conversation agent running entirely inside a Temporal workflow.

Production Patterns and Best Practices

Having covered the architecture, let's turn to the operational patterns that separate a working system from a reliable one.

Idempotent Activities

This is the single most important production pattern. Both LangGraph nodes and Temporal activities can be retried after failures. If your activity creates a database record on every execution, a retry will create a duplicate. Every activity that produces side effects must be idempotent — producing the same result regardless of how many times it's called with the same input.

For LLM calls, idempotency is nuanced because LLMs are non-deterministic. The pattern is to check whether the output of a previous execution already exists before making the call:

  • Generate a deterministic idempotency key from the activity inputs (e.g., hash the prompt and model name).
  • Check a cache or database for an existing result with that key.
  • If found, return the cached result. If not, make the LLM call and store the result.
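A minimal sketch of those three steps, with an in-memory dict standing in for Redis or Postgres (function and variable names are illustrative):

```python
import hashlib
import json

result_cache: dict[str, str] = {}  # stands in for Redis / a results table

def idempotency_key(model: str, prompt: str) -> str:
    """Deterministic key from the activity inputs."""
    payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def llm_call_idempotent(model: str, prompt: str, call_llm) -> str:
    key = idempotency_key(model, prompt)
    if key in result_cache:
        return result_cache[key]  # retry: reuse prior output, no new charge
    result = call_llm(prompt)  # only reached on the first execution
    result_cache[key] = result
    return result

# Demo with a stand-in for the real LLM client
calls = []
def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"summary of: {prompt}"

first = llm_call_idempotent("gpt-4o", "Summarize Q3", fake_llm)
retry = llm_call_idempotent("gpt-4o", "Summarize Q3", fake_llm)
assert first == retry and len(calls) == 1  # the retry never hit the "LLM"
```

Caching by input hash also neutralizes the non-determinism problem: a retried step sees the exact output the first attempt produced, not a fresh sample.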

Thread ID Patterns

LangGraph's thread_id is the primary mechanism for state isolation. Use it intentionally:

  • Single-shot tasks: Generate a unique thread_id per task (UUID). Each research query, each document analysis gets its own thread. State is discarded after completion.
  • Conversational agents: Reuse the same thread_id across multiple invocations to maintain conversation memory. The thread becomes the user's session.
  • Multi-tenant systems: Include the tenant ID in the thread ID pattern (e.g., tenant-123-task-456) to ensure clean isolation.

TTL for Checkpoint Cleanup

Checkpoints accumulate in your database. A busy production system can generate millions of checkpoints per day. LangGraph supports TTL (time-to-live) configuration on checkpoints — use it aggressively. For single-shot tasks, set a TTL of 24–48 hours. For conversational agents, set a TTL based on your session expiration policy. Monitor your checkpoint table size as a standard operational metric.

This is one of those things that's easy to forget until your Postgres instance starts running out of disk space at 2 AM.

Checkpoint Compression for Cost Optimization

LangGraph state objects that contain LLM outputs, retrieved documents, or embedding vectors can be substantial. At scale, checkpoint storage becomes a real cost driver. Avoid storing large binary objects (images, PDFs, audio) directly in state — instead, store them in object storage like S3 or GCS and keep only the reference URI in state. For text-heavy state, consider implementing a custom serializer that compresses before writing to the checkpoint backend.

Versioning State Schemas for Rolling Deployments

When you update your LangGraph state schema — adding a field, renaming a field, changing a type — existing checkpoints become incompatible. In a rolling deployment, old workers are running the old schema while new workers are running the new schema. Here's how to handle it:

  • Add new fields with default values rather than renaming or removing fields.
  • Use Optional types for new fields so that old checkpoints deserialize without errors.
  • Implement a migration path for checkpoints that need upgrading to the new schema.
  • Never delete a field from the state schema in a single release. Deprecate it first, stop writing to it, then remove it in a subsequent release.
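Here's what the additive approach looks like in practice — a sketch with illustrative field names:

```python
from typing import Optional, TypedDict

class ResearchStateV2(TypedDict, total=False):
    query: str
    synthesis: str
    # New in v2: optional with a safe default, so checkpoints written by
    # v1 workers deserialize without errors during a rolling deploy
    reviewer_notes: Optional[str]

def upgrade_checkpoint(state: dict) -> dict:
    """Backfill fields added since the checkpoint was written."""
    state.setdefault("reviewer_notes", None)
    return state

# A checkpoint persisted by an old worker lacks the new field...
v1_checkpoint = {"query": "EU AI Act", "synthesis": "draft"}
# ...and loads cleanly after the migration step
v2_state = upgrade_checkpoint(v1_checkpoint)
assert v2_state["reviewer_notes"] is None
assert v2_state["query"] == "EU AI Act"
```

Run the upgrade function at checkpoint-load time (or as a batch migration) so both old and new workers can keep making progress mid-rollout.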

Observability Integration

LangSmith is the observability platform for LangGraph workflows. It captures traces for every LLM call, tool invocation, and state transition within a graph execution. Use it to debug failures, monitor latency, track token usage, and identify performance bottlenecks. In production, every LangGraph invocation should emit a trace to LangSmith.

Temporal Web UI provides visibility into workflow executions at the orchestration level: which workflows are running, which have failed, what their event histories look like. For the two-layer architecture, you'll need both — LangSmith for agent-level debugging and Temporal Web UI for workflow-level monitoring.

Anti-Patterns to Avoid

  • Using InMemorySaver in production. This is by far the most common mistake. InMemorySaver loses all state on process restart. It exists for local development only.
  • Blocking in Temporal workflow code. Temporal workflows must be deterministic and non-blocking. Never make HTTP calls, sleep with time.sleep(), or perform I/O directly in workflow code. Use activities for all I/O.
  • Missing idempotency keys. Without idempotency, retries create duplicate side effects: duplicate database records, duplicate emails, duplicate LLM charges. It adds up fast.
  • Storing raw LLM outputs in Temporal workflow history. Large payloads bloat the event history. Use payload codecs or store outputs externally and pass references.
  • Ignoring checkpoint TTL. Unbounded checkpoint accumulation will eventually fill your database and degrade performance.
  • Hardcoding timeouts. LLM response times vary dramatically based on model, context length, and server load. Configure timeouts as environment variables and tune them based on p99 latency observations.
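To make the idempotency point concrete, here is a standard-library sketch: derive a deterministic key from the workflow ID, step name, and a hash of the payload, so a retried step produces the same key and the duplicate side effect can be detected. The function names and the in-memory set (a stand-in for a database uniqueness constraint) are illustrative.

```python
import hashlib

def idempotency_key(workflow_id: str, step: str, payload_hash: str) -> str:
    """Deterministic key: the same inputs always yield the same key,
    so a retry of the same step is recognizable downstream."""
    material = f"{workflow_id}:{step}:{payload_hash}".encode("utf-8")
    return hashlib.sha256(material).hexdigest()

_seen: set[str] = set()  # stand-in for a DB table with a unique index

def send_email_once(workflow_id: str, step: str, body: str) -> bool:
    """Returns True if the email was sent, False if this was a duplicate retry."""
    body_hash = hashlib.sha256(body.encode("utf-8")).hexdigest()
    key = idempotency_key(workflow_id, step, body_hash)
    if key in _seen:
        return False  # retry detected; skip the side effect
    _seen.add(key)
    # ... actually send the email here ...
    return True
```

In production the key would be enforced by the downstream system (a unique database constraint, or an `Idempotency-Key` header on APIs that support one) rather than process-local memory.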

Choosing the Right Orchestration Strategy

Not every agent workflow needs both LangGraph and Temporal. Here's a decision framework based on your requirements.

LangGraph Alone

Use LangGraph alone when your agent workflow runs within a single service, the workflow completes in minutes, and moderate crash recovery is sufficient (PostgresSaver checkpointing covers most cases). You'll also want it when you need human-in-the-loop capabilities but not cross-service coordination. This covers the majority of agent workflows: chatbots with tool use, RAG pipelines, single-agent research tasks, and content generation workflows.

LangGraph + Temporal

Add Temporal when you need any of the following:

  • Multi-service coordination: the workflow spans databases, APIs, and external services with side effects that need compensation.
  • Execution times measured in hours or days.
  • Strict exactly-once processing guarantees.
  • Horizontal scaling of agent worker pools across multiple machines.
  • Enterprise audit trails and compliance requirements.
  • The Saga compensation pattern for complex rollback scenarios.

This is the right choice for document processing pipelines, financial analysis workflows, multi-agent systems with complex coordination requirements, and any workflow where a failure has significant business impact.

Temporal Alone

Use Temporal without LangGraph when your workflow involves durable execution but isn't primarily LLM-driven — think data pipeline orchestration, payment processing, order fulfillment, or infrastructure automation. Temporal's durable execution is valuable for any distributed workflow, not just AI-specific ones.

Comparison Table

| Capability | LangGraph Alone | LangGraph + Temporal | Temporal Alone |
| --- | --- | --- | --- |
| Agent-specific abstractions | Strong (StateGraph, conditional routing) | Strong (LangGraph inner layer) | None (general-purpose) |
| State persistence | PostgresSaver, DynamoDB | Both layers persist state | Temporal event history |
| Crash recovery | Checkpoint-based resume | Full durable execution | Full durable execution |
| Human-in-the-loop | interrupt() / interrupt_before | Both layers support HITL | Signals and queries |
| Compensation / rollback | Manual implementation | Saga pattern in Temporal | Saga pattern |
| Multi-service coordination | Limited (single process) | Strong (Temporal workers) | Strong (Temporal workers) |
| Horizontal scaling | Requires external infra | Temporal worker pools | Temporal worker pools |
| Observability | LangSmith | LangSmith + Temporal Web UI | Temporal Web UI |
| Operational complexity | Low (single dependency) | High (two systems to operate) | Medium (Temporal Cluster) |
| Best for | Single-agent workflows | Mission-critical multi-agent | Non-LLM durable workflows |

Conclusion

The orchestration layer is the infrastructure that separates demo agents from production agents. It's not glamorous work. It doesn't make for impressive Twitter demos. But it's the reason some agent systems run reliably at scale while others crash at 3 AM and lose everything.

LangGraph provides the agent-native abstractions: graph-based workflow definition, typed state management, checkpointing, human-in-the-loop interrupts, and fan-out parallelism. It's the right starting point for most agent workflows, and with PostgresSaver checkpointing, it handles crash recovery for single-service deployments nicely. Temporal provides the enterprise-grade durable execution runtime: guaranteed completion, Saga compensation, horizontal scaling, and battle-tested reliability across thousands of production deployments. When your agent workflows become mission-critical — when a failure has real business impact — Temporal is the outer layer that ensures they finish.

The two-layer architecture, with Temporal handling macro-orchestration and LangGraph handling micro-level agent logic, is emerging as the production standard for complex AI systems. It's not always necessary (many workflows run perfectly well with LangGraph alone), but when you need it, nothing else provides the same combination of agent-native abstractions and infrastructure-grade durability.

As the industry matures, expect workflow orchestration for AI agents to become as standardized as web frameworks are for HTTP services. The teams investing in this infrastructure now are the ones that'll be running reliable, scalable agent systems while their competitors are still debugging why their demo stopped working in production. The model is only as good as the runtime that executes it.

About the Author

Editorial Team: our team of expert writers and editors.