How to Test AI Agents in Production: Trajectory Evaluation, Tool Validation, and CI/CD Integration

Learn to test and evaluate AI agents beyond simple output checks. Covers trajectory evaluation, tool use validation with DeepEval and LangChain AgentEvals, golden datasets, and automated CI/CD integration with Python.

Introduction: Why Testing AI Agents Is Fundamentally Different

If you've built and deployed LLM-powered applications, you already know how to evaluate model outputs — checking for relevancy, faithfulness, and hallucination. But here's the thing: AI agents aren't just LLMs generating text. They're autonomous systems that reason, plan, select tools, execute multi-step workflows, and adapt based on intermediate results. That distinction changes everything about how you test them.

Consider a travel booking agent that correctly returns a flight itinerary to the user. Traditional LLM evaluation would mark this as a pass. But what if the agent called the hotel search API instead of the flight search API, got lucky with a cached result, and returned the right answer through an entirely wrong process? That's a silent failure — the output looks correct, but the reasoning path is broken. The next time conditions change even slightly, the agent will fail catastrophically, and you won't have any idea why.

Agent evaluation has to go beyond "did it get the right answer" to ask "did it take the right steps, call the right tools with the right arguments, and arrive at the answer through sound reasoning?" This article covers the frameworks, metrics, and infrastructure you need to evaluate AI agents properly — from trajectory validation to tool use correctness to CI/CD integration. We're focusing specifically on agent-level evaluation concerns (the stuff that's distinct from basic LLM output metrics), using DeepEval's agent metrics and LangChain's AgentEvals library with full working Python examples.

The Three Layers of Agent Evaluation

Robust agent evaluation operates at three distinct layers. Each answers a different diagnostic question, and you need all three to build real confidence in a production agent system.

Black-Box Evaluation: Final Response Only

Black-box evaluation treats the agent as an opaque function — input goes in, output comes out, and you evaluate only the final result. Did the agent produce the correct answer? Did it complete the task?

This is the simplest layer. It catches gross failures but tells you nothing about how or why something went wrong. If a test fails here, you know what failed but you'll need to dig deeper to understand the cause.
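At its simplest, a black-box check is a plain assertion on the final answer. Here's a minimal sketch, with a hypothetical `run_agent` standing in for your real agent entry point:

```python
def run_agent(query: str) -> str:
    """Hypothetical stand-in for your real agent entry point."""
    return "Booking confirmed: NYC to LAX, March 25"

def test_black_box_booking():
    # Only the final output is inspected — no visibility into which
    # tools were called or in what order
    answer = run_agent("Book me a flight from NYC to LAX on March 25")
    assert "confirmed" in answer.lower()
    assert "NYC" in answer and "LAX" in answer

test_black_box_booking()  # passes — but the reasoning path is unchecked
```

Note that this test would pass even if the agent reached the answer through a broken tool sequence, which is exactly the silent-failure risk described earlier.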

Glass-Box / Trajectory Evaluation: Did the Agent Take the Correct Path?

This is where things get interesting. Trajectory evaluation examines the sequence of actions the agent took — which tools it called, in what order, and with what arguments. It's where agent evaluation diverges most sharply from LLM evaluation.

You compare the agent's actual trajectory (the ordered list of tool calls and reasoning steps) against an expected trajectory. This layer tells you where in the process things went wrong. Did the agent call the right tool but in the wrong order? Did it skip a required validation step? Did it call unnecessary tools, wasting tokens and latency?

White-Box / Single-Step Evaluation: Unit Testing Each Decision

Single-step evaluation isolates individual decision points within the agent's execution. Given a specific state and context, did the agent make the correct choice at that particular step? Think of it as the agent equivalent of unit testing — you feed the agent a specific intermediate state and verify it selects the correct next action.

This layer tells you why the agent made a wrong decision — whether the prompt was ambiguous, the tool descriptions were misleading, or the model simply failed to reason correctly.
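A single-step test can look like an ordinary unit test: freeze an intermediate state and assert on one decision. In this sketch, `choose_next_action` is a hypothetical stand-in for the decision point inside your agent (for example, the LLM call that selects a tool):

```python
def choose_next_action(state: dict) -> str:
    """Hypothetical decision point: pick the next tool given agent state."""
    # Toy policy for illustration only — your agent would call an LLM here
    if state.get("flight_id"):
        return "book_flight"
    return "search_flights"

def test_books_after_successful_search():
    # Fixed intermediate state: a flight has already been found
    state = {"query": "Book NYC to LAX", "flight_id": "FL123", "price": 349.00}
    assert choose_next_action(state) == "book_flight"

def test_searches_when_no_flight_found_yet():
    assert choose_next_action({"query": "Book NYC to LAX"}) == "search_flights"

test_books_after_successful_search()
test_searches_when_no_flight_found_yet()
```

In a real harness you'd replace the toy policy with a single call into your agent's planning step, keeping the surrounding state fixed so a failure isolates that one decision.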

Why You Need All Three

These layers form a diagnostic hierarchy. When a black-box test fails, trajectory evaluation narrows the failure to a specific step. Single-step evaluation then isolates the root cause at that step.

Running only black-box tests is like having integration tests without unit tests — you know something broke, but debugging is painful. Running only single-step tests misses emergent failures that only appear when steps interact. The combination gives you both coverage and debuggability.

  • Final response tells you WHAT failed
  • Trajectory tells you WHERE it failed
  • Single step tells you WHY it failed

I've found that teams who skip trajectory evaluation end up spending way more time debugging production issues. The upfront investment pays for itself quickly.

Setting Up Agent Evaluation with DeepEval

DeepEval provides dedicated agent evaluation metrics that go beyond its LLM evaluation capabilities. If you're already familiar with its LLM-level metrics for evaluating text outputs, the agent-specific metrics add a whole new dimension — they evaluate tool selection, argument correctness, and task completion across multi-step workflows.

Installation and Setup

pip install deepeval openai
deepeval login  # optional: connect to Confident AI for dashboard

The @observe Decorator for Tracing Agent Execution

DeepEval's @observe decorator automatically captures the full execution trace of your agent, including every tool call, argument, and intermediate output. This trace data feeds directly into the agent evaluation metrics.

from deepeval.tracing import observe, update_current_span

@observe(type="agent")
def travel_booking_agent(query: str) -> str:
    # Agent logic here — DeepEval captures the full trace
    update_current_span(output="Booking confirmed: NYC to LAX, March 25")
    return "Booking confirmed: NYC to LAX, March 25"

Agent-Specific Metrics

DeepEval provides two categories of agent metrics that operate at different granularities.

End-to-end metrics evaluate the agent's overall performance across the full task:

  • TaskCompletionMetric — did the agent fully accomplish the requested task?
  • StepEfficiencyMetric — did the agent complete the task without unnecessary steps or redundant tool calls?

Component-level metrics evaluate individual tool interactions:

  • ToolCorrectnessMetric — did the agent select the correct tool at each step?
  • ArgumentCorrectnessMetric — did the agent pass the correct arguments to each tool?

Full Working Example: Travel Booking Agent Evaluation

So, let's put this all together with a concrete example. Here's a complete evaluation setup for a travel booking agent:

from deepeval import evaluate
from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.metrics import (
    TaskCompletionMetric,
    ToolCorrectnessMetric,
)
from deepeval.tracing import observe, update_current_span

# Define the tools your agent can use
@observe(type="tool", name="search_flights")
def search_flights(origin: str, destination: str, date: str) -> dict:
    update_current_span(
        input={"origin": origin, "destination": destination, "date": date},
        output={"flight_id": "FL123", "price": 349.00, "airline": "United"}
    )
    return {"flight_id": "FL123", "price": 349.00, "airline": "United"}

@observe(type="tool", name="book_flight")
def book_flight(flight_id: str, passenger_name: str) -> dict:
    update_current_span(
        input={"flight_id": flight_id, "passenger_name": passenger_name},
        output={"confirmation": "BK-98712", "status": "confirmed"}
    )
    return {"confirmation": "BK-98712", "status": "confirmed"}

@observe(type="agent", name="TravelBookingAgent")
def travel_booking_agent(query: str) -> str:
    # Step 1: Search for flights
    flights = search_flights("NYC", "LAX", "2026-03-25")

    # Step 2: Book the selected flight
    booking = book_flight(flights["flight_id"], "Alice Johnson")

    result = f"Booked flight {flights['airline']} FL123, confirmation: {booking['confirmation']}"
    update_current_span(output=result)
    return result

# Run the agent — @observe captures the execution trace automatically
result = travel_booking_agent("Book me a flight from NYC to LAX on March 25")

# Define evaluation metrics
task_completion = TaskCompletionMetric(threshold=0.7)
tool_correctness = ToolCorrectnessMetric(threshold=0.8)

# Create a test case; expected_tools and tools_called take ToolCall objects.
# (The tool calls are listed by hand here for clarity — in a real pipeline,
# extract them from the captured trace instead.)
test_case = LLMTestCase(
    input="Book me a flight from NYC to LAX on March 25",
    actual_output=result,
    expected_output="Flight booked from NYC to LAX on March 25 with confirmation number",
    expected_tools=[ToolCall(name="search_flights"), ToolCall(name="book_flight")],
    tools_called=[ToolCall(name="search_flights"), ToolCall(name="book_flight")],
)

# Evaluate
evaluate(
    test_cases=[test_case],
    metrics=[task_completion, tool_correctness],
)

This example evaluates both whether the agent completed the task and whether it called the correct tools. The trace captures the full execution path, so the metrics can assess not just the output but the entire process.

Trajectory Evaluation with LangChain AgentEvals

LangChain's agentevals library provides specialized evaluators for comparing an agent's actual trajectory — the sequence of tool calls it made — against an expected reference trajectory. It's particularly useful for agents built with LangGraph, but it works with any agent framework that produces structured tool call outputs.

What Trajectory Evaluation Means

A trajectory is simply the ordered list of actions an agent took: which tools it called, in what sequence, and with what arguments. Trajectory evaluation compares the agent's actual trajectory against a reference and determines whether the agent followed an acceptable path.

There are two fundamental approaches: using an LLM as a judge for qualitative assessment, or using deterministic matching for exact comparison.

LLM-as-Judge Trajectory Evaluation

The create_trajectory_llm_as_judge function creates an evaluator that uses an LLM to assess whether the agent's trajectory is reasonable, even if it doesn't exactly match the reference. This is really useful when multiple valid paths exist (which, in my experience, is most real-world scenarios).

from agentevals.trajectory.llm import create_trajectory_llm_as_judge
from langchain_core.messages import AIMessage, ToolMessage, HumanMessage

# Create the LLM-based trajectory evaluator
# (model takes any "provider:model" string supported in your environment)
trajectory_judge = create_trajectory_llm_as_judge(model="openai:o3-mini")

# Define the actual trajectory (what the agent did)
actual_trajectory = [
    HumanMessage(content="Book a flight from NYC to LAX on March 25"),
    AIMessage(
        content="",
        tool_calls=[{"name": "search_flights", "args": {"origin": "NYC", "destination": "LAX", "date": "2026-03-25"}, "id": "call_1"}]
    ),
    ToolMessage(content='{"flight_id": "FL123", "price": 349}', tool_call_id="call_1"),
    AIMessage(
        content="",
        tool_calls=[{"name": "book_flight", "args": {"flight_id": "FL123", "passenger": "user"}, "id": "call_2"}]
    ),
    ToolMessage(content='{"confirmation": "BK-98712"}', tool_call_id="call_2"),
    AIMessage(content="Your flight is booked. Confirmation: BK-98712"),
]

# Define the expected reference trajectory
reference_trajectory = [
    HumanMessage(content="Book a flight from NYC to LAX on March 25"),
    AIMessage(
        content="",
        tool_calls=[{"name": "search_flights", "args": {"origin": "NYC", "destination": "LAX", "date": "2026-03-25"}, "id": "ref_1"}]
    ),
    ToolMessage(content='{"flight_id": "FL123", "price": 349}', tool_call_id="ref_1"),
    AIMessage(
        content="",
        tool_calls=[{"name": "book_flight", "args": {"flight_id": "FL123", "passenger": "user"}, "id": "ref_2"}]
    ),
    ToolMessage(content='{"confirmation": "BK-98712"}', tool_call_id="ref_2"),
    AIMessage(content="Booked! Your confirmation number is BK-98712."),
]

# Evaluate — agentevals evaluators are plain callables
result = trajectory_judge(
    outputs=actual_trajectory,
    reference_outputs=reference_trajectory,
)
print(f"Score: {result['score']}")  # True = acceptable, False = not acceptable

Deterministic Trajectory Matching

The create_trajectory_match_evaluator function provides exact, deterministic comparison between trajectories. It supports four matching modes that control how strictly the tool call sequences must align:

from agentevals.trajectory.match import create_trajectory_match_evaluator

# Strict mode: exact order and exact tool calls must match
strict_evaluator = create_trajectory_match_evaluator(
    trajectory_match_mode="strict",
    tool_args_match_mode="exact",
)

# Superset mode: agent can call extra tools beyond what's expected
superset_evaluator = create_trajectory_match_evaluator(
    trajectory_match_mode="superset",
    tool_args_match_mode="subset",
)

# Unordered mode: same tools called, order doesn't matter
unordered_evaluator = create_trajectory_match_evaluator(
    trajectory_match_mode="unordered",
    tool_args_match_mode="ignore",
)

result = strict_evaluator(
    outputs=actual_trajectory,
    reference_outputs=reference_trajectory,
)
print(f"Match: {result['score']}")  # True = match, False = no match

Matching Mode Reference

  • strict — exact sequence match required; same tools, same order
  • superset — actual trajectory must contain all expected tool calls (but may have extras)
  • subset — expected trajectory must contain all actual tool calls
  • unordered — same tool calls required, but order doesn't matter

For tool_args_match_mode, the options are exact (arguments must match precisely), subset (actual args must be a subset of expected), superset (actual args must be a superset), and ignore (arguments aren't compared at all).
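To make the argument modes concrete, here's a toy illustration of the matching semantics — this is not agentevals' implementation, just the set logic the modes describe:

```python
def args_match(actual: dict, expected: dict, mode: str = "exact") -> bool:
    """Toy version of the four argument-matching modes."""
    if mode == "ignore":
        return True
    if mode == "exact":
        return actual == expected
    if mode == "subset":    # every actual arg must also appear in expected
        return all(expected.get(k) == v for k, v in actual.items())
    if mode == "superset":  # every expected arg must also appear in actual
        return all(actual.get(k) == v for k, v in expected.items())
    raise ValueError(f"unknown mode: {mode}")

expected = {"origin": "NYC", "destination": "LAX", "date": "2026-03-25"}
actual = {"origin": "NYC", "destination": "LAX"}  # agent omitted the date

assert args_match(actual, expected, "subset")        # omission tolerated
assert not args_match(actual, expected, "exact")     # strict comparison fails
assert not args_match(actual, expected, "superset")  # missing an expected arg
```

Subset matching is a sensible default when tools have optional parameters; exact matching is better when a wrong or missing argument would change the tool's behavior.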

When to Use Each Approach

Use the LLM judge when your agent has multiple valid paths to a solution and you want semantic evaluation of whether the trajectory was reasonable. Use deterministic matching when there's exactly one correct path (or a small set of acceptable variations) and you need fast, reproducible, cost-free evaluation.

In practice, most teams I've seen use deterministic matching in CI/CD for speed and cost savings, and reserve LLM-judged evaluation for periodic deeper assessments. It's a pragmatic split that works well.

Building a Golden Dataset for Agent Evaluation

A golden dataset for agent evaluation is structurally different from one used for LLM evaluation. Instead of just input-output pairs, each entry must include the expected trajectory — the sequence of tool calls the agent should make. This extra dimension is what makes agent evals both more powerful and more work to maintain.

Structure of an Agent Golden Dataset

import json
from dataclasses import dataclass, asdict

@dataclass
class AgentEvalCase:
    """A single evaluation case for an AI agent."""
    case_id: str
    input_query: str
    expected_trajectory: list[dict]  # ordered tool calls
    expected_output: str
    tags: list[str]  # e.g., ["booking", "multi-step", "error-handling"]
    acceptable_trajectory_modes: str  # "strict", "superset", "unordered"

# Define evaluation cases
eval_cases = [
    AgentEvalCase(
        case_id="booking-001",
        input_query="Book a flight from NYC to LAX on March 25",
        expected_trajectory=[
            {"tool": "search_flights", "args": {"origin": "NYC", "destination": "LAX", "date": "2026-03-25"}},
            {"tool": "book_flight", "args": {"flight_id": "FL123", "passenger": "user"}},
        ],
        expected_output="Flight booked with confirmation number",
        tags=["booking", "two-step"],
        acceptable_trajectory_modes="strict",
    ),
    AgentEvalCase(
        case_id="booking-002",
        input_query="Find the cheapest flight from SFO to ORD next Friday and book it",
        expected_trajectory=[
            {"tool": "search_flights", "args": {"origin": "SFO", "destination": "ORD"}},
            {"tool": "compare_prices", "args": {}},
            {"tool": "book_flight", "args": {"passenger": "user"}},
        ],
        expected_output="Cheapest flight booked with confirmation",
        tags=["booking", "multi-step", "comparison"],
        acceptable_trajectory_modes="superset",
    ),
]

# Save as versioned JSON
def save_eval_dataset(cases: list[AgentEvalCase], path: str):
    data = {
        "version": "1.2.0",
        "created": "2026-03-12",
        "cases": [asdict(c) for c in cases],
    }
    with open(path, "w") as f:
        json.dump(data, f, indent=2)

save_eval_dataset(eval_cases, "eval_datasets/agent_evals_v1.2.0.json")

Capturing Production Traces as Evaluation Data

Here's something that took me a while to appreciate: the most valuable golden dataset entries come from real production executions. When your agent handles a request successfully (confirmed by user feedback or downstream validation), capture that trace and add it to your evaluation set. This creates a growing corpus of real-world test cases that reflect actual usage patterns rather than synthetic scenarios you imagined at your desk.

def capture_trace_as_eval_case(trace, user_query: str, case_id: str) -> AgentEvalCase:
    """Convert a production trace into an evaluation case."""
    trajectory = [
        {"tool": step.tool_name, "args": step.tool_args}
        for step in trace.steps
        if step.step_type == "tool_call"
    ]
    return AgentEvalCase(
        case_id=case_id,
        input_query=user_query,
        expected_trajectory=trajectory,
        expected_output=trace.final_output,
        tags=["production-captured"],
        acceptable_trajectory_modes="superset",
    )

Version Control and Non-Determinism

Store your evaluation datasets in version control alongside your agent code. Use semantic versioning for datasets — bump the major version when you add new tool capabilities, minor for new cases, patch for corrections. Keep the dataset in a dedicated eval_datasets/ directory.

Because agents are non-deterministic, a single run isn't a reliable signal. Run each evaluation case multiple times (typically 3-5 runs) and average the scores. A case that passes 4 out of 5 times is meaningfully different from one that passes 1 out of 5, and both are different from a consistent pass or fail.

import statistics

def evaluate_with_averaging(agent_fn, eval_case, evaluator, num_runs=5):
    """Run evaluation multiple times and return averaged score."""
    scores = []
    for _ in range(num_runs):
        # agent_fn is assumed to return the agent's message trajectory
        actual_trajectory = agent_fn(eval_case.input_query)
        result = evaluator(
            outputs=actual_trajectory,
            reference_outputs=eval_case.expected_trajectory,
        )
        scores.append(result["score"])

    return {
        "case_id": eval_case.case_id,
        "mean_score": statistics.mean(scores),
        "min_score": min(scores),
        "max_score": max(scores),
        "std_dev": statistics.stdev(scores) if len(scores) > 1 else 0,
        "pass_rate": sum(1 for s in scores if s >= 0.7) / len(scores),
    }

Integrating Agent Evals into CI/CD

Agent evaluations should run automatically as part of your deployment pipeline. The key challenge — and this trips up a lot of teams — is separating fast deterministic tests from slower, costlier LLM-judged evaluations.

Running Agent Evals with pytest

Structure your tests to separate deterministic trajectory checks from LLM-judged evaluations using pytest markers:

# tests/test_agent_evals.py
import pytest
import json
from your_agent import travel_booking_agent
from agentevals.trajectory.match import create_trajectory_match_evaluator
from agentevals.trajectory.llm import create_trajectory_llm_as_judge

# Note: register the "deterministic" and "llm_judged" markers in pytest.ini
# or pyproject.toml so pytest doesn't warn about unknown marks.

def load_eval_cases(path="eval_datasets/agent_evals_v1.2.0.json"):
    with open(path) as f:
        data = json.load(f)
    return data["cases"]

EVAL_CASES = load_eval_cases()

@pytest.mark.parametrize("case", EVAL_CASES, ids=[c["case_id"] for c in EVAL_CASES])
@pytest.mark.deterministic
def test_trajectory_deterministic(case):
    """Fast, cost-free trajectory matching."""
    evaluator = create_trajectory_match_evaluator(
        trajectory_match_mode=case["acceptable_trajectory_modes"],
        tool_args_match_mode="subset",
    )
    # travel_booking_agent is assumed to return its message trajectory here
    actual = travel_booking_agent(case["input_query"])
    result = evaluator(outputs=actual, reference_outputs=case["expected_trajectory"])
    assert result["score"], f"Trajectory mismatch for {case['case_id']}"

@pytest.mark.parametrize("case", EVAL_CASES, ids=[c["case_id"] for c in EVAL_CASES])
@pytest.mark.llm_judged
def test_trajectory_llm_judge(case):
    """LLM-judged trajectory quality — more expensive, run less frequently."""
    judge = create_trajectory_llm_as_judge(model="openai:o3-mini")
    actual = travel_booking_agent(case["input_query"])
    result = judge(outputs=actual, reference_outputs=case["expected_trajectory"])
    assert result["score"], f"LLM judge failed for {case['case_id']}"

GitHub Actions Workflow

Use separate workflows for deterministic and LLM-judged evaluations. Deterministic tests run on every push; LLM-judged tests run on a schedule or on pull requests to main.

# .github/workflows/agent-evals.yml
name: Agent Evaluations

on:
  push:
    branches: [main, "feature/**"]
  pull_request:
    branches: [main]
  schedule:
    - cron: "0 6 * * 1-5"  # weekdays at 6 AM UTC

jobs:
  deterministic-evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements-eval.txt
      - name: Run deterministic trajectory tests
        run: pytest tests/test_agent_evals.py -m deterministic --tb=short -q
        env:
          AGENT_ENV: ci

  llm-judged-evals:
    if: github.event_name == 'pull_request' || github.event_name == 'schedule'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements-eval.txt
      - name: Run LLM-judged trajectory tests
        run: pytest tests/test_agent_evals.py -m llm_judged --tb=short -q
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Upload eval results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: eval_results/

Quality Gates and Flaky Test Management

Set quality gates that block deployment when evaluation scores drop below thresholds. For deterministic tests, use a strict pass/fail gate. For LLM-judged tests, use a score threshold with a tolerance band — for example, block if the average score drops below 0.75 across all cases, but allow individual cases to score as low as 0.6.
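That two-tier gate is simple to express in code. Here's a sketch using the thresholds above (a 0.75 average floor and a 0.6 per-case floor — both values you'd tune for your own suite):

```python
def passes_quality_gate(case_scores: dict[str, float],
                        mean_floor: float = 0.75,
                        case_floor: float = 0.6) -> bool:
    """Block deployment if the average score drops below mean_floor
    or any individual case falls below case_floor."""
    scores = list(case_scores.values())
    mean = sum(scores) / len(scores)
    return mean >= mean_floor and min(scores) >= case_floor

assert passes_quality_gate({"booking-001": 0.9, "booking-002": 0.65})      # mean 0.775
assert not passes_quality_gate({"booking-001": 0.9, "booking-002": 0.55})  # case < 0.6
assert not passes_quality_gate({"booking-001": 0.7, "booking-002": 0.7})   # mean < 0.75
```

In CI, you'd run this over the averaged scores from your multi-run evaluation and fail the job when it returns False.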

Handle flaky tests by running each case multiple times and using the pass rate rather than a single result. A case that passes 4 out of 5 runs with a mean score above 0.8 shouldn't block your pipeline.

For cost management, run the full LLM-judged evaluation suite on a nightly schedule rather than on every commit. On pull requests, run only the subset of evaluations relevant to the changed code paths. Tag your evaluation cases with component labels so you can selectively run subsets — this alone can cut your eval costs by 60-70% without sacrificing meaningful coverage.

Production Monitoring: From Batch Evals to Live Agent Assessment

Batch evaluation during CI/CD catches regressions before deployment. But agents can degrade in production due to changes in upstream APIs, shifts in user query patterns, or model provider updates. You need continuous evaluation of live agent behavior — and this is where a lot of teams drop the ball.

Referenceless Metrics for Production

In production, you don't have a golden dataset for every query. Use referenceless metrics that evaluate agent behavior without expected outputs. DeepEval's TaskCompletionMetric can assess whether the agent appears to have completed the user's request based solely on the input and the agent's actions. StepEfficiencyMetric can flag agents that take an unusually high number of steps compared to historical baselines for similar queries.

Detecting Quality Drift

Track evaluation scores over time as a time series. Compute rolling averages (7-day windows work well) and alert when the average drops below a threshold or when the rate of change exceeds a defined limit. Common signals that indicate drift include increasing average step counts, rising tool error rates, and declining task completion scores.
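A minimal version of that rolling-average alert, assuming you already export one aggregate evaluation score per day:

```python
from collections import deque

def rolling_mean_alerts(daily_scores: list[float], window: int = 7,
                        floor: float = 0.75) -> list[int]:
    """Return the day indices where the rolling mean fell below the floor."""
    buf = deque(maxlen=window)
    alerts = []
    for day, score in enumerate(daily_scores):
        buf.append(score)
        # Only alert once a full window of history is available
        if len(buf) == window and sum(buf) / window < floor:
            alerts.append(day)
    return alerts

# Scores degrade slowly — the alert fires once the 7-day average dips
scores = [0.9, 0.85, 0.8, 0.78, 0.75, 0.72, 0.7, 0.68, 0.66, 0.64]
print(rolling_mean_alerts(scores))  # → [8, 9]
```

The same structure works for the other drift signals: feed it step counts or tool error rates instead of scores and invert the comparison.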

import deepeval

# Log each agent interaction for continuous evaluation
deepeval.monitor(
    event_name="travel_booking",
    input=user_query,
    response=agent_response,
    model="gpt-4o",
    additional_data={
        "tools_called": trace.tools_called,
        "step_count": trace.step_count,
        "latency_ms": trace.latency_ms,
    },
)

Alerting and Async Evaluation

Set up alerts based on evaluation score trends rather than individual scores. A single low score is noise; a downward trend over days is a real signal. Confident AI's platform supports asynchronous production evaluation — it captures agent traces in real time and runs evaluation metrics in the background without adding latency to the agent's response path.

This means you can run the same rigorous metrics you use in CI/CD on every production interaction, at the cost of slightly delayed feedback. It's a tradeoff most teams are happy to make.

Frequently Asked Questions

What is the difference between LLM evaluation and AI agent evaluation?

LLM evaluation assesses the quality of a model's text output — whether it's relevant, faithful to context, and free of hallucinations. Agent evaluation goes further by assessing the process: did the agent select the correct tools, call them in the right order, pass correct arguments, and complete the task efficiently? An agent can produce a perfect final answer through an incorrect process, which LLM evaluation would miss but agent evaluation would catch. Agent evaluation requires trajectory data (the sequence of tool calls and decisions) in addition to the final output.

How do I test AI agent tool calling accuracy?

Use DeepEval's ToolCorrectnessMetric and ArgumentCorrectnessMetric to evaluate whether the agent selected the right tools and passed correct arguments. For deterministic checks, use LangChain's create_trajectory_match_evaluator with tool_args_match_mode="exact" to compare actual tool calls against expected ones. Capture the agent's execution trace using decorators or middleware, then compare the recorded tool calls against your golden dataset of expected trajectories.

Can I run AI agent evaluations in CI/CD pipelines?

Absolutely. Structure your evaluations as pytest test cases and integrate them into GitHub Actions or your CI/CD platform of choice. The key is separating deterministic trajectory matching tests (fast, free, run on every push) from LLM-judged evaluations (slower, costlier, run on PRs or on a schedule). Use pytest markers to control which tests run in which context, and set quality gates that block deployment when scores drop below configured thresholds.

What metrics should I use for AI agent evaluation?

At minimum, use TaskCompletionMetric (did the agent finish the job?), ToolCorrectnessMetric (did it use the right tools?), and trajectory matching (did it follow an acceptable path?). For deeper evaluation, add StepEfficiencyMetric (did it avoid unnecessary steps?) and ArgumentCorrectnessMetric (did it pass correct parameters to tools?). In production, use referenceless versions of these metrics that don't require a golden dataset for every query.

How do I handle non-deterministic behavior in agent testing?

Run each evaluation case multiple times (3-5 runs is a practical minimum) and use the average score and pass rate rather than any single result. Set thresholds based on pass rates — for example, require a case to pass at least 80% of runs. Use statistics.stdev to track variance: high variance cases indicate unstable agent behavior that may need prompt engineering or architectural changes. For CI/CD, use the mean score across runs and set a slightly lower threshold than you would for a single deterministic test to account for natural variance.

About the Author

Editorial Team — our team of expert writers and editors.