The Evaluation Gap in Production LLM Systems
You've built your RAG pipeline. You've wired up your multi-agent system. The demo looks fantastic. Then you deploy to production, and within a week, your support queue fills up with reports of hallucinated answers, irrelevant responses, and confidently wrong outputs that your users trusted because the AI delivered them with absolute certainty.
Sound familiar? You're not alone.
In 2026, studies show that up to 40% of organizations deploying LLM-powered applications encounter significant quality regressions within the first 90 days of production. The root cause isn't bad models or flawed architectures — it's the absence of systematic evaluation. Teams ship LLM applications the way we shipped software in the early 2000s: manually test a few cases, eyeball the outputs, and hope for the best.
That approach worked for deterministic software. It fails catastrophically for probabilistic systems.
LLM evaluation — or "evals," as practitioners call it — is the discipline of systematically measuring whether your AI system produces correct, relevant, safe, and useful outputs. It's the testing layer that traditional software engineering takes for granted but that most AI teams are still building from scratch. The good news? In 2026, the tooling has finally matured enough to make rigorous evaluation accessible, automated, and integrated into your existing CI/CD workflows.
This article is a practitioner's guide to building production-grade LLM evaluation pipelines. We'll cover the core evaluation metrics and when to use each, walk through hands-on implementations with DeepEval and custom evaluators, build golden datasets that actually represent your users, integrate evaluations into GitHub Actions for automated regression testing, and set up production monitoring that catches quality drift before your users notice. So, let's dive in.
Understanding LLM Evaluation Metrics
Before you can evaluate anything, you need to know what you're measuring. LLM evaluation metrics fall into several categories, and choosing the right combination depends entirely on your application type. Let me break down the metrics that matter most in production.
Answer Relevancy
Answer relevancy measures whether the LLM's response actually addresses the user's question. This sounds obvious, but it's surprisingly common for models to generate well-written, grammatically perfect responses that drift completely off-topic or address a related but different question entirely.
Here's a classic example: a user asks "How do I reset my password?" and the model responds with a detailed explanation of password security best practices. Technically related. Practically useless.
Answer relevancy is typically evaluated using an LLM-as-a-judge approach — a second model reads the input and output, then scores how well the output addresses the input. Scores above 0.7 are generally considered acceptable, though the threshold depends on your application's tolerance for tangential responses.
Faithfulness (Groundedness)
Faithfulness measures whether the LLM's claims are supported by the provided context. This is the anti-hallucination metric, and honestly, it's arguably the single most important metric for RAG systems. If your retriever surfaces three documents about product pricing and the model invents a fourth pricing tier that doesn't exist in any of them, faithfulness catches that.
The evaluation works by decomposing the output into individual claims, then checking each claim against the context. A faithfulness score of 0.85 means 85% of the claims in the output are directly supported by the provided documents. For customer-facing applications, you typically want this above 0.9.
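The aggregation step is simple once the claims are extracted. A minimal sketch, assuming a judge model has already produced a per-claim supported/unsupported verdict (the verdict list below is illustrative):

```python
def faithfulness_score(claim_verdicts: list[bool]) -> float:
    """Fraction of extracted claims that are supported by the context.

    In practice a judge LLM extracts the claims and produces the
    per-claim verdicts; this function just aggregates them.
    """
    if not claim_verdicts:
        return 1.0  # nothing claimed, nothing to contradict
    return sum(claim_verdicts) / len(claim_verdicts)

# 17 of 20 claims supported -> a faithfulness score of 0.85
verdicts = [True] * 17 + [False] * 3
score = faithfulness_score(verdicts)
```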
Hallucination
While faithfulness checks claims against provided context, the hallucination metric takes a broader view. It evaluates whether the output contains fabricated facts, invented citations, made-up statistics, or other information that the model generated from its parametric knowledge (or simply made up) rather than from the provided context.
In production systems, hallucination is the metric that keeps legal and compliance teams up at night — and rightly so.
Contextual Relevancy and Recall
These metrics evaluate the retrieval component of RAG systems rather than the generation component. Contextual relevancy measures whether the retrieved documents are actually relevant to the query (basically: are you retrieving noise?), while contextual recall measures whether the retrieval captured all the relevant information available in your knowledge base (are you missing important documents?).
Together, they tell you whether your retrieval pipeline is doing its job. High relevancy but low recall means your retriever is precise but misses relevant documents. Low relevancy but high recall means you're retrieving everything, including a lot of noise. You want both above 0.8 for most production applications.
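Under the hood these are just precision and recall over the retrieved set. A toy sketch using document IDs, assuming you know the relevant set for a query (e.g. from a golden dataset annotation):

```python
def contextual_relevancy(retrieved: list[str], relevant: set[str]) -> float:
    """Precision-like: fraction of retrieved docs that are relevant."""
    if not retrieved:
        return 0.0
    return sum(doc in relevant for doc in retrieved) / len(retrieved)

def contextual_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Recall-like: fraction of relevant docs that were retrieved."""
    if not relevant:
        return 1.0
    return sum(doc in relevant for doc in set(retrieved)) / len(relevant)

# Retrieved 4 docs, only 2 of the 3 relevant ones among them:
# relevancy = 2/4, recall = 2/3 -- precise-ish, but missing a document
retrieved = ["doc_a", "doc_b", "doc_c", "doc_d"]
relevant = {"doc_a", "doc_b", "doc_e"}
```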
Task Completion and Correctness
For agent-based systems, the ultimate metric is whether the agent actually completed the task correctly. Did the code generation agent produce code that runs? Did the data analysis agent arrive at the correct conclusion? Did the customer service agent resolve the ticket?
Task completion is domain-specific and usually requires custom evaluation logic — there's no one-size-fits-all metric here.
Building Your First Evaluation Suite with DeepEval
DeepEval has emerged as the go-to open-source framework for LLM evaluation, largely because it maps directly onto patterns that software engineers already understand. If you've written pytest tests before, you'll feel right at home. Let's build a comprehensive evaluation suite from scratch.
Installation and Setup
Start by installing DeepEval and setting up your evaluation environment:
pip install deepeval
# Set your OpenAI API key for LLM-as-judge evaluations
export OPENAI_API_KEY="your-api-key"
# Optional: login to Confident AI for dashboard and tracking
deepeval login
Writing Your First Evaluation Test
DeepEval's core abstraction is the test case — a structured representation of an LLM interaction that includes the input, the actual output, and (optionally) the expected output and context. Here's a complete example evaluating a RAG-based Q&A system:
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
AnswerRelevancyMetric,
FaithfulnessMetric,
HallucinationMetric,
ContextualRelevancyMetric,
)
# Define your metrics with thresholds
answer_relevancy = AnswerRelevancyMetric(
threshold=0.7,
model="gpt-4o", # The judge model
include_reason=True # Get explanations for scores
)
faithfulness = FaithfulnessMetric(
threshold=0.85,
model="gpt-4o",
include_reason=True
)
hallucination = HallucinationMetric(
threshold=0.5, # Lower = fewer hallucinations allowed
model="gpt-4o",
include_reason=True
)
contextual_relevancy = ContextualRelevancyMetric(
threshold=0.7,
model="gpt-4o",
include_reason=True
)
def test_rag_qa_pricing_question():
"""Test that our RAG system correctly answers pricing queries."""
test_case = LLMTestCase(
input="What is the pricing for the Pro plan?",
actual_output="The Pro plan costs $49 per month and includes "
"unlimited API calls, priority support, and "
"advanced analytics.",
expected_output="The Pro plan is $49/month with unlimited "
"API calls, priority support, and analytics.",
retrieval_context=[
"Pro Plan: $49/month. Features include unlimited API "
"calls, priority support, and advanced analytics dashboard.",
"Enterprise Plan: Custom pricing. Contact sales for details.",
]
)
assert_test(test_case, [
answer_relevancy,
faithfulness,
hallucination,
contextual_relevancy,
])
def test_rag_qa_refund_policy():
"""Test handling of refund policy questions."""
test_case = LLMTestCase(
input="What is your refund policy?",
actual_output="We offer a 30-day money-back guarantee on all "
"plans. To request a refund, contact support "
"with your order ID.",
retrieval_context=[
"Refund Policy: All plans come with a 30-day money-back "
"guarantee. Customers must contact [email protected] "
"with their order ID to initiate the refund process.",
]
)
assert_test(test_case, [faithfulness, hallucination])
Run these tests exactly as you would run pytest:
# Run all evaluation tests
deepeval test run tests/test_evaluations.py
# Run with verbose output showing metric scores
deepeval test run tests/test_evaluations.py -v
# Run specific tests
deepeval test run tests/test_evaluations.py::test_rag_qa_pricing_question
Each test will report pass/fail along with the individual metric scores and — critically — the reasoning behind each score. That reasoning is what makes LLM-as-judge evaluations actionable: when a test fails, you know why it failed, not just that it did.
Evaluating Against a Dataset
Individual test cases are useful for targeted regression tests, but real evaluation requires running against a dataset. DeepEval's EvaluationDataset lets you manage collections of test cases:
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase
from deepeval import evaluate
# Create a dataset from test cases
dataset = EvaluationDataset(test_cases=[
LLMTestCase(
input="How do I upgrade my plan?",
actual_output=my_rag_system.query("How do I upgrade my plan?"),
retrieval_context=my_rag_system.get_context(
"How do I upgrade my plan?"
),
),
LLMTestCase(
input="Can I cancel anytime?",
actual_output=my_rag_system.query("Can I cancel anytime?"),
retrieval_context=my_rag_system.get_context(
"Can I cancel anytime?"
),
),
# ... more test cases
])
# Run evaluation across all test cases and metrics
results = evaluate(
    test_cases=dataset.test_cases,
    metrics=[
        answer_relevancy,
        faithfulness,
        hallucination,
    ]
)

# Print aggregate results
passed = sum(1 for r in results.test_results if r.success)
total = len(results.test_results)
print(f"Total tests: {total}")
print(f"Passed: {passed}")
print(f"Failed: {total - passed}")
print(f"Pass rate: {passed / total:.0%}")
The LLM-as-Judge Pattern: Building Custom Evaluators
While DeepEval's built-in metrics cover the most common evaluation scenarios, production systems often need custom evaluators tailored to domain-specific quality criteria. This is where the LLM-as-judge pattern really shines — and where you need to understand both its power and its limitations.
How LLM-as-Judge Works
The concept is straightforward: you use a (typically more capable) LLM to evaluate the output of the LLM in your application. The judge model receives the input, the output, any context, and a rubric describing what "good" looks like. It then produces a score and a reasoning chain explaining that score.
The key insight here is that judging quality is easier than producing quality. A model that might struggle to write a perfect technical explanation can still reliably distinguish between a good explanation and a bad one. This asymmetry is what makes the whole approach work.
Building a Custom Evaluator
Here's how to build a custom evaluator for, say, measuring the technical accuracy of code explanations. DeepEval's GEval metric makes this surprisingly straightforward:
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams
# Custom metric using G-Eval for technical accuracy
technical_accuracy = GEval(
name="Technical Accuracy",
criteria=(
"Evaluate whether the AI's response contains technically "
"accurate information about programming concepts, APIs, "
"and software engineering practices. The response should "
"not contain incorrect syntax, deprecated methods, or "
"misleading architectural advice."
),
evaluation_params=[
LLMTestCaseParams.INPUT,
LLMTestCaseParams.ACTUAL_OUTPUT,
],
evaluation_steps=[
"Identify all technical claims in the response.",
"Verify each claim against known programming standards "
"and best practices.",
"Check code snippets for syntactic correctness.",
"Assess whether the recommended approach follows "
"current industry conventions.",
"Penalize outdated or deprecated suggestions.",
],
threshold=0.7,
model="gpt-4o",
)
# Custom metric for response completeness
completeness = GEval(
name="Response Completeness",
criteria=(
"Evaluate whether the AI's response fully addresses all "
"aspects of the user's question. A complete response "
"should cover the main question, relevant edge cases, "
"and provide actionable next steps."
),
evaluation_params=[
LLMTestCaseParams.INPUT,
LLMTestCaseParams.ACTUAL_OUTPUT,
LLMTestCaseParams.EXPECTED_OUTPUT,
],
evaluation_steps=[
"List all sub-questions or aspects of the user's query.",
"Check if each aspect is addressed in the response.",
"Compare coverage against the expected output.",
"Evaluate whether actionable steps are provided.",
],
threshold=0.6,
model="gpt-4o",
)
Building a Raw LLM-as-Judge Without Frameworks
Sometimes you need full control over the evaluation logic — maybe the framework doesn't support your exact use case, or you want tighter integration with your existing codebase. Here's a framework-free implementation using structured outputs to ensure reliable scoring:
from openai import OpenAI
from pydantic import BaseModel, Field
client = OpenAI()
class EvaluationResult(BaseModel):
"""Structured evaluation output."""
score: float = Field(
ge=0.0, le=1.0,
description="Quality score from 0.0 (worst) to 1.0 (best)"
)
reasoning: str = Field(
description="Step-by-step reasoning for the score"
)
issues: list[str] = Field(
default_factory=list,
description="Specific issues found in the response"
)
passed: bool = Field(
description="Whether the response meets quality threshold"
)
def evaluate_response(
user_input: str,
llm_output: str,
context: list[str] | None = None,
threshold: float = 0.7,
) -> EvaluationResult:
"""Evaluate an LLM response using a judge model."""
context_block = ""
if context:
formatted = "\n".join(
f"- {doc}" for doc in context
)
context_block = (
f"\n\nProvided Context:\n{formatted}"
)
evaluation_prompt = f"""Evaluate the following AI response
for quality, accuracy, and helpfulness.
User Question: {user_input}
{context_block}
AI Response: {llm_output}
Evaluate on these criteria:
1. Accuracy: Are all claims factually correct?
2. Relevancy: Does the response address the question?
3. Completeness: Are all aspects of the question covered?
4. Clarity: Is the response well-structured and clear?
5. Faithfulness: If context is provided, does the response
stay grounded in the provided information?
A score of {threshold} or above means the response passes."""
    completion = client.chat.completions.parse(
model="gpt-4o",
messages=[
{
"role": "system",
"content": "You are an expert AI response evaluator. "
"Be rigorous but fair in your assessment."
},
{"role": "user", "content": evaluation_prompt}
],
response_format=EvaluationResult,
)
return completion.choices[0].message.parsed
# Usage
result = evaluate_response(
user_input="How do I implement retry logic in Python?",
llm_output="Use the tenacity library with @retry decorator.",
threshold=0.7,
)
print(f"Score: {result.score}")
print(f"Passed: {result.passed}")
print(f"Reasoning: {result.reasoning}")
for issue in result.issues:
print(f" Issue: {issue}")
Pitfalls of LLM-as-Judge
Before you go all-in on LLM-as-judge, it's worth understanding its limitations. I've seen teams get burned by ignoring these:
- Position bias: Judge models tend to favor the first response when comparing two outputs. Mitigate this by randomizing the order in A/B evaluations.
- Verbosity bias: Longer responses often score higher, even when a concise answer would be more appropriate. Add explicit instructions to the rubric that brevity shouldn't be penalized.
- Self-bias: Models rate outputs from their own model family higher. Use a different model family for judging than the one producing the outputs when feasible.
- Inconsistency: The same input can produce different scores across runs. Run evaluations multiple times and use majority voting or averaging. DeepEval handles this automatically with its self-consistency features.
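Two of these mitigations, order randomization and score averaging, are easy to roll yourself if your framework doesn't provide them. A sketch, where `judge_fn` is a hypothetical pairwise judge returning the first response's score on a 0-1 preference scale (so the second response implicitly scores the complement):

```python
import random
import statistics

def judge_pairwise(judge_fn, input_text: str, a: str, b: str,
                   runs: int = 5) -> float:
    """Score output `a` against `b` while mitigating position bias
    and run-to-run inconsistency.

    Assumes judge_fn(input, first, second) returns the score of the
    *first* response on a 0-1 preference scale. We randomize which
    response goes first, normalize back to a's score, and average
    over several runs.
    """
    scores = []
    for _ in range(runs):
        if random.random() < 0.5:
            scores.append(judge_fn(input_text, a, b))
        else:
            # b went first, so a's score is the complement
            scores.append(1.0 - judge_fn(input_text, b, a))
    return statistics.mean(scores)
```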
Creating Golden Datasets That Actually Work
A golden dataset is your source of truth — a curated collection of inputs, expected outputs, and contexts that represent how your system should behave. Here's the thing: the quality of your evaluations is only as good as the quality of your golden dataset.
Most teams get this wrong by either creating too few examples that don't cover their problem space, or by generating synthetic examples that don't reflect real user behavior. I've been guilty of both.
Sources for Golden Dataset Examples
The best golden datasets combine examples from multiple sources:
- Production logs: Extract real user queries from your production system. These represent actual usage patterns, including the misspellings, ambiguous phrasing, and edge cases that synthetic examples miss. Aim for diversity across user segments, query complexity, and topic areas.
- Expert-authored "must-pass" cases: Have domain experts write test cases for critical scenarios — the questions your system absolutely must get right. These should include explicit acceptance criteria: "The response MUST mention the 30-day return window and MUST NOT suggest contacting a department that doesn't exist."
- Known failure cases: Every production bug is a potential golden dataset entry. When users report bad responses, turn those into test cases with correct expected outputs. Over time, this builds a regression suite that prevents the same failures from recurring.
- Adversarial examples: Intentionally craft inputs designed to trip up your system — prompt injection attempts, out-of-scope questions, queries requiring information that isn't in your knowledge base. Your system should fail gracefully on these, and your golden dataset should verify that it does.
Dataset Size and Structure
How many examples do you actually need? As a rule of thumb, 100 to 200 high-quality examples provide enough statistical power for most applications. That said, quality matters far more than quantity — fifty expertly curated examples with clear acceptance criteria will tell you more than 500 auto-generated examples with vague expected outputs.
Structure your dataset as a JSON or CSV file that can be version-controlled alongside your code:
import json
from deepeval.dataset import EvaluationDataset, Golden
# Load golden dataset from file
def load_golden_dataset(path: str) -> EvaluationDataset:
"""Load a golden dataset from a JSON file."""
with open(path) as f:
data = json.load(f)
goldens = []
for entry in data["test_cases"]:
goldens.append(Golden(
input=entry["input"],
expected_output=entry.get("expected_output"),
context=entry.get("context"),
additional_metadata={
"category": entry.get("category", "general"),
"priority": entry.get("priority", "normal"),
"source": entry.get("source", "manual"),
}
))
return EvaluationDataset(goldens=goldens)
# Example golden dataset JSON structure
golden_dataset_template = {
"version": "1.2",
"last_updated": "2026-02-13",
"test_cases": [
{
"input": "What happens if I exceed my API rate limit?",
"expected_output": "When you exceed your rate limit, "
"the API returns a 429 status code. You should "
"implement exponential backoff...",
"context": [
"Rate Limiting: Free tier: 100 requests/minute. "
"Pro tier: 1000 requests/minute. Exceeding the "
"limit returns HTTP 429. Implement exponential "
"backoff with a maximum of 3 retries."
],
"category": "api_usage",
"priority": "critical",
"source": "expert_authored"
},
{
"input": "can u help me with the thng that broke",
"expected_output": None,
"context": [],
"category": "ambiguous_queries",
"priority": "normal",
"source": "production_logs"
}
]
}
Versioning and Maintenance
Golden datasets aren't static. Your product changes, your knowledge base grows, and new failure modes emerge. Treat your golden dataset like code:
- Store it in version control (Git) alongside your evaluation tests.
- Tag dataset versions to match prompt or model versions.
- Review and update quarterly — stale test cases create false confidence.
- Add every production failure as a new test case within 48 hours of discovery.
- Remove test cases that no longer reflect valid product behavior.
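The "48 hours" rule is easier to follow when adding a case is cheap. A sketch that appends a production failure to the JSON structure shown earlier (the field names follow that template; adapt them to your schema):

```python
import json
from datetime import date

def add_failure_to_golden_dataset(path: str, user_input: str,
                                  expected_output: str,
                                  context: list[str]) -> None:
    """Append a production failure as a new golden test case.

    Assumes the JSON structure from the template above; the category
    and priority values are illustrative.
    """
    with open(path) as f:
        data = json.load(f)
    data["test_cases"].append({
        "input": user_input,
        "expected_output": expected_output,
        "context": context,
        "category": "regression",
        "priority": "critical",
        "source": "production_failure",
    })
    data["last_updated"] = date.today().isoformat()
    with open(path, "w") as f:
        json.dump(data, f, indent=2)
```

Because the dataset lives in Git, the resulting diff doubles as a review artifact: the fix for the bug and the test that guards against its recurrence land in the same PR.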
Integrating LLM Evals into CI/CD Pipelines
The real payoff comes when evaluations run automatically on every code change. A developer modifies a prompt template, pushes a commit, and within minutes, the CI pipeline tells them whether the change improved or degraded the system's quality. No manual testing. No "it looked fine when I tried a few queries." Just data.
GitHub Actions Workflow
Here's a production-ready GitHub Actions workflow that runs your LLM evaluations on every pull request:
# .github/workflows/llm-evals.yml
name: LLM Evaluation Pipeline
on:
pull_request:
paths:
- 'prompts/**'
- 'src/llm/**'
- 'tests/evals/**'
- 'golden_datasets/**'
push:
branches: [main]
jobs:
llm-evals:
runs-on: ubuntu-latest
timeout-minutes: 30
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.12'
- name: Install dependencies
run: |
pip install -r requirements.txt
pip install deepeval
- name: Run LLM evaluations
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
run: |
deepeval test run tests/evals/ -v --ignore-errors
- name: Upload evaluation report
if: always()
uses: actions/upload-artifact@v4
with:
name: eval-report
path: .deepeval/results/
- name: Comment PR with results
if: github.event_name == 'pull_request' && always()
uses: actions/github-script@v7
with:
script: |
const fs = require('fs');
const results = fs.readFileSync(
'.deepeval/results/latest.json', 'utf8'
);
const data = JSON.parse(results);
const body = [
'## LLM Evaluation Results',
'',
`**Pass Rate:** ${data.pass_rate}%`,
`**Tests Passed:** ${data.passed}/${data.total}`,
'',
'| Metric | Score | Threshold | Status |',
'|--------|-------|-----------|--------|',
...data.metrics.map(m =>
`| ${m.name} | ${m.score.toFixed(2)} `
+ `| ${m.threshold} `
+ `| ${m.passed ? 'Pass' : 'FAIL'} |`
),
].join('\n');
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body
});
Key Design Decisions for CI/CD Integration
Several design decisions make the difference between an evaluation pipeline that teams actually use and one that gets disabled after the first week:
Trigger selectively. Notice that the workflow only triggers on changes to prompts, LLM-related source code, evaluation tests, and golden datasets. Running LLM evals on every CSS change wastes both time and API credits. Be precise about what triggers evaluations.
Set meaningful timeouts. LLM evaluations involve multiple API calls and can take minutes. A 30-minute timeout prevents runaway costs while giving enough headroom for large evaluation suites. If your evals consistently take longer than 15 minutes, consider breaking them into separate jobs that run in parallel.
Use --ignore-errors wisely. The --ignore-errors flag in DeepEval prevents the entire test suite from aborting when a single test case encounters a transient API error. This is important for CI reliability, but make sure you're still reporting individual test failures.
Make results visible. The PR comment step is crucial. If evaluation results are buried in CI logs that nobody reads, the entire pipeline is theater. Put the results directly on the PR where reviewers will actually see them.
Blocking vs. Non-Blocking Evaluations
Not all evaluation failures should block a merge. Consider a tiered approach:
- Blocking (required): Hallucination rate below threshold, faithfulness above threshold, no prompt injection vulnerabilities detected. These are safety-critical and should prevent the PR from being merged.
- Non-blocking (advisory): Answer relevancy scores, response completeness, stylistic evaluations. These inform the review but don't automatically block deployment. Report them, but let humans decide whether a dip from 0.82 to 0.79 on answer relevancy is acceptable given the other improvements in the PR.
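A sketch of how the tiered gate might look in a CI script (the metric names and the blocking set are illustrative, not a fixed policy):

```python
# Safety-critical metrics whose failure should block the merge
BLOCKING = {"faithfulness", "hallucination"}

def gate(results: dict[str, bool]) -> tuple[bool, list[str]]:
    """Given metric name -> passed, return (merge_allowed, advisory
    warnings). Blocking failures veto the merge; advisory failures
    are surfaced for human review but don't block."""
    blocking_failures = [m for m, ok in results.items()
                         if not ok and m in BLOCKING]
    advisory_warnings = [m for m, ok in results.items()
                         if not ok and m not in BLOCKING]
    return (not blocking_failures, advisory_warnings)
```

In CI, the boolean drives the exit code while the warnings feed the PR comment, so reviewers see the relevancy dip without the pipeline going red.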
Production Monitoring and Observability
CI/CD evaluations tell you whether a change is safe to deploy. Production monitoring tells you whether the system continues to work once it's deployed. Both are essential, and they serve fundamentally different purposes.
What to Monitor
In a production LLM system, you need observability across three layers:
Operational metrics track system health: latency (P50, P95, P99), token usage and cost per query, error rates and API failures, and throughput (queries per second). These are table stakes — you'd monitor them for any production service.
Quality metrics track output quality using the same metrics from your evaluation suite, but now applied to live traffic: faithfulness scores on a sampled subset of production queries, hallucination detection rates, user feedback (thumbs up/down, explicit ratings), and response relevancy trends over time.
Behavioral metrics are the ones that often get overlooked, but they're incredibly valuable. They track how the system is being used and whether usage patterns are shifting: query category distribution (are users suddenly asking more questions about a topic you have poor coverage on?), retrieval hit rates (are users asking questions your knowledge base can't answer?), conversation abandonment rates (are users giving up mid-conversation?), and escalation rates (how often do users need to talk to a human?).
Setting Up Tracing with Langfuse
Langfuse is an open-source LLM observability platform that gives you detailed tracing, evaluation scoring, and metrics dashboards. Here's how to instrument your application:
from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context
from openai import OpenAI
# Initialize Langfuse client
langfuse = Langfuse(
public_key="pk-...",
secret_key="sk-...",
host="https://cloud.langfuse.com" # or self-hosted URL
)
client = OpenAI()
@observe(as_type="generation")
def generate_response(query: str, context: list[str]) -> str:
"""Generate a response with full observability tracing."""
context_text = "\n\n".join(context)
messages = [
{
"role": "system",
"content": f"Answer based on this context:\n{context_text}"
},
{"role": "user", "content": query}
]
response = client.chat.completions.create(
model="gpt-4o",
messages=messages,
temperature=0.1,
)
return response.choices[0].message.content
@observe()
def rag_pipeline(query: str) -> dict:
"""Full RAG pipeline with tracing at each step."""
    # Step 1: Retrieve documents (retrieve_documents is your own
    # vector-store lookup, defined elsewhere)
    langfuse_context.update_current_observation(
        name="retrieve_documents"
    )
    documents = retrieve_documents(query)
# Step 2: Generate response
response = generate_response(
query=query,
context=[doc.page_content for doc in documents]
)
# Step 3: Score the trace for online evaluation
langfuse_context.score_current_trace(
name="user_satisfaction",
value=1.0, # Updated later from user feedback
comment="Awaiting user feedback"
)
return {
"response": response,
"sources": [doc.metadata for doc in documents]
}
With Langfuse tracing in place, every LLM call is recorded with its full input, output, latency, token counts, and cost. You can then run online evaluations against production traces — the same metrics you use in CI/CD, now applied to real user interactions.
Building Alerting on Quality Metrics
Monitoring without alerting is just data collection. You need alerts that trigger when quality metrics drift beyond acceptable bounds:
from datetime import datetime, timedelta, timezone

def check_quality_metrics(langfuse_client: Langfuse):
    """Check production quality metrics and alert on degradation."""
    # Fetch traces from the last hour (timezone-aware; utcnow() is
    # deprecated as of Python 3.12)
    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(hours=1)
traces = langfuse_client.fetch_traces(
limit=500,
from_timestamp=start_time,
to_timestamp=end_time,
)
if not traces.data:
return
# Calculate aggregate metrics
scores = {}
for trace in traces.data:
for score in trace.scores:
if score.name not in scores:
scores[score.name] = []
scores[score.name].append(score.value)
# Define thresholds
alert_thresholds = {
"faithfulness": {"min": 0.85, "window": "1h"},
"hallucination": {"max": 0.15, "window": "1h"},
"answer_relevancy": {"min": 0.70, "window": "1h"},
}
alerts = []
for metric_name, threshold in alert_thresholds.items():
if metric_name in scores:
avg = sum(scores[metric_name]) / len(scores[metric_name])
if "min" in threshold and avg < threshold["min"]:
alerts.append(
f"ALERT: {metric_name} dropped to {avg:.2f} "
f"(threshold: {threshold['min']})"
)
if "max" in threshold and avg > threshold["max"]:
alerts.append(
f"ALERT: {metric_name} rose to {avg:.2f} "
f"(threshold: {threshold['max']})"
)
if alerts:
send_alerts(alerts) # Slack, PagerDuty, email, etc.
def send_alerts(alerts: list[str]):
"""Send alerts via your preferred notification channel."""
for alert in alerts:
print(alert)
# Integration with Slack, PagerDuty, etc.
Evaluating Agents and Multi-Step Workflows
Evaluating a single LLM call is relatively straightforward. Evaluating an agent that makes multiple calls, uses tools, maintains state, and produces outputs across several steps? That's a whole different ballgame.
The metrics we discussed earlier still apply, but they need to be applied at multiple levels.
Trajectory Evaluation
For agent systems, you care not just about the final output but about the path the agent took to get there. Did it use the right tools? Did it call them in a sensible order? Did it waste steps on unnecessary actions? Trajectory evaluation measures the quality of the agent's decision-making process.
from deepeval import assert_test
from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.metrics import ToolCorrectnessMetric

# Tool correctness compares the tools the agent actually called
# against the expected tools
tool_correctness = ToolCorrectnessMetric(
    threshold=0.8,
)
def test_agent_uses_correct_tools():
"""Verify the agent selects appropriate tools."""
test_case = LLMTestCase(
input="Find recent papers about RLHF and summarize them",
actual_output="Here are 3 recent papers on RLHF...",
tools_called=[
ToolCall(name="arxiv_search", input_parameters={
"query": "RLHF reinforcement learning human feedback",
"max_results": 5
}),
ToolCall(name="summarize", input_parameters={
"text": "...",
"style": "technical"
}),
],
expected_tools=[
ToolCall(name="arxiv_search"),
ToolCall(name="summarize"),
],
)
assert_test(test_case, [tool_correctness])
End-to-End vs. Component Evaluation
The most effective strategy evaluates both the components and the integrated system. Component-level evaluations test each piece in isolation — the retriever's precision, the generator's faithfulness, the router's classification accuracy. End-to-end evaluations test the full pipeline with real user queries and measure the final output quality.
Both are necessary. Component evaluations pinpoint where issues originate. End-to-end evaluations tell you whether those issues actually matter to the user. A retriever with 0.75 precision might be perfectly fine if the generator is robust enough to ignore irrelevant chunks. Or a generator with perfect faithfulness might still produce terrible outputs if the retriever is feeding it garbage. You need both lenses.
Building a Complete Evaluation Strategy
Bringing it all together, here's the evaluation strategy I'd recommend for any production LLM application. Think of it as three layers, each catching different types of failures at different stages of your development lifecycle.
Layer 1: Development-Time Evaluation
During development, run focused evaluations as you iterate on prompts and retrieval logic. Use a small, targeted golden dataset (20–50 examples) that covers your core use cases. The goal here is fast feedback — you should be able to run a development eval in under 2 minutes. Keep the evaluation suite lean and only add tests when you discover new failure modes.
Layer 2: CI/CD Quality Gates
On every pull request, run the full evaluation suite against your complete golden dataset (100–200+ examples). This is where you catch regressions before they reach production. Set blocking thresholds for safety-critical metrics (hallucination, faithfulness) and advisory thresholds for everything else. Generate a report that reviewers can read directly on the PR.
Layer 3: Production Monitoring
In production, continuously sample and evaluate live traffic. Score a representative subset (5–10%) of production queries using your evaluation metrics. Track quality metrics over time and alert when they drift. Feed production failures back into your golden dataset to prevent recurrence. Review evaluation results weekly and update thresholds as your system matures.
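For the 5-10% sampling itself, hash-based sampling is a simple, stateless way to decide which traces get scored: the decision is deterministic per trace ID, so the same trace is never double-scored and no shared counter is needed. A sketch (the trace ID format is whatever your tracing layer produces):

```python
import hashlib

def should_evaluate(trace_id: str, sample_rate: float = 0.10) -> bool:
    """Deterministically sample a fraction of production traces
    for online evaluation. Hashing the trace ID maps it to a
    uniform bucket in [0, 1); traces below the sample rate get
    scored, the rest are skipped."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```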
The Feedback Loop
The most important aspect of this strategy isn't any individual layer — it's the feedback loop between them. Production monitoring discovers new failure modes. Those failures become golden dataset entries. The updated golden dataset catches regressions in CI/CD. CI/CD prevents those failures from reaching production again.
Each failure makes the system permanently more robust. Over months, this virtuous cycle builds an evaluation suite that's deeply tailored to your specific application, your specific users, and your specific failure modes. It's one of those things that feels slow at first but compounds incredibly fast.
Key Takeaways
LLM evaluation in 2026 isn't optional — it's infrastructure. Here's what to remember:
- Choose metrics based on your application type. RAG systems need faithfulness and contextual relevancy. Agents need tool correctness and trajectory evaluation. Chatbots need answer relevancy and conversation coherence. Don't evaluate everything — evaluate what matters.
- Start with DeepEval and built-in metrics. They cover 80% of evaluation needs with minimal setup. Graduate to custom GEval metrics when you need domain-specific quality criteria.
- Build golden datasets from real data. Production logs, expert-authored must-pass cases, and known failures make far better test suites than synthetic examples. Start with 100 high-quality examples and grow from there.
- Integrate into CI/CD from day one. An evaluation suite that only runs manually is an evaluation suite that stops running. Automate it, make results visible on PRs, and set blocking thresholds for safety-critical metrics.
- Monitor production continuously. The system that passed all evals last week can degrade tomorrow due to data drift, context changes, or upstream model updates. Continuous monitoring with alerting closes the gap between "it worked in testing" and "it works in production."
- Feed failures back into tests. Every production bug is a free test case. Capture it, add it to your golden dataset, and make sure it never happens again. This feedback loop is what separates mature LLM applications from fragile demos.
The tooling exists. The patterns are proven. The only question is whether you'll build evaluation into your LLM application now — or after your users have already found the problems for you.