The Observability Gap in Production AI Systems
You've built your RAG pipeline. You've orchestrated your multi-agent system. You've tuned your prompts with DSPy and locked down guardrails. Everything works beautifully in staging. Then you deploy to production, and within days, you're completely flying blind.
A customer reports that the chatbot gave them totally wrong information about their account. Your boss asks why the AI bill tripled last month. A developer wants to know why the agent took 47 seconds to answer a simple question. And you have absolutely no idea where to start debugging any of it.
Sound familiar? Welcome to the observability gap — honestly, it's the most under-invested layer of the production AI stack right now.
Traditional APM tools like Datadog, New Relic, and Grafana are fantastic at tracking HTTP latency, error rates, and CPU utilization. But LLM applications play by entirely different rules. The “logic” isn't in your code — it's in natural-language prompts interpreted by stochastic models. The cost isn't in compute time — it's in tokens. And the bugs aren't exceptions — they're hallucinations, irrelevant retrievals, and subtly wrong reasoning that your users trust because the model delivered it with absolute confidence.
In 2026, organizations adopting comprehensive AI observability platforms report up to 40% faster time-to-production compared to those using fragmented tooling. Yet so many teams are still stitching together print() statements and custom logging to understand what their LLM apps are doing. That's exactly what we're going to fix here.
We'll build a production-grade observability stack for LLM applications from the ground up — covering tracing, cost attribution, quality monitoring, and alerting — using both open-source tools and the emerging OpenTelemetry GenAI standard that's rapidly becoming the industry baseline.
Why Traditional Observability Breaks Down for LLMs
Before we dive into solutions, let's understand exactly why your existing monitoring stack falls short.
The Non-Determinism Problem
Traditional software is deterministic: same input, same output. If a function returns the wrong value, you can reproduce the bug, write a test, and fix it. LLMs don't work that way. The same prompt can produce different outputs across calls, and “correct” isn't binary — it's a spectrum of relevance, accuracy, and helpfulness. You need to capture the full context of every interaction (prompt, retrieved documents, model response, tool calls) just to have a shot at diagnosing issues.
The Cost Dimension
In traditional applications, cost scales roughly linearly with compute. In LLM applications? A single poorly designed prompt or a runaway agent loop can burn through hundreds of dollars in minutes. Token usage is your new CPU — and you need per-request, per-feature, and per-user cost attribution to manage it effectively.
The Latency Composition Problem
An LLM pipeline might involve a vector database lookup (50ms), a reranking step (200ms), three sequential model calls (2–8 seconds each), and two tool invocations (variable). The total latency is the sum of all those components, and the bottleneck shifts depending on input complexity, context length, and model load. Endpoint-level P99 metrics won't cut it — you need trace-level decomposition to actually pinpoint what's slow.
The Quality Measurement Challenge
When a REST API returns a 500 error, you know something broke. When an LLM returns a confident, well-formatted, completely hallucinated response? Nothing in your traditional monitoring will flag it. Quality monitoring for LLMs requires evaluating semantic correctness — and that's a fundamentally different measurement paradigm.
The Four Pillars of LLM Observability
A production-grade LLM observability stack rests on four pillars, each addressing a specific class of questions:
- Tracing: What happened during this request? Which steps were executed, in what order, with what inputs and outputs?
- Metrics: How is the system performing overall? What are the latency distributions, error rates, token usage patterns, and costs?
- Quality Monitoring: Are the outputs actually good? Are we hallucinating? Is retrieval relevance degrading?
- Alerting & Anomaly Detection: When should a human be notified? What patterns indicate emerging problems?
So, let's build each one.
Pillar 1: Distributed Tracing for LLM Pipelines
Tracing is the foundation. Without it, everything else is guesswork.
A trace captures the full execution path of a single request through your system — every LLM call, every retrieval, every tool invocation, every reranking step — as a tree of spans with timing, inputs, and outputs.
OpenTelemetry GenAI Semantic Conventions
In 2026, the industry has largely converged on OpenTelemetry (OTel) as the standard for distributed tracing, and the GenAI Semantic Conventions (v1.37+) now provide a standardized schema specifically for AI workloads. These conventions define how to capture prompts, model responses, token usage, tool calls, and provider metadata in a vendor-neutral format.
The key span types defined by the conventions include:
- GenAI client spans: Represent client calls to generative AI models, with span names formatted as `{gen_ai.operation.name} {gen_ai.request.model}`
- GenAI agent spans: Represent agent execution cycles including reasoning, tool selection, and action execution
- Tool call spans: Capture individual tool invocations triggered by the model
Here's how to set up OpenTelemetry tracing for an LLM application with the GenAI conventions:
```python
import os

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Configure the tracer provider
resource = Resource(attributes={
    SERVICE_NAME: "my-ai-service",
    "deployment.environment": os.getenv("ENVIRONMENT", "production"),
})
tracer_provider = TracerProvider(resource=resource)

# Export spans to your OTel collector (or directly to Langfuse, Datadog, etc.)
otlp_exporter = OTLPSpanExporter(
    endpoint=os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "localhost:4317"),
    insecure=os.getenv("ENVIRONMENT") != "production",
)
tracer_provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
trace.set_tracer_provider(tracer_provider)

tracer = trace.get_tracer("ai-service.llm", "1.0.0")
```
With the tracer configured, you can instrument individual LLM calls using the GenAI semantic attributes:
```python
from opentelemetry.semconv._incubating.attributes import gen_ai_attributes

async def call_llm(prompt: str, model: str = "claude-sonnet-4-5-20250929") -> str:
    with tracer.start_as_current_span(f"chat {model}") as span:
        # Set GenAI semantic attributes
        span.set_attribute(gen_ai_attributes.GEN_AI_SYSTEM, "anthropic")
        span.set_attribute(gen_ai_attributes.GEN_AI_REQUEST_MODEL, model)
        span.set_attribute(gen_ai_attributes.GEN_AI_REQUEST_TEMPERATURE, 0.7)
        span.set_attribute(gen_ai_attributes.GEN_AI_REQUEST_MAX_TOKENS, 1024)

        # Make the actual API call
        response = await client.messages.create(
            model=model,
            max_tokens=1024,
            temperature=0.7,
            messages=[{"role": "user", "content": prompt}],
        )

        # Record response attributes
        span.set_attribute(
            gen_ai_attributes.GEN_AI_RESPONSE_MODEL, response.model)
        span.set_attribute(
            gen_ai_attributes.GEN_AI_USAGE_INPUT_TOKENS,
            response.usage.input_tokens)
        span.set_attribute(
            gen_ai_attributes.GEN_AI_USAGE_OUTPUT_TOKENS,
            response.usage.output_tokens)

        # Record prompt and completion as span events
        span.add_event("gen_ai.user.message", {"gen_ai.prompt": prompt})
        span.add_event("gen_ai.assistant.message", {
            "gen_ai.completion": response.content[0].text
        })

        return response.content[0].text
```
Using Langfuse for LLM-Native Tracing
While raw OpenTelemetry gives you full control, Langfuse — the most widely adopted open-source LLM observability platform with over 19,000 GitHub stars — provides a higher-level tracing experience purpose-built for AI workloads. It can also serve as an OpenTelemetry backend, so you genuinely get the best of both worlds.
```python
import os

from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context

# Initialize the client
langfuse = Langfuse(
    public_key=os.getenv("LANGFUSE_PUBLIC_KEY"),
    secret_key=os.getenv("LANGFUSE_SECRET_KEY"),
    host=os.getenv("LANGFUSE_HOST", "https://cloud.langfuse.com"),
)

@observe()
def answer_question(user_query: str) -> str:
    # Step 1: Retrieve relevant documents
    docs = retrieve_documents(user_query)

    # Step 2: Rerank results
    ranked_docs = rerank(user_query, docs)

    # Step 3: Generate answer
    answer = generate_answer(user_query, ranked_docs)

    # Add metadata for filtering and analysis
    langfuse_context.update_current_trace(
        user_id="user-123",
        session_id="session-abc",
        metadata={"pipeline_version": "2.1.0"},
        tags=["production", "rag-pipeline"],
    )
    return answer

@observe()
def retrieve_documents(query: str) -> list:
    results = vector_store.similarity_search(query, k=10)
    langfuse_context.update_current_observation(
        metadata={"num_results": len(results), "index": "knowledge-base-v3"}
    )
    return results

@observe()
def rerank(query: str, docs: list) -> list:
    ranked = cross_encoder.rank(query, docs)
    return ranked[:5]

@observe(as_type="generation")
def generate_answer(query: str, context_docs: list) -> str:
    context = "\n\n".join([doc.page_content for doc in context_docs])
    response = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
        }],
    )
    langfuse_context.update_current_observation(
        model="claude-sonnet-4-5-20250929",
        usage_details={
            "input": response.usage.input_tokens,
            "output": response.usage.output_tokens,
        },
    )
    return response.content[0].text
```
The @observe() decorator automatically creates nested spans reflecting your call hierarchy. Each function becomes a span in the trace, giving you a complete picture of every request — from query to response — with timing for each step. It's honestly one of the lowest-effort, highest-payoff instrumentation patterns I've seen.
Production Tracing Best Practices
When tracing at scale, keep these production considerations in mind:
- Sampling: For high-volume systems, trace 10–20% of requests at full fidelity while logging basic metrics (latency, token count, cost) for 100% of traffic. Langfuse supports this via the `sample_rate` parameter.
- Sensitive data: Mask or redact PII from prompts and completions before sending to your observability backend. Most platforms support custom scrubbing functions.
- Async flushing: Both Langfuse and OpenTelemetry batch and send traces asynchronously. For short-lived applications (serverless functions, CLI tools), always call `langfuse.flush()` or `tracer_provider.force_flush()` before the process exits — otherwise you'll lose traces silently.
- Distributed trace context: Use W3C Trace Context propagation headers to connect frontend, backend, and AI service spans into a single unified trace.
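As a concrete example of the scrubbing functions mentioned above, here's a minimal regex-based redactor you could run on prompts and completions before they leave your process. The patterns and placeholder tokens are illustrative, not exhaustive — production PII detection usually warrants a dedicated library or service:

```python
import re

# Illustrative patterns only; extend for your domain (names, account IDs, etc.)
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),          # email addresses
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "<CARD>"),            # card-like digit runs
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "<SSN>"),              # US SSN format
]

def scrub_pii(text: str) -> str:
    """Redact common PII patterns before text is sent to an observability backend."""
    for pattern, replacement in PII_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

You'd apply `scrub_pii()` to prompt and completion strings immediately before handing them to your tracing SDK, so raw PII never reaches the backend.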
Pillar 2: Metrics and Cost Attribution
Traces tell you what happened on individual requests. Metrics tell you how the system is performing over time. For LLM applications, the essential metrics fall into four categories.
Latency Metrics
Track these at both the pipeline and component level:
- End-to-end latency (P50, P95, P99) — the user-facing response time
- Time to first token (TTFT) — critical for streaming responses where perceived speed matters more than total generation time
- Per-component latency — retrieval, reranking, generation, tool execution breakdowns
- Model-specific latency — compare response times across providers and model versions
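Time to first token is straightforward to capture with a thin wrapper around the token iterator your streaming client yields. A sketch — `measure_ttft` is a hypothetical helper name, and in production you'd record the two timings to your latency histograms rather than return them:

```python
import time
from typing import Iterable

def measure_ttft(token_stream: Iterable[str]) -> tuple[str, float, float]:
    """Consume a token stream; return (full_text, ttft_seconds, total_seconds)."""
    start = time.monotonic()
    first_token_at = None
    chunks: list[str] = []
    for chunk in token_stream:
        if first_token_at is None:
            first_token_at = time.monotonic()  # moment the first token arrived
        chunks.append(chunk)
    end = time.monotonic()
    ttft = (first_token_at or end) - start
    return "".join(chunks), ttft, end - start
```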
Token Usage and Cost Tracking
Token usage is the primary cost driver in LLM applications. You need multi-dimensional attribution to actually understand where the money's going:
```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class TokenMetrics:
    _usage: dict = field(default_factory=lambda: defaultdict(lambda: {
        "input_tokens": 0,
        "output_tokens": 0,
        "total_cost_usd": 0.0,
        "request_count": 0,
    }))

    # Cost per million tokens (update as pricing changes)
    MODEL_PRICING = {
        "claude-sonnet-4-5-20250929": {"input": 3.00, "output": 15.00},
        "claude-haiku-4-5-20251001": {"input": 0.80, "output": 4.00},
        "gpt-4o": {"input": 2.50, "output": 10.00},
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    }

    def record(self, model: str, input_tokens: int, output_tokens: int,
               user_id: str = None, feature: str = None):
        pricing = self.MODEL_PRICING.get(model, {"input": 5.0, "output": 15.0})
        cost = (
            input_tokens * pricing["input"]
            + output_tokens * pricing["output"]
        ) / 1_000_000

        # Attribute by multiple dimensions
        for dimension_key in [
            f"model:{model}",
            f"user:{user_id}" if user_id else None,
            f"feature:{feature}" if feature else None,
            "global",
        ]:
            if dimension_key:
                self._usage[dimension_key]["input_tokens"] += input_tokens
                self._usage[dimension_key]["output_tokens"] += output_tokens
                self._usage[dimension_key]["total_cost_usd"] += cost
                self._usage[dimension_key]["request_count"] += 1

    def get_cost_by_dimension(self, prefix: str) -> dict:
        return {
            k: v for k, v in self._usage.items()
            if k.startswith(prefix)
        }
```
For production systems, export these metrics to Prometheus or your preferred metrics backend:
```python
from prometheus_client import Counter, Histogram, Gauge

# Token usage counters
llm_input_tokens = Counter(
    "llm_input_tokens_total",
    "Total input tokens consumed",
    ["model", "feature", "environment"],
)
llm_output_tokens = Counter(
    "llm_output_tokens_total",
    "Total output tokens consumed",
    ["model", "feature", "environment"],
)
llm_cost_usd = Counter(
    "llm_cost_usd_total",
    "Total LLM cost in USD",
    ["model", "feature", "environment"],
)

# Latency histograms
llm_request_duration = Histogram(
    "llm_request_duration_seconds",
    "LLM request duration",
    ["model", "operation"],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0, 60.0],
)

# Quality gauge
llm_quality_score = Gauge(
    "llm_quality_score",
    "Rolling average quality score from evaluations",
    ["pipeline", "metric"],
)
```
Building a Cost Dashboard
With Prometheus metrics flowing, create a Grafana dashboard that answers these questions at a glance:
- Total daily/weekly/monthly spend — broken down by model, feature, and team
- Cost per request — identify which features are disproportionately expensive
- Token efficiency — ratio of output tokens to input tokens, flagging bloated prompts
- Cost anomalies — sudden spikes that might indicate runaway loops or prompt injection
- Budget burn rate — projected monthly spend based on current trajectory
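The budget burn-rate panel reduces to a one-line projection. A sketch, assuming simple linear extrapolation (the function name is my own; a seasonality-aware forecast would be more robust for spiky traffic):

```python
import calendar
from datetime import date

def projected_monthly_spend(month_to_date_usd: float, today: date) -> float:
    """Linearly extrapolate month-end spend from month-to-date actuals."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return month_to_date_usd / today.day * days_in_month
```

Halfway through a 30-day month with $150 spent, this projects $300 for the month — a useful number to alert on against your budget.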
Teams using structured cost attribution have reported cutting LLM spend by up to 90% just by identifying and fixing the top cost drivers: oversized context windows, unnecessary tool calls, and inefficient prompt templates. That's not a typo — the waste in unmonitored LLM systems can be staggering.
Pillar 3: Quality Monitoring and Online Evaluation
This is where LLM observability fundamentally diverges from traditional monitoring. A 200 OK response that contains hallucinated information is worse than a 500 error — because users actually trust it.
Online Evaluation Pipeline
The idea here is to run lightweight quality checks on a sample of production traffic. Here's a pattern for asynchronous online evaluation:
```python
import asyncio
from enum import Enum

class QualityDimension(str, Enum):
    FAITHFULNESS = "faithfulness"
    RELEVANCE = "relevance"
    COMPLETENESS = "completeness"
    TOXICITY = "toxicity"

async def evaluate_response(
    query: str,
    response: str,
    context_docs: list[str],
    dimensions: list[QualityDimension],
    trace_id: str,
) -> dict[str, float]:
    scores = {}
    for dimension in dimensions:
        evaluation_prompt = build_evaluation_prompt(
            dimension, query, response, context_docs
        )

        # Use a fast, cheap model for evaluation
        result = await eval_client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=256,
            messages=[{"role": "user", "content": evaluation_prompt}],
        )
        score = parse_score(result.content[0].text)
        scores[dimension.value] = score

        # Report to Langfuse for correlation with traces
        langfuse.score(
            trace_id=trace_id,
            name=dimension.value,
            value=score,
            comment=result.content[0].text,
        )

        # Export to Prometheus for dashboarding
        llm_quality_score.labels(
            pipeline="rag-qa", metric=dimension.value
        ).set(score)

    return scores
```
The evaluation prompt builder generates dimension-specific rubrics. For faithfulness, it checks whether every claim in the response is grounded in the retrieved context. For relevance, it verifies that the response actually addresses the user's query. Each evaluation returns a 0.0–1.0 score with reasoning.
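A minimal sketch of what those two helpers might look like — the rubric wording and the `Score:` output convention are assumptions for illustration, not a fixed format:

```python
import re

# Hypothetical rubric text per dimension; tune these for your domain
RUBRICS = {
    "faithfulness": "Is every claim in the response supported by the context?",
    "relevance": "Does the response directly address the user's question?",
}

def build_evaluation_prompt(dimension: str, query: str, response: str,
                            context_docs: list[str]) -> str:
    """Assemble a dimension-specific grading prompt for the judge model."""
    context = "\n\n".join(context_docs)
    return (
        f"You are grading an AI response on {dimension}.\n"
        f"Rubric: {RUBRICS.get(dimension, dimension)}\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nResponse: {response}\n\n"
        "Reply with 'Score: <number between 0.0 and 1.0>' on the first line, "
        "then brief reasoning."
    )

def parse_score(judge_output: str, default: float = 0.0) -> float:
    """Extract a 0.0-1.0 score from the judge's reply; fall back to `default`."""
    match = re.search(r"score:\s*([01](?:\.\d+)?)", judge_output, re.IGNORECASE)
    if not match:
        return default
    return min(max(float(match.group(1)), 0.0), 1.0)
```

Because `QualityDimension` is a `str` Enum, its members can be passed directly as the `dimension` argument and used as dict keys.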
Retrieval Quality Monitoring
For RAG systems, it's worth monitoring retrieval quality separately from generation quality. The retrieval layer is often the actual root cause of bad responses, and tracking it independently lets you pinpoint issues way faster:
- Retrieval precision: What fraction of retrieved documents are actually relevant?
- Context utilization: How much of the retrieved context does the model actually use in its response?
- Retrieval latency distribution: Are vector search times degrading as your index grows?
- Empty retrieval rate: How often does the retriever return no relevant results?
```python
import logging

logger = logging.getLogger(__name__)

@observe()
def monitor_retrieval_quality(query: str, retrieved_docs: list, response: str):
    total_context_chars = sum(len(doc.page_content) for doc in retrieved_docs)
    used_chars = sum(
        len(doc.page_content)
        for doc in retrieved_docs
        if any(
            sentence in response
            for sentence in doc.page_content.split(". ")[:3]
        )
    )
    utilization = used_chars / total_context_chars if total_context_chars > 0 else 0

    langfuse_context.update_current_observation(
        metadata={
            "retrieval_count": len(retrieved_docs),
            "context_utilization": round(utilization, 2),
            # Rough heuristic: ~4 characters per token
            "total_context_tokens": total_context_chars // 4,
        }
    )

    if utilization < 0.1:
        logger.warning(
            "Low context utilization detected",
            extra={"query": query, "utilization": utilization},
        )
```
User Feedback Integration
Automated evaluation catches systematic issues, but user feedback catches the nuances that LLM judges miss. Don't underestimate the value of a simple thumbs-up/thumbs-down signal — integrate it directly into your traces:
```python
@app.post("/api/feedback")
async def submit_feedback(
    trace_id: str,
    rating: int,
    comment: str | None = None,
):
    langfuse.score(
        trace_id=trace_id,
        name="user_feedback",
        value=rating,
        comment=comment,
    )

    # Correlate user feedback with automated scores
    trace = langfuse.get_trace(trace_id)
    automated_scores = {
        s.name: s.value for s in trace.scores
        if s.name != "user_feedback"
    }

    if rating <= 2 and automated_scores.get("faithfulness", 1.0) > 0.8:
        # User says bad, but automated eval said good
        logger.warning(
            "Evaluation calibration gap detected",
            extra={
                "trace_id": trace_id,
                "user_rating": rating,
                "auto_scores": automated_scores,
            },
        )

    return {"status": "recorded"}
```
That calibration gap detection (where users rate something poorly but your automated eval scored it high) is genuinely one of the most valuable signals you can track. It tells you exactly where your evaluation rubrics need improvement.
Pillar 4: Alerting and Anomaly Detection
Observability without alerting is just data collection. You need to know when things go wrong before your users tell you.
Essential Alerts for LLM Systems
At a minimum, configure alerts for these conditions:
```yaml
# Prometheus alerting rules (alerts.yml)
groups:
  - name: llm_alerts
    rules:
      # Cost spike detection
      - alert: LLMCostSpike
        expr: |
          rate(llm_cost_usd_total[1h])
            > 2 * rate(llm_cost_usd_total[1h] offset 1d)
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "LLM cost rate is 2x higher than yesterday"

      # Latency degradation
      - alert: LLMLatencyHigh
        expr: |
          histogram_quantile(0.95,
            rate(llm_request_duration_seconds_bucket[5m])) > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "LLM P95 latency exceeds 10 seconds"

      # Quality degradation
      - alert: LLMQualityDrop
        expr: llm_quality_score{metric="faithfulness"} < 0.7
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "Faithfulness score dropped below 0.7"

      # Error rate spike
      - alert: LLMErrorRateHigh
        expr: |
          rate(llm_errors_total[5m])
            / rate(llm_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "LLM error rate exceeds 5%"
```
Anomaly Detection Patterns
Beyond threshold-based alerts, you'll want statistical anomaly detection for patterns where static thresholds just don't make sense:
- Token usage distribution shifts: If average input tokens suddenly doubles, a prompt might have changed or a retrieval pipeline might be returning way too many documents.
- Response length anomalies: Unusually short or long responses can indicate model confusion or prompt injection.
- Tool call frequency changes: An agent suddenly making 10x more tool calls could indicate a reasoning loop — and those loops get expensive fast.
- Cost per user outliers: One user consuming 100x the average cost could indicate abuse or a stuck session.
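A lightweight way to implement several of these checks is a rolling z-score over the metric in question (input tokens, response length, tool-call count per request). A sketch — the class name and the 3-sigma threshold are illustrative choices, not a standard:

```python
import math
from collections import deque

class RollingAnomalyDetector:
    """Flag values more than `threshold` standard deviations from the rolling mean."""

    def __init__(self, window: int = 100, threshold: float = 3.0):
        self.values: deque[float] = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value: float) -> bool:
        is_anomaly = False
        if len(self.values) >= 10:  # need a baseline before flagging anything
            mean = sum(self.values) / len(self.values)
            var = sum((v - mean) ** 2 for v in self.values) / len(self.values)
            std = math.sqrt(var)
            if std > 0 and abs(value - mean) / std > self.threshold:
                is_anomaly = True
        self.values.append(value)
        return is_anomaly
```

Feed it per-request values; it returns True when a value sits far outside the recent distribution, which is exactly the "sudden 10x shift" signature the bullets above describe.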
Choosing Your Observability Stack
The LLM observability market has matured significantly in 2026. Here's how to think about choosing the right setup for your team.
Open-Source Self-Hosted: Langfuse + OpenTelemetry + Prometheus/Grafana
Best for: Teams with data residency requirements, high volume (where per-request pricing would hurt), and existing infrastructure expertise.
- Langfuse handles LLM-specific tracing, prompt management, evaluation, and cost tracking. Self-host with Docker or Kubernetes. MIT licensed with no restrictions.
- OpenTelemetry provides the vendor-neutral instrumentation layer. Your code instruments with OTel, and you choose where the data goes.
- Prometheus + Grafana handle time-series metrics, dashboarding, and alerting.
This stack gives you full control and zero per-request costs, but you'll need to maintain the infrastructure yourself. If you've got a platform team that's comfortable running services, it's a great choice.
Managed Platform: LangSmith, Datadog LLM Observability, or Helicone
Best for: Teams that want to move fast without managing infrastructure.
- LangSmith is the strongest choice for LangChain/LangGraph users. It integrates automatically, adds virtually no measurable overhead, and now supports OpenTelemetry ingestion from non-LangChain applications too.
- Datadog LLM Observability is ideal if your organization already uses Datadog for APM. It natively supports OTel GenAI Semantic Conventions (v1.37+), so you can correlate LLM traces with infrastructure metrics in one platform.
- Helicone takes a proxy-based approach — route your API calls through Helicone's proxy and get instant observability with zero code changes. It's surprisingly effective for rapid adoption.
Hybrid: OpenTelemetry Instrumentation + Multiple Backends
Best for: Teams that want flexibility without vendor lock-in.
The most future-proof architecture (in my opinion) is to instrument everything with OpenTelemetry GenAI conventions and then route traces to whichever backends serve your needs. Send to Langfuse for AI-specific analysis, to Datadog for infrastructure correlation, and to your data warehouse for long-term analytics — all from a single instrumentation layer.
```yaml
# docker-compose.yml — OTel Collector routing to multiple backends
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config", "/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"   # gRPC OTLP receiver
      - "4318:4318"   # HTTP OTLP receiver
```

```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 256

exporters:
  otlphttp/langfuse:
    endpoint: "https://cloud.langfuse.com/api/public/otel"
    headers:
      Authorization: "Basic ${LANGFUSE_BASE64_CREDENTIALS}"
  otlp/datadog:
    endpoint: "https://trace.agent.datadoghq.com"
    headers:
      DD-API-KEY: "${DD_API_KEY}"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/langfuse, otlp/datadog]
```
A Complete Production Observability Setup: Step by Step
Let's bring everything together into a cohesive setup you can actually deploy this week.
Step 1: Instrument Your Application
Wrap your AI pipeline with Langfuse decorators for LLM-native tracing and OpenTelemetry for infrastructure spans:
```python
# app/observability.py
import os

from langfuse import Langfuse
from langfuse.decorators import observe
from prometheus_client import Counter, Histogram

# Langfuse for AI-specific tracing
langfuse = Langfuse(
    public_key=os.getenv("LANGFUSE_PUBLIC_KEY"),
    secret_key=os.getenv("LANGFUSE_SECRET_KEY"),
    host=os.getenv("LANGFUSE_HOST", "https://cloud.langfuse.com"),
    sample_rate=float(os.getenv("LANGFUSE_SAMPLE_RATE", "0.2")),
)

# Prometheus metrics
request_counter = Counter(
    "ai_requests_total", "Total AI requests", ["pipeline", "status"]
)
latency_hist = Histogram(
    "ai_request_seconds", "Request latency", ["pipeline"]
)
token_counter = Counter(
    "ai_tokens_total", "Tokens consumed", ["model", "direction"]
)
cost_counter = Counter(
    "ai_cost_usd", "Cost in USD", ["model", "pipeline"]
)
```
Step 2: Add Quality Gates
Sample production traffic for online evaluation. Run evaluations asynchronously so you don't add latency to user requests:
```python
import asyncio
import random

from langfuse.decorators import observe, langfuse_context

EVAL_SAMPLE_RATE = 0.1  # Evaluate 10% of production requests

@observe()
async def rag_pipeline(query: str, user_id: str) -> dict:
    with latency_hist.labels(pipeline="rag").time():
        docs = await retrieve(query)
        answer = await generate(query, docs)

    request_counter.labels(pipeline="rag", status="success").inc()

    # Capture the trace ID unconditionally so it can always be returned
    trace_id = langfuse_context.get_current_trace_id()

    # Sample for quality evaluation (non-blocking)
    if random.random() < EVAL_SAMPLE_RATE:
        asyncio.create_task(
            evaluate_response(
                query=query,
                response=answer,
                context_docs=[d.page_content for d in docs],
                dimensions=[
                    QualityDimension.FAITHFULNESS,
                    QualityDimension.RELEVANCE,
                ],
                trace_id=trace_id,
            )
        )

    return {"answer": answer, "trace_id": trace_id}
```
Step 3: Deploy Monitoring Infrastructure
Use the OTel Collector configuration shown earlier to route traces to your backends. Deploy Prometheus and Grafana for metrics, and configure the alerting rules to catch cost spikes, latency degradation, and quality drops. This part is fairly standard infrastructure work — the LLM-specific magic is all in the instrumentation and evaluation layers.
Step 4: Build Your Debugging Workflow
When something goes wrong (and it will), follow this triage path:
- Alert fires (e.g., “Faithfulness score dropped below 0.7”)
- Check Grafana dashboard — Is it a spike or a trend? Which pipeline/model/feature is affected?
- Query Langfuse — Filter traces by low quality scores, inspect the actual prompts, retrieved documents, and model responses
- Identify root cause — Common culprits: stale embeddings, prompt regression from a recent deployment, model version change, or retrieval index corruption
- Fix and verify — Deploy the fix, monitor the quality score, confirm recovery
Having this workflow documented and practiced before your first production incident makes all the difference. Trust me on that one.
Key Takeaways and Next Steps
LLM observability isn't optional — it's the difference between an AI demo and an AI product. Here's what to take away:
- Start with tracing. You can't optimize what you can't see. Instrument your pipelines with Langfuse decorators or OpenTelemetry GenAI conventions before doing anything else.
- Track costs from day one. Token-level cost attribution pays for itself within weeks by exposing inefficient prompts, unnecessary context, and runaway agent loops.
- Automate quality monitoring. LLM-as-judge evaluation on a sample of production traffic catches hallucination regressions before users report them.
- Instrument with OpenTelemetry. The GenAI Semantic Conventions are the emerging standard. Instrumenting with OTel protects you from vendor lock-in and lets you route data to any backend.
- Alert on what matters. Cost spikes, latency degradation, quality score drops, and error rate changes — those are your four essential alert categories.
The production LLM observability stack is no longer experimental — it's a well-defined architecture with mature tooling. The teams that invest in it now will iterate faster, spend less, and deliver more reliable AI experiences. The ones that don't will keep flying blind, debugging customer complaints with print() statements, and wondering why their AI bill keeps climbing.
In the next article in this series, we'll explore fine-tuning strategies — when to fine-tune, how to prepare data, and how to evaluate whether your fine-tuned model actually outperforms prompt engineering. But first, get your observability house in order. You'll need it to measure whether that fine-tune was worth it.