Context Engineering for Production AI Agents: A Python Implementation Guide

Context engineering — curating what an LLM sees at inference time — is now the defining skill for AI engineers. This guide walks through the four core strategies (write, select, compress, isolate) with production Python implementations using LangGraph, reranking pipelines, and multi-agent isolation.

Why Context Engineering Is the #1 Job of AI Engineers in 2026

Here's a truth that took the AI engineering community a few years to fully absorb: the quality of an agent's output depends less on the model itself and more on what you feed into its context window. You can swap GPT-4.1 for Claude Opus 4.6 or Gemini 3.1 Pro — honestly, it doesn't matter much. If the context is poorly structured, bloated, or stale, your agent will still produce mediocre results. On the flip side, a well-engineered context pipeline can make a mid-tier model outperform a frontier one running on raw conversation history.

That realization gave rise to context engineering — the discipline of designing, selecting, structuring, compressing, and maintaining the information that enters an LLM's context window at inference time. As Andrej Karpathy put it, it's "the delicate art and science of filling the context window with just the right information for the next step."

If prompt engineering taught us how to ask the question, context engineering teaches us what the model should know when answering it.

And for production AI agents — systems running multi-step workflows with dozens of LLM calls, tool invocations, and memory retrievals — getting context right is the difference between a reliable system and one that fails unpredictably at 3 AM. I've seen this firsthand: a well-tuned context pipeline turned a flaky research agent into something our team could actually trust overnight.

This guide covers context engineering from first principles through production Python implementations. You'll learn the four core strategies (write, select, compress, isolate), implement each one with working code, and understand the trade-offs that determine which patterns fit your use case.

The Context Window: Your Agent's Working Memory

Think of the LLM as a CPU and the context window as RAM. It's the model's working memory — everything it can "see" during a single inference call. Every token matters: system instructions, conversation history, retrieved documents, tool definitions, tool call results, and the user's actual question all compete for the same finite space.

The Numbers in 2026

Modern models advertise impressive context windows — Claude supports 200K tokens, Gemini 3 Pro reaches 2 million tokens, and GPT-4.5 provides 256K tokens. These numbers are misleading in practice, though, for three reasons:

  • Context rot: A 2025 Chroma study tested 18 frontier models and found that every single one performed worse as input grew. The NoLiMa study confirmed that at 32K tokens, 11 out of 12 models dropped below 50% of their short-context performance. That's a pretty dramatic cliff.
  • Cost compounds across turns: input tokens are billed per token, and because an agent re-sends its growing history on every call, cumulative cost grows super-linearly with conversation length. Full-context approaches can cost 14x more than selective memory approaches with only marginally better accuracy.
  • Latency compounds: Inference latency for 2 million tokens can reach 30–60 seconds — way too slow for real-time agent interactions.

The pragmatic takeaway: a smaller, tightly curated context consistently outperforms a large context filled with loosely relevant material. Context engineering is the discipline of achieving that curation systematically.

What Competes for Context Space

Every component of an agent's context consumes tokens from the same finite budget:

┌─────────────────────────────────────────┐
│           CONTEXT WINDOW                │
│                                         │
│  ┌─────────────────────────────────┐    │
│  │ System Prompt (500–2000 tokens) │    │
│  ├─────────────────────────────────┤    │
│  │ Tool Definitions (200–500 each) │    │
│  ├─────────────────────────────────┤    │
│  │ Retrieved Documents (RAG)       │    │
│  ├─────────────────────────────────┤    │
│  │ Memory / Prior Knowledge        │    │
│  ├─────────────────────────────────┤    │
│  │ Conversation History            │    │
│  ├─────────────────────────────────┤    │
│  │ Current User Query              │    │
│  └─────────────────────────────────┘    │
│                                         │
│  Remaining = generation headroom        │
└─────────────────────────────────────────┘

Every token you allocate to one component reduces the headroom available for others and for the model's response. Context engineering is fundamentally a resource allocation problem — and the four strategies below are your tools for solving it.

The Four Core Strategies: Write, Select, Compress, Isolate

Anthropic's engineering team distilled context engineering into four primary strategies in their widely-cited blog post on effective context engineering for AI agents. These strategies aren't mutually exclusive — production systems typically combine all four. So, let's implement each one.

Strategy 1: Write — Persist Information Outside the Context

The simplest and (honestly) most underused strategy: don't force the model to remember everything. Write critical information to external storage where it can be reliably accessed when needed, rather than keeping it in the ever-growing context window.

This mirrors how humans work. We take notes, maintain to-do lists, and write documentation precisely because our working memory is limited. AI agents should do the same.

Implementing Scratchpads with LangGraph

A scratchpad is a persistent note-taking area that the agent can write to and read from across steps. Here's a production-ready implementation using LangGraph:

from langgraph.graph import StateGraph, MessagesState
from typing import Annotated
import operator

class AgentState(MessagesState):
    """Extended state with a scratchpad for persistent notes."""
    scratchpad: Annotated[list[str], operator.add]

def research_node(state: AgentState, config):
    """Agent node that writes findings to the scratchpad."""
    messages = state["messages"]
    scratchpad = state.get("scratchpad", [])

    # Build context-aware prompt with scratchpad contents
    scratchpad_context = ""
    if scratchpad:
        scratchpad_context = (
            "\n\nYour research notes so far:\n"
            + "\n".join(f"- {note}" for note in scratchpad)
        )

    # `llm` is assumed to be an initialized LangChain chat model
    response = llm.invoke([
        {"role": "system", "content": (
            "You are a research agent. After each research step, "
            "write a concise summary of your findings to your scratchpad. "
            "Use the scratchpad to track what you have learned and what "
            "questions remain."
            f"{scratchpad_context}"
        )},
        *messages
    ])

    # Extract notes from the response and persist them
    new_notes = extract_notes(response.content)
    return {
        "messages": [response],
        "scratchpad": new_notes
    }

graph = StateGraph(AgentState)
graph.add_node("research", research_node)
# ... add edges, compile with checkpointer ...
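
The extract_notes helper is referenced but not defined above. Here's one hypothetical implementation, assuming the system prompt tells the model to prefix each finding with "NOTE:" (adapt the convention to whatever you instruct):

def extract_notes(text: str) -> list[str]:
    """Hypothetical parser: collect lines the model prefixed with 'NOTE:'."""
    notes = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("NOTE:"):
            notes.append(line.removeprefix("NOTE:").strip())
    return notes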

The key insight here: the scratchpad persists across graph steps but doesn't grow the conversation history. The agent can write 50 findings to the scratchpad and then use only the relevant ones in each subsequent step, rather than carrying all 50 in the message history. That's a massive difference in practice.

Structured Note-Taking with External Files

Claude Code uses a particularly elegant variation of this pattern: it writes structured notes to a NOTES.md file that persists outside the context window entirely. The agent can read the file when needed and update it as new information emerges. This works because:

  • Notes survive context window compaction (summarization)
  • The agent can selectively read only relevant sections
  • Multiple agents can share the same notes file
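
Here's a minimal sketch of the file-based pattern. The NotesFile helper below is illustrative, not Claude Code's actual implementation:

from pathlib import Path

class NotesFile:
    """Minimal file-backed notes store, modeled on the NOTES.md pattern."""

    def __init__(self, path: str = "NOTES.md"):
        self.path = Path(path)
        self.path.touch(exist_ok=True)

    def append(self, section: str, note: str) -> None:
        """Persist a note under a markdown section header."""
        with self.path.open("a") as f:
            f.write(f"\n## {section}\n- {note}\n")

    def read_section(self, section: str) -> str:
        """Load one section back into context, not the whole file."""
        chunks = self.path.read_text().split("## ")
        matches = [c for c in chunks if c.startswith(section)]
        return f"## {matches[0]}".rstrip() if matches else ""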

Strategy 2: Select — Pull the Right Context In

Selection is the decision-making process of what to include in the context window. RAG (Retrieval-Augmented Generation) is the best-known selection technique, but context engineering goes further — it also covers tool selection, memory retrieval, and conversation history filtering.

Implementing Relevance-Scored RAG with Reranking

Naive RAG retrieves the top-k documents by vector similarity and stuffs them all into context. That's... fine for demos. Production context engineering adds a reranking step that scores each retrieved document against the actual query, filtering out noise before it reaches the model:

from sentence_transformers import CrossEncoder
import numpy as np

class ContextSelector:
    """Selects and ranks context for optimal LLM performance."""

    def __init__(self, vector_store, reranker_model="cross-encoder/ms-marco-MiniLM-L-12-v2"):
        self.vector_store = vector_store
        self.reranker = CrossEncoder(reranker_model)
        self.relevance_threshold = 0.3

    def select_context(
        self,
        query: str,
        max_tokens: int = 4000,
        source_filter: dict | None = None,
    ) -> list[dict]:
        """Retrieve, rerank, and budget-fit context for a query."""

        # Step 1: Broad retrieval (cast a wide net)
        candidates = self.vector_store.similarity_search(
            query, k=20, filter=source_filter
        )

        # Step 2: Rerank with cross-encoder for precision
        pairs = [(query, doc.page_content) for doc in candidates]
        scores = self.reranker.predict(pairs)

        scored_docs = sorted(
            zip(candidates, scores),
            key=lambda x: x[1],
            reverse=True,
        )

        # Step 3: Budget-aware selection
        selected = []
        token_budget = max_tokens
        for doc, score in scored_docs:
            if score < self.relevance_threshold:
                break
            doc_tokens = len(doc.page_content.split()) * 1.3  # rough estimate
            if doc_tokens > token_budget:
                continue
            selected.append({
                "content": doc.page_content,
                "score": float(score),
                "source": doc.metadata.get("source", "unknown"),
            })
            token_budget -= doc_tokens

        return selected

The critical addition is the relevance_threshold. Without it, your RAG pipeline will happily inject low-relevance documents into context simply because they were the "best" of a bad set. In production, it's better to return fewer documents than to pollute context with noise. Trust me on this one — I've debugged too many agents where the root cause turned out to be irrelevant RAG results muddying the waters.

Dynamic Tool Selection

Tool definitions consume 200–500 tokens each. An agent with 30 tools burns 6,000–15,000 tokens on tool definitions alone — before a single user message is even processed. Worse, too many tools create ambiguous decision points that degrade tool-calling accuracy.

The solution? RAG-applied-to-tools: retrieve only the tools relevant to the current task. Recent research shows this approach triples tool-calling accuracy while reducing prompt token usage by over 50%:

import numpy as np

class ToolSelector:
    """Dynamically selects relevant tools based on the current query."""

    def __init__(self, all_tools: list[dict], embedder):
        self.all_tools = all_tools
        self.embedder = embedder
        # Pre-compute embeddings for tool descriptions
        self.tool_embeddings = embedder.encode([
            f"{t['name']}: {t['description']}" for t in all_tools
        ])

    def select_tools(self, query: str, max_tools: int = 5) -> list[dict]:
        """Select the most relevant tools for a given query."""
        query_embedding = self.embedder.encode([query])
        # Assumes the embedder returns L2-normalized vectors, so the
        # dot product below equals cosine similarity
        similarities = np.dot(self.tool_embeddings, query_embedding.T).flatten()

        top_indices = np.argsort(similarities)[-max_tools:][::-1]
        return [self.all_tools[i] for i in top_indices if similarities[i] > 0.25]

This pattern is especially valuable in MCP-heavy architectures where agents connect to multiple tool servers, each exposing dozens of tools.
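
Usage takes a few lines. This sketch assumes a sentence-transformers embedder, wrapped so its vectors are L2-normalized and the dot-product scoring above behaves like cosine similarity:

from sentence_transformers import SentenceTransformer

class NormalizingEmbedder:
    """Thin wrapper so ToolSelector's dot products equal cosine similarity."""
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def encode(self, texts: list[str]):
        return self.model.encode(texts, normalize_embeddings=True)

tools = [
    {"name": "search_web", "description": "Search the web for current information"},
    {"name": "query_db", "description": "Run read-only SQL against the app database"},
    # ... plus the rest of your tool registry ...
]

selector = ToolSelector(tools, embedder=NormalizingEmbedder())
relevant = selector.select_tools("Pull last quarter's signup numbers", max_tools=5)
# Pass only `relevant` into the LLM call, not all 30 definitions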

Strategy 3: Compress — Reduce Context Without Losing Signal

Compression keeps the context window lean by reducing the token count of existing context while preserving the information the agent actually needs. There are two main approaches: summarization (lossy) and truncation with metadata preservation (structural).

Conversation Compaction

Compaction is the practice of summarizing a conversation when it approaches the context window limit, then continuing with that summary. This is the pattern Claude Code uses — it triggers "auto-compact" after exceeding 95% of the context window. Here's a production implementation:

from langchain_core.messages import SystemMessage, HumanMessage, AIMessage

class ConversationCompactor:
    """Compacts conversation history to stay within token budgets."""

    def __init__(self, llm, max_context_tokens: int = 100_000):
        self.llm = llm
        self.max_context_tokens = max_context_tokens
        self.compaction_threshold = 0.80  # trigger at 80% usage

    def should_compact(self, messages: list) -> bool:
        """Check if conversation needs compaction."""
        total_tokens = sum(self._estimate_tokens(m) for m in messages)
        return total_tokens > (self.max_context_tokens * self.compaction_threshold)

    def compact(self, messages: list, preserve_recent: int = 6) -> list:
        """Summarize older messages, keep recent ones intact."""
        if len(messages) <= preserve_recent:
            return messages

        old_messages = messages[:-preserve_recent]
        recent_messages = messages[-preserve_recent:]

        summary = self.llm.invoke([
            SystemMessage(content=(
                "Summarize the following conversation history. "
                "Preserve: all decisions made, action items, key facts, "
                "user preferences, and any unresolved questions. "
                "Discard: greetings, repeated information, "
                "and verbose tool outputs."
            )),
            HumanMessage(content=self._format_messages(old_messages))
        ])

        return [
            SystemMessage(content=(
                f"Previous conversation summary:\n{summary.content}"
            )),
            *recent_messages,
        ]

    def _estimate_tokens(self, message) -> int:
        # Rough heuristic: ~1.3 tokens per whitespace-delimited word
        return int(len(str(message.content).split()) * 1.3)

    def _format_messages(self, messages: list) -> str:
        return "\n".join(
            f"{m.type}: {m.content}" for m in messages
        )

The preserve_recent parameter is crucial. Always keep the last several messages intact — they carry the immediate task context that the model needs for its next response. Compaction should only target older history.

Tool Output Compression

Tool outputs are one of the biggest sources of context bloat in agentic workflows. A database query might return 50 rows when the agent only needs 3. An API call might return a full JSON response when only two fields matter. Compressing tool outputs before they enter the context window is one of the highest-impact optimizations you can make:

class ToolOutputCompressor:
    """Compresses tool outputs before they enter the context window."""

    def __init__(self, llm):
        self.llm = llm
        self.max_output_tokens = 1000

    def compress(self, tool_name: str, raw_output: str, query_context: str) -> str:
        """Compress tool output, keeping only what is relevant to the task."""
        estimated_tokens = len(raw_output.split()) * 1.3

        if estimated_tokens <= self.max_output_tokens:
            return raw_output

        response = self.llm.invoke([
            SystemMessage(content=(
                f"The tool '{tool_name}' returned a large output. "
                "Summarize it, retaining all information relevant to "
                "the user's current task. Discard boilerplate, "
                "redundant entries, and metadata not needed for the "
                "next reasoning step."
            )),
            HumanMessage(content=(
                f"User's task context: {query_context}\n\n"
                f"Tool output:\n{raw_output}"
            ))
        ])
        return response.content

This pattern works exceptionally well for agents that interact with databases, APIs, or file systems — basically anywhere tool outputs can be unpredictably large.

Strategy 4: Isolate — Divide Context Across Multiple Agents

When a single agent's task is complex enough that no amount of compression can keep the context clean, the answer is isolation — splitting the work across multiple specialized sub-agents, each with its own focused context window.

Anthropic's own research confirmed that multiple agents with isolated contexts outperform a single agent trying to juggle everything in one window. This makes intuitive sense: each sub-agent can dedicate its full context to a specific subtask without being distracted by irrelevant information from other parts of the workflow.

Multi-Agent Context Isolation with LangGraph

from langgraph.graph import StateGraph, MessagesState, END

# planner_llm, research_llm, code_llm, review_agent, and the
# extract_*_task helpers are assumed to be defined elsewhere

class OrchestratorState(MessagesState):
    """Orchestrator state tracks subtask results, not full sub-agent context."""
    research_summary: str
    code_output: str
    review_result: str

def orchestrator_node(state: OrchestratorState):
    """Routes tasks to specialized sub-agents."""
    messages = state["messages"]
    user_request = messages[-1].content

    # Classify the task (lightweight LLM call)
    task_plan = planner_llm.invoke([
        SystemMessage(content=(
            "Break down the user request into subtasks. "
            "Available agents: researcher, coder, reviewer."
        )),
        HumanMessage(content=user_request)
    ])
    return {"messages": [task_plan]}

def research_agent(state: OrchestratorState):
    """Sub-agent with its own isolated context for research."""
    # This agent gets ONLY the research-relevant context
    task = extract_research_task(state["messages"])
    result = research_llm.invoke([
        SystemMessage(content="You are a research specialist. ..."),
        HumanMessage(content=task)
    ])
    # Return only the summary, not the full research context
    return {"research_summary": result.content}

def code_agent(state: OrchestratorState):
    """Sub-agent with its own isolated context for coding."""
    research = state.get("research_summary", "")
    task = extract_coding_task(state["messages"])
    result = code_llm.invoke([
        SystemMessage(content="You are a coding specialist. ..."),
        HumanMessage(content=(
            f"Research findings:\n{research}\n\nTask:\n{task}"
        ))
    ])
    return {"code_output": result.content}

graph = StateGraph(OrchestratorState)
graph.add_node("orchestrator", orchestrator_node)
graph.add_node("researcher", research_agent)
graph.add_node("coder", code_agent)
graph.add_node("reviewer", review_agent)
# ... add conditional edges based on task plan ...

Here's the key principle: sub-agents return summaries, not their full context. The orchestrator never sees the 20 documents the researcher retrieved or the 15 failed code attempts the coder tried. It only sees the distilled result, keeping its own context clean for high-level reasoning.

Defending Against Context Rot in Production

Context rot is the phenomenon where model performance degrades as the context window fills with accumulated information — even when every individual piece of information is correct and relevant. It's the single most insidious failure mode in production agents, because the agent continues producing confident-sounding outputs that are subtly wrong.

If that sounds scary, it should be.

Three Failure Modes to Watch For

  • Context poisoning: A hallucinated fact from an earlier step makes it into context and gets treated as ground truth in subsequent steps. The agent confidently builds on a fabrication. This is particularly dangerous because the hallucinated content carries the same formatting and confidence markers as legitimate information.
  • Context distraction: Irrelevant but semantically similar content pulls the model's attention away from the actual task. This is the "lost in the middle" phenomenon at scale — the model attends to noisy content near the start or end of context while missing critical information buried in the middle.
  • Context confusion: Contradictory information from different sources (or different points in time) causes the model to produce inconsistent responses. For example, a tool output says the file has 10 lines, while the conversation history says 15 — the model may arbitrarily pick either.
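
One lightweight defense against poisoning and confusion, sketched here as a general pattern rather than any framework's API, is to tag every context entry with its provenance. Contradictions then trace back to a source, and unverified model claims stay distinguishable from verified tool outputs:

from dataclasses import dataclass

@dataclass
class ContextEntry:
    content: str
    source: str             # e.g. "tool:read_file", "rag:docs/api.md", "model"
    timestamp: float        # when the fact entered context (time.time())
    verified: bool = False  # True for tool outputs and retrieved documents

def render_context(entries: list[ContextEntry]) -> str:
    """Render entries with visible provenance markers the model can cite."""
    # Unverified and older entries first; verified, recent facts last,
    # at the end of context where models attend most reliably
    ordered = sorted(entries, key=lambda e: (e.verified, e.timestamp))
    return "\n".join(f"[{e.source}] {e.content}" for e in ordered)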

A Context Health Monitor

In production, you should actively monitor the health of your agent's context. Here's a lightweight monitor that tracks token usage and signals when intervention is needed:

from dataclasses import dataclass, field

@dataclass
class ContextHealthMetrics:
    total_tokens: int = 0
    system_tokens: int = 0
    tool_def_tokens: int = 0
    history_tokens: int = 0
    rag_tokens: int = 0
    generation_headroom: int = 0
    utilization_pct: float = 0.0
    tool_output_ratio: float = 0.0
    timestamps: list[float] = field(default_factory=list)

class ContextHealthMonitor:
    """Monitors context window health and triggers interventions."""

    def __init__(self, max_tokens: int = 200_000):
        self.max_tokens = max_tokens
        self.compaction_threshold = 0.75
        self.alert_threshold = 0.90

    def analyze(self, messages: list, tools: list, rag_docs: list) -> ContextHealthMetrics:
        metrics = ContextHealthMetrics()

        metrics.system_tokens = self._count_tokens(
            [m for m in messages if m.type == "system"]
        )
        metrics.tool_def_tokens = int(sum(
            len(str(t).split()) * 1.3 for t in tools
        ))
        metrics.history_tokens = self._count_tokens(
            [m for m in messages if m.type in ("human", "ai")]
        )
        metrics.rag_tokens = int(sum(
            len(d["content"].split()) * 1.3 for d in rag_docs
        ))
        metrics.total_tokens = (
            metrics.system_tokens + metrics.tool_def_tokens
            + metrics.history_tokens + metrics.rag_tokens
        )
        metrics.generation_headroom = self.max_tokens - metrics.total_tokens
        metrics.utilization_pct = metrics.total_tokens / self.max_tokens

        # Track tool output bloat
        tool_outputs = [
            m for m in messages if getattr(m, "type", "") == "tool"
        ]
        if messages:
            metrics.tool_output_ratio = (
                self._count_tokens(tool_outputs) / max(metrics.total_tokens, 1)
            )

        return metrics

    def recommend_action(self, metrics: ContextHealthMetrics) -> str:
        if metrics.utilization_pct >= self.alert_threshold:
            return "COMPACT_NOW"
        if metrics.utilization_pct >= self.compaction_threshold:
            return "COMPACT_SOON"
        if metrics.tool_output_ratio > 0.5:
            return "COMPRESS_TOOL_OUTPUTS"
        return "OK"

    def _count_tokens(self, messages) -> int:
        return int(sum(len(str(m.content).split()) * 1.3 for m in messages))

Integrate this monitor into your agent loop and trigger compaction or tool output compression automatically when thresholds are crossed. The key is acting proactively — by the time the context window hits 95% capacity, performance has already started to degrade.
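
Here's a sketch of that wiring, reusing the ConversationCompactor and ToolOutputCompressor from earlier sections (and assuming mutable LangChain message objects):

monitor = ContextHealthMonitor(max_tokens=200_000)
compactor = ConversationCompactor(llm, max_context_tokens=200_000)
output_compressor = ToolOutputCompressor(llm)

def prepare_context(messages: list, tools: list, rag_docs: list, query: str) -> list:
    """Check context health before each LLM call and intervene as needed."""
    metrics = monitor.analyze(messages, tools, rag_docs)
    action = monitor.recommend_action(metrics)

    if action in ("COMPACT_NOW", "COMPACT_SOON"):
        messages = compactor.compact(messages, preserve_recent=6)
    elif action == "COMPRESS_TOOL_OUTPUTS":
        for m in messages:
            if getattr(m, "type", "") == "tool":
                m.content = output_compressor.compress(
                    getattr(m, "name", "tool"), str(m.content), query
                )
    return messages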

Putting It All Together: A Context-Engineered Agent Pipeline

Here's how the four strategies combine in a production agent architecture. The following diagram shows the flow of context through each stage:

User Query
    │
    ▼
┌──────────────────────┐
│ Context Health Check │ ──→ Trigger compaction if needed
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│   Tool Selection     │ ──→ RAG over tool descriptions (SELECT)
│   (5 of 30 tools)    │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│   Memory Retrieval   │ ──→ Semantic search + reranking (SELECT)
│   (top 3 memories)   │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│   RAG Retrieval      │ ──→ Budget-aware document selection (SELECT)
│   (4000 token budget)│
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│   Assemble Context   │ ──→ System prompt + tools + memory + RAG + history
│   + Token Budget     │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│   LLM Inference      │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│  Tool Call?          │──yes──→ Execute tool
│                      │              │
└──────────┬───────────┘              ▼
           │              ┌──────────────────────┐
           │              │ Compress Tool Output │ (COMPRESS)
           │              └──────────┬───────────┘
           │                         │
           ▼                         ▼
┌──────────────────────┐
│  Write to Scratchpad │ ──→ Persist findings externally (WRITE)
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│  Complex subtask?    │──yes──→ Delegate to sub-agent (ISOLATE)
└──────────┬───────────┘
           │
           ▼
       Response

Each strategy addresses a different part of the context lifecycle. In practice, most production agents need all four — just weighted differently based on their workload. A research agent doing long-horizon analysis leans heavily on write and compress. A customer support agent handling quick queries prioritizes select and isolate.

Production Trade-Offs and Decision Framework

Context engineering involves real trade-offs. There's no magic config that works for every use case. Here's a framework for making the right choices.

When to Use Each Strategy

Strategy   Best For                                  Cost                        Risk
Write      Long-running tasks, multi-step workflows  Low (storage + read I/O)    Stale notes if not updated
Select     Knowledge-heavy tasks, large tool sets    Medium (retrieval latency)  Missing relevant context if threshold is too aggressive
Compress   Long conversations, verbose tool outputs  Medium (extra LLM call)     Information loss during summarization
Isolate    Complex, multi-domain tasks               High (multiple LLM calls)   Coordination overhead, lost cross-domain nuance

Token Budget Allocation Guidelines

Based on production experience across multiple agent architectures, here's a starting-point budget allocation for a 200K-token context window:

  • System prompt + tool definitions: 5–10% (10K–20K tokens). Keep system prompts concise, tools minimal.
  • Retrieved context (RAG + memory): 15–25% (30K–50K tokens). Quality over quantity — rerank aggressively.
  • Conversation history: 20–30% (40K–60K tokens). Use compaction to keep this in range.
  • Generation headroom: 35–50% (70K–100K tokens). The model needs room to think. Don't sacrifice this.

That generation headroom number surprises a lot of engineers. But models reason through complex problems by generating intermediate tokens — chain-of-thought, self-correction, structured output formatting. Cramming the context window to 95% capacity leaves no room for any of that, and you'll see quality degrade sharply.
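
These percentages translate directly into hard caps you can enforce at assembly time. A minimal sketch, taking the upper end of each range:

def allocate_budget(window: int = 200_000) -> dict[str, int]:
    """Turn the guideline percentages into per-component token caps."""
    shares = {
        "system_and_tools": 0.10,  # system prompt + tool definitions
        "retrieved": 0.25,         # RAG documents + memory
        "history": 0.30,           # conversation history (post-compaction)
        "headroom": 0.35,          # reserved for generation; never allocate
    }
    return {name: int(window * pct) for name, pct in shares.items()}

budgets = allocate_budget()
# {'system_and_tools': 20000, 'retrieved': 50000,
#  'history': 60000, 'headroom': 70000}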

Measuring Context Quality

You can't improve what you don't measure. Track these metrics in your observability stack:

  • Context utilization: What percentage of the context window is used per LLM call? Track percentile distributions (p50, p95), not averages; a short tracking sketch follows this list.
  • Retrieval precision: Of the documents injected via RAG, what fraction does the model actually reference in its response? Low precision means your selection is too broad.
  • Compaction frequency: How often does your agent trigger compaction? Frequent compaction suggests the workflow is too chatty or tool outputs are too verbose.
  • Token cost per task: Total input + output tokens per completed user task. This is the number that context engineering directly reduces.
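
As promised above, tracking utilization percentiles takes only a few lines. This sketch reuses the ContextHealthMetrics class from the monitor section:

import numpy as np

utilization_log: list[float] = []

def record_call(metrics: ContextHealthMetrics) -> None:
    """Append one data point per LLM call."""
    utilization_log.append(metrics.utilization_pct)

def utilization_report() -> dict[str, float]:
    """Report p50/p95 context utilization; averages hide the worst calls."""
    arr = np.array(utilization_log)
    return {
        "p50": float(np.percentile(arr, 50)),
        "p95": float(np.percentile(arr, 95)),
    }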

Context Engineering vs. Prompt Engineering: The Relationship

Context engineering isn't a replacement for prompt engineering — it's the next layer up. According to a 2026 State of Context Management Report, 82% of IT and data leaders agree that prompt engineering alone is no longer sufficient to power AI at scale, and 95% of data teams plan to invest in context engineering training during 2026.

The relationship is hierarchical:

  • Prompt engineering optimizes the instructions — the system prompt, few-shot examples, and output formatting directives. It operates at write time and focuses on phrasing.
  • Context engineering optimizes the entire information environment — what documents are retrieved, which tools are loaded, how history is compressed, and how sub-agents share information. It operates at runtime and focuses on architecture.

A well-engineered prompt in a poorly-engineered context will underperform. A mediocre prompt in a well-engineered context will often succeed. The context sets the ceiling; the prompt determines how close you get to it.

Frequently Asked Questions

Does a larger context window eliminate the need for context engineering?

No — and this is probably the most common misconception out there. Research consistently shows that model performance degrades as context length increases, even within the advertised window. The Chroma study found that all 18 tested frontier models performed worse with more input. A 2-million-token window doesn't mean you should use 2 million tokens — it means you have more room to be strategic about what you include. Context engineering becomes more important with larger windows, not less, because the cost of wasted tokens scales proportionally.

When should I use compaction versus a multi-agent architecture?

Use compaction when the task is inherently sequential — one conversation thread that grows over time (customer support, iterative coding). Use multi-agent isolation when the task is parallel or multi-domain — different aspects of the work require fundamentally different context. If you find yourself compacting context that the agent will need again later, that's a signal to switch to isolation or external note-taking instead.

How does context engineering interact with RAG?

RAG is one technique within context engineering — specifically the "select" strategy. But context engineering goes well beyond RAG to include how you manage conversation history, tool definitions, memory, and the structural layout of the entire prompt. Many teams optimize their RAG pipeline extensively while neglecting tool output compression and conversation management, which (in my experience) often contribute more to context bloat in agentic workflows.

What is the biggest context engineering mistake in production?

The most common mistake is treating the context window as unlimited and appending everything — every tool output, every retrieved document, every conversation turn — without any curation. Teams usually discover this failure when their agent starts producing confident but incorrect answers after 15–20 tool calls.

The second most common mistake? Compacting too aggressively, losing subtle but critical context that the agent needs for later reasoning steps. The art is finding the right balance — and honestly, that balance is different for every agent.

Can I use context engineering with any LLM provider?

Yes. Context engineering is provider-agnostic — it operates at the application layer, above the LLM API. The same strategies (write, select, compress, isolate) work with OpenAI, Anthropic, Google, Mistral, and local models. The main provider-specific consideration is the context window size, which determines your token budget constraints and compaction thresholds. Use a framework like LangChain or LiteLLM to abstract provider differences so your context engineering logic doesn't need to change when you swap models.
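
For instance, with LiteLLM the call shape stays the same across providers, so only the model string and the window size feed back into your budget logic (model names here are illustrative):

from litellm import completion

# Provider differences collapse to a model string and a window size;
# the context pipeline itself stays unchanged.
PROVIDERS = {
    "anthropic": {"model": "anthropic/claude-3-5-sonnet-20241022", "window": 200_000},
    "openai": {"model": "openai/gpt-4o", "window": 128_000},
}

def ask(provider: str, messages: list[dict]) -> str:
    cfg = PROVIDERS[provider]
    # cfg["window"] feeds the budget and compaction logic from earlier sections
    response = completion(model=cfg["model"], messages=messages)
    return response.choices[0].message.content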
