Prompt Caching in Production: How to Cut Claude, OpenAI, and Gemini Costs by 90%

A production guide to prompt caching with Claude, OpenAI, and Gemini. Learn cache breakpoints, TTL strategy, prompt structure, and the seven mistakes that silently kill your cache hit rate.

If your monthly LLM bill has crept past five figures and you’re still shipping the same system prompt, tool definitions, or document context with every single request… you’re basically paying to reprocess identical tokens over and over. Prompt caching is hands-down the highest-leverage cost lever available in 2026 — it can slash input token costs by up to 90% and cut latency by around 85%. And yet, most teams either implement it incorrectly or miss it entirely.

So, let’s dive in. This guide covers production prompt caching with Claude (Anthropic), GPT (OpenAI), and Gemini (Google) as they stand today. You’ll learn exactly where to place cache breakpoints, how to structure prompts for maximum hit rate, how to actually monitor the thing in prod, and the seven most common mistakes that quietly wreck your cache ratio.

What Is Prompt Caching and How Does It Actually Work?

Prompt caching is a native API feature that stores the key-value (KV) attention state of your prompt prefix on the provider’s infrastructure. When a subsequent request arrives with an identical prefix, the model skips the prefill phase and reuses those cached activations. The result: dramatically lower cost and noticeably faster time-to-first-token.

Quick clarification, because this trips people up: prompt caching is not the same as semantic caching. Semantic caching matches requests by embedding similarity and returns a cached response. Prompt caching operates at the token level on the input side — the model still generates a fresh completion, it just doesn’t re-process the cached prefix tokens.

The Three Pricing Tiers

  • Cache write: Usually 1.25x base input price (Claude) or equal to base price (OpenAI) — the cost of creating the cache entry.
  • Cache read: 0.1x base input price (Claude); on OpenAI, a model-dependent discount of roughly 50–90% off base input — the cost of reading from an existing cache.
  • Cache miss: Standard input price — what you pay when the prefix doesn’t match.

With Claude Sonnet 4.6 at $3/MTok input, a cached read drops to $0.30/MTok, a clean 10x reduction. Over 100,000 daily requests with a 50K-token system prompt, that's the difference between roughly $15,000/day and $1,500/day in input cost. Honestly, once you run those numbers, it's kind of hard to justify not doing this.
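
To sanity-check numbers like these for your own workload, the arithmetic fits in a few lines. The rates below are the Sonnet figures quoted above, and the traffic numbers are this example's hypotheticals.

MTOK = 1_000_000
INPUT_PRICE = 3.00        # $/MTok, base input
CACHED_READ_PRICE = 0.30  # $/MTok, 0.1x base

requests_per_day = 100_000
prompt_tokens = 50_000
daily_tokens = requests_per_day * prompt_tokens  # 5B prefix tokens/day

uncached = daily_tokens / MTOK * INPUT_PRICE      # $15,000/day
cached = daily_tokens / MTOK * CACHED_READ_PRICE  # $1,500/day (ignores the occasional cache write)
print(f"${uncached:,.0f}/day uncached vs ${cached:,.0f}/day with cache reads")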

Claude Prompt Caching: The cache_control Parameter

Claude uses explicit cache breakpoints. You mark specific content blocks with cache_control, and Anthropic caches the prefix up to and including that block. You get up to four cache breakpoints per request, which lets you layer caches at different granularities (more on that in a sec).

Basic Example: Caching a Large System Prompt

import anthropic

client = anthropic.Anthropic()

LEGAL_CONTEXT = open("legal_corpus.txt").read()  # ~80,000 tokens

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a legal research assistant. Cite every claim with [SX]."
        },
        {
            "type": "text",
            "text": LEGAL_CONTEXT,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "Summarize the force majeure clause."}
    ]
)

print(response.usage)
# First call:  cache_creation_input_tokens=80234, cache_read_input_tokens=0
# Second call: cache_creation_input_tokens=0, cache_read_input_tokens=80234

The Five-Minute TTL and the 1-Hour Option

By default, Claude cache entries have a 5-minute TTL, refreshed on every hit. For workloads with longer gaps between requests, there’s an extended 1-hour cache option (2x write cost, but it persists):

{
    "type": "text",
    "text": LEGAL_CONTEXT,
    "cache_control": {"type": "ephemeral", "ttl": "1h"}
}

Rule of thumb I’ve settled on: pick the 1-hour TTL when batch jobs or user sessions have 5–60 minute gaps. Stick with the default 5-minute TTL for chatbots, agent loops, or high-frequency workloads where hits arrive every few seconds anyway.
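
If you'd rather encode that rule of thumb than remember it, a tiny helper does the job. The thresholds are the ones from the paragraph above, not anything Anthropic prescribes.

def pick_cache_ttl(expected_gap_seconds: float) -> str | None:
    """Map the expected gap between hits on a prefix to a cache_control ttl value."""
    if expected_gap_seconds < 300:
        return "5m"   # default TTL, refreshed on every hit
    if expected_gap_seconds <= 3600:
        return "1h"   # extended TTL: 2x write cost, but survives the gap
    return None       # gaps beyond an hour: the write likely never pays back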

Layering Cache Breakpoints

The four-breakpoint limit is a feature, not a limitation. Use it to cache different layers with different invalidation frequencies:

system=[
    # Breakpoint 1: Rarely changes (tool definitions)
    {"type": "text", "text": TOOL_SCHEMAS, "cache_control": {"type": "ephemeral"}},
    # Breakpoint 2: Per-tenant context (hours/days)
    {"type": "text", "text": tenant_policies, "cache_control": {"type": "ephemeral"}},
    # Breakpoint 3: Per-session user context (minutes)
    {"type": "text", "text": user_profile, "cache_control": {"type": "ephemeral"}},
    # Breakpoint 4: Conversation so far
    {"type": "text", "text": conversation_summary, "cache_control": {"type": "ephemeral"}},
]

When the conversation summary changes, only the fourth breakpoint invalidates — the first three still produce cache hits. This kind of layering is genuinely where the big wins come from.

OpenAI Prompt Caching: Automatic, With Caveats

OpenAI’s prompt caching is automatic for prompts longer than 1,024 tokens. There’s no cache_control parameter to set (which is nice, until it isn’t). The cache matches in 128-token increments, so your prefix must align identically up to the next 128-token boundary.

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-5",
    messages=[
        {"role": "system", "content": LARGE_SYSTEM_PROMPT},  # 5000 tokens
        {"role": "user", "content": user_query}
    ]
)

# Access cache usage
cached = response.usage.prompt_tokens_details.cached_tokens
total = response.usage.prompt_tokens
hit_rate = cached / total if total else 0
print(f"Cache hit rate: {hit_rate:.1%}")

The Critical Rule: Put Variable Content at the End

OpenAI caches prefixes. If you inject the user’s query or a timestamp at the top of the system prompt, you just destroyed the cache for every request. Ask me how I know. The correct structure:

from datetime import datetime

# WRONG: Timestamp in system prompt invalidates every cache
system = f"Current time: {datetime.now()}. You are a helpful assistant. {BIG_CONTEXT}"

# RIGHT: Static content first, dynamic content in user message
system = f"You are a helpful assistant. {BIG_CONTEXT}"
user = f"Current time: {datetime.now()}. {user_query}"

Cache Scoping by Tenant

OpenAI supports a prompt_cache_key parameter that influences cache routing: requests sharing a key are more likely to land on the same cache. Scoping it per tenant helps keep tenants with different prefixes from thrashing the same cache, and it tends to improve hit rates when you have many users with distinct prompt prefixes:

response = client.chat.completions.create(
    model="gpt-5",
    messages=messages,
    prompt_cache_key=f"tenant:{tenant_id}"
)

Gemini Context Caching: Explicit Cache Objects

Google Gemini takes a different approach: you create a named cache object via the API, then reference it in subsequent requests. This is ideal for very large contexts you plan to reuse many times — think entire codebases, long videos, multi-hour transcripts:

from google import genai
from google.genai.types import CreateCachedContentConfig

client = genai.Client()

cache = client.caches.create(
    model="gemini-2.5-pro",
    config=CreateCachedContentConfig(
        contents=[large_document],
        system_instruction="Answer questions from this document only.",
        ttl="3600s"
    )
)

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="What are the key findings in Chapter 4?",
    config={"cached_content": cache.name}
)
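
The cache object has its own lifecycle, and cached tokens typically accrue a per-hour storage charge while the cache exists, so it's worth managing explicitly. A minimal sketch against the same google-genai client, reusing the cache variable from above:

from google.genai.types import UpdateCachedContentConfig

# Extend the TTL while the document is still in active use
client.caches.update(
    name=cache.name,
    config=UpdateCachedContentConfig(ttl="7200s"),
)

# Delete the cache as soon as you're done to stop storage charges
client.caches.delete(name=cache.name)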

Structuring Prompts for Maximum Cache Hit Rate

The cardinal rule across all providers: static content first, dynamic content last. Here's the canonical ordering from most-static to most-dynamic (a code sketch of the assembly follows the list):

  1. Role and persona instructions
  2. Tool and function schemas
  3. Few-shot examples
  4. Large reference documents (RAG context, knowledge base chunks)
  5. Conversation history
  6. Current user query
  7. Timestamps, request IDs, session data
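
Here's a minimal sketch of that ordering in code. The constants and helper are illustrative (not from any SDK); the point is that everything above the newest user turn is byte-for-byte identical across requests.

# Illustrative placeholders; in a real service these load once at startup
PERSONA = "You are a support assistant for Acme Corp."
TOOL_SCHEMAS_JSON = '{"tools": []}'
FEW_SHOT_EXAMPLES = "Example: ..."

def assemble_prompt(user_query, history, retrieved_docs, request_id):
    # Static-to-dynamic: persona, tools, and examples never change between
    # requests, so the cached prefix stays intact
    system = "\n\n".join([PERSONA, TOOL_SCHEMAS_JSON, FEW_SHOT_EXAMPLES, retrieved_docs])

    # Everything volatile rides at the very end, after the cacheable prefix
    messages = history + [
        {"role": "user", "content": f"[request {request_id}] {user_query}"}
    ]
    return system, messages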

Pitfall: Changing Whitespace or Token Order

Cache matching is an exact prefix match at the token level. A single extra space, a different punctuation mark, even a reordered JSON key in a tool schema will miss the cache. Normalize your prompt assembly through a single function and treat it as immutable once deployed:

# Illustrative template; the point is one canonical, deterministic assembly path
SYSTEM_TEMPLATE = "You are a support assistant.\n\n<context>\n{context}\n</context>"

def build_system_prompt(context: str) -> str:
    # Same template, same whitespace, same ordering on every request
    return SYSTEM_TEMPLATE.format(context=context.strip())

Monitoring Cache Hit Rate in Production

Log cache metrics on every request and alert when hit rate drops below your baseline. A sudden drop almost always means a code change invalidated the prefix — it’s one of those bugs that’s invisible until you see the bill. Here’s a production-ready wrapper I’ve used across a few projects:

import logging
from dataclasses import dataclass
from prometheus_client import Counter, Histogram

cache_tokens = Counter("llm_cache_tokens_total", "Cached tokens", ["provider", "type"])
request_cost = Histogram("llm_request_cost_usd", "Request cost", ["provider"])

@dataclass
class CacheMetrics:
    cache_read: int
    cache_write: int
    uncached_input: int
    output: int

def track_claude_usage(usage, model: str) -> CacheMetrics:
    m = CacheMetrics(
        cache_read=usage.cache_read_input_tokens or 0,
        cache_write=usage.cache_creation_input_tokens or 0,
        uncached_input=usage.input_tokens,
        output=usage.output_tokens,
    )
    cache_tokens.labels("claude", "read").inc(m.cache_read)
    cache_tokens.labels("claude", "write").inc(m.cache_write)

    # Cost in USD (Sonnet 4.6 rates)
    cost = (
        m.cache_read * 0.30 / 1_000_000
        + m.cache_write * 3.75 / 1_000_000
        + m.uncached_input * 3.00 / 1_000_000
        + m.output * 15.00 / 1_000_000
    )
    request_cost.labels("claude").observe(cost)

    total_input = m.cache_read + m.uncached_input
    hit_rate = m.cache_read / total_input if total_input else 0
    if hit_rate < 0.50 and total_input > 5000:
        logging.warning(f"Low cache hit rate: {hit_rate:.1%} on {model}")
    return m

The Seven Most Common Prompt Caching Mistakes

  1. Dynamic content at the prompt top — timestamps, request IDs, or rotating greetings placed in the system prompt invalidate every cache hit. Classic rookie move (I’ve done it).
  2. Unsorted JSON tool schemas — Python dicts are insertion-ordered, but different serialization paths can produce different byte output. Always json.dumps(obj, sort_keys=True); see the sketch after this list.
  3. Chunked RAG context in random order — if you sort retrieved chunks by score on every request, minor score shifts reorder chunks and kill the cache. Pin to a stable ordering (doc ID) within the same session.
  4. Ignoring the 1,024-token minimum (OpenAI) — prompts under the threshold are never cached, so small prompts get zero benefit.
  5. Not refreshing before TTL — if traffic pauses for 6 minutes, the 5-minute Claude cache expires and the next request pays full write cost all over again. Use the 1-hour TTL or send a keepalive.
  6. Caching volatile data — caching a “top 10 trending” list that changes every minute just wastes money on write costs that never amortize.
  7. Cross-tenant prompt prefixes without scoping — tenants with different data mixed into a shared prefix cause cache thrashing; scope with prompt_cache_key or tenant-specific breakpoints.
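
Mistakes #2 and #3 share the same mechanical fix: serialize and order deterministically. A minimal sketch (the function names are mine):

import json

def stable_tool_schema(tools: list[dict]) -> str:
    # sort_keys plus fixed separators gives byte-identical output on every request
    return json.dumps(tools, sort_keys=True, separators=(",", ":"))

def stable_chunk_order(chunks: list[dict]) -> list[dict]:
    # Order by doc_id, not retrieval score, so minor score shifts between
    # requests don't reshuffle the cached prefix
    return sorted(chunks, key=lambda c: c["doc_id"])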

When Prompt Caching Does NOT Pay Off

Caching has a break-even point, but it's lower than most people assume. For Claude, a cache write costs 1.25x standard input and a cache read costs 0.1x, so a single cache hit already recoups the write premium: 1.25 + 0.1 = 1.35x versus 2.0x for two uncached requests. Caching only loses money when a prefix is written and never read again within its TTL (the arithmetic after this list makes that concrete). Workloads where caching hurts:

  • One-shot requests with unique context per call (document Q&A where every doc is different)
  • Prompts below the minimum cacheable length (1,024 tokens for OpenAI and for Claude Sonnet/Opus models; Claude Haiku models require 2,048)
  • Workloads with fewer than 2 requests per TTL window on the same prefix
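
The break-even arithmetic in Claude's multipliers (1.25x write, 0.1x read), expressed in units of one uncached pass over the prefix:

WRITE_MULT, READ_MULT = 1.25, 0.10

def relative_input_cost(requests_on_prefix: int, cached: bool) -> float:
    """Input cost in units of 'one uncached pass over the prefix'."""
    if not cached:
        return float(requests_on_prefix)
    # First request writes the cache; every later one reads it
    return WRITE_MULT + (requests_on_prefix - 1) * READ_MULT

for n in (1, 2, 5):
    print(n, relative_input_cost(n, False), relative_input_cost(n, True))
# 1 -> 1.0 vs 1.25  (never reused: caching loses)
# 2 -> 2.0 vs 1.35  (a single hit already beats the write premium)
# 5 -> 5.0 vs 1.65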

Putting It Together: A Production RAG Example

Here’s a realistic structure for a customer-support RAG agent with layered caching:

def build_support_request(tenant_id, session_id, user_query, retrieved_chunks):
    # Pin chunk order by doc_id to keep cache stable
    retrieved_chunks.sort(key=lambda c: c["doc_id"])
    context = "\n\n".join(c["text"] for c in retrieved_chunks)

    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 1024,
        "system": [
            # Layer 1: Static org-wide instructions (invalidates monthly)
            {"type": "text", "text": SYSTEM_INSTRUCTIONS,
             "cache_control": {"type": "ephemeral", "ttl": "1h"}},
            # Layer 2: Tool schemas (invalidates on deploy)
            {"type": "text", "text": TOOL_SCHEMAS_JSON,
             "cache_control": {"type": "ephemeral", "ttl": "1h"}},
            # Layer 3: Tenant KB snapshot (invalidates nightly)
            {"type": "text", "text": f"tenant={tenant_id}\n{context}",
             "cache_control": {"type": "ephemeral", "ttl": "1h"}},
        ],
        "messages": [
            {"role": "user", "content": user_query}
        ]
    }

With 10,000 support queries per day and an average 30,000-token context, moving from no caching to this layered structure takes input token cost from roughly $900/day down to about $95/day. That's roughly a 9.5x reduction, enough to cover an engineer's time several times over.

Frequently Asked Questions

How long does prompt caching last?

Claude caches for 5 minutes by default, refreshed on each read, or 1 hour with the extended TTL option. OpenAI caches are automatic and are typically cleared after 5–10 minutes of inactivity, always within about an hour of last use. Gemini cache TTL is explicitly set when you create the cache object, anywhere from seconds to hours.

Does prompt caching affect model output quality?

Nope. Caching stores only the KV attention state of the input; the model still generates output fresh for each request. The output is sampled exactly as it would be without caching, so only cost and latency change.

What’s the difference between prompt caching and semantic caching?

Prompt caching is provider-side, token-level, and caches the input prefix so the model skips prefill. Semantic caching is application-side, embedding-level, and returns a previously generated response when a similar query arrives. They solve different problems, and they actually stack well together.

Can I use prompt caching with streaming responses?

Yes. Both Claude and OpenAI support prompt caching with streaming. For Claude streams, cache usage appears in the usage block of the message_start event (output token counts follow in message_delta); for OpenAI streams, it arrives in the final usage chunk when you set stream_options={"include_usage": True}.
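
A minimal sketch with the Anthropic Python SDK's streaming helper, reusing the client and LEGAL_CONTEXT from the earlier example:

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[{"type": "text", "text": LEGAL_CONTEXT,
             "cache_control": {"type": "ephemeral"}}],
    messages=[{"role": "user", "content": "Summarize section 3."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
    usage = stream.get_final_message().usage
    print(usage.cache_read_input_tokens, usage.cache_creation_input_tokens)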

How do I measure if prompt caching is saving me money?

Track three metrics per request: cache_read_input_tokens, cache_creation_input_tokens, and uncached input tokens. Multiply each by its respective price and sum. Compare that against the hypothetical cost of all tokens at the standard rate. A healthy production workload should show 70–95% of input tokens served from cache once steady state is reached.
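
One way to make that comparison concrete, using the Sonnet rates from earlier (swap in your own model's prices):

def input_savings_usd(cache_read: int, cache_write: int, uncached: int,
                      base=3.00, read=0.30, write=3.75) -> float:
    """Dollars saved on input tokens versus paying the standard rate for everything."""
    actual = (cache_read * read + cache_write * write + uncached * base) / 1_000_000
    baseline = (cache_read + cache_write + uncached) * base / 1_000_000
    return baseline - actual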

Does prompt caching work across different models or API endpoints?

No — and this one bites people during upgrades. Cache entries are scoped to a specific model version. Switching from claude-sonnet-4-6 to claude-opus-4-7 will miss all existing cache entries. Plan model upgrades during low-traffic windows and expect a temporary cost spike while caches rebuild.
