LLM Cost Optimization: Semantic Caching, Model Routing, and Token Management

A practical guide to cutting LLM API costs by 60-88% using four optimization layers (semantic caching with Redis, intelligent model routing, provider-level prompt caching, and batch processing), with production code examples throughout.

The Hidden Tax on AI: Why Most Teams Overpay for LLM Inference

Running large language models in production is expensive — and honestly, it keeps getting worse as applications scale. A recent industry survey found that 53% of AI teams experience costs exceeding their forecasts by 40% or more during scaling. The culprits? Inefficient infrastructure and unmonitored token consumption. For medium-sized applications, inference demands from complex prompts can drive daily expenses to between $3,000 and $6,000 for every 10,000 user sessions.

But here's the thing — most of that spend is unnecessary.

Production teams that systematically apply cost optimization techniques (semantic caching, intelligent model routing, prompt compression, batch processing, and provider-level caching) routinely achieve 60–80% cost reduction without sacrificing output quality. Some workloads see savings north of 90%, which still surprises me every time I see the numbers.

This guide is the practical blueprint. We'll walk through every major optimization lever available in 2026, with real code, architecture patterns, and concrete numbers. Whether you're spending $500 or $50,000 a month on LLM APIs, you'll find techniques here that pay for themselves immediately.

Understanding the Cost Anatomy of an LLM API Call

Before you start optimizing, you need to understand what you're actually paying for. Every LLM API call has a cost structure with several components, and each one is a potential optimization target.

Input Tokens vs. Output Tokens

Here's the most critical pricing detail that many teams miss: output tokens cost 3–10x more than input tokens. For Claude Sonnet 4.5, input tokens cost $3 per million while output tokens cost $15 per million — a 5x multiplier. For GPT-4o, the ratio is $2.50 input vs. $10 output — a 4x multiplier.

This asymmetry means that optimizing output length is often more impactful than optimizing input length. A system prompt that says "Be concise. Respond in 2-3 sentences unless the user explicitly asks for detail" can reduce output tokens by 40-60% with minimal quality impact for many use cases. That's a surprisingly easy win.

The Four Optimization Layers

Cost optimization happens across four layers, each compounding on the others:

  1. Caching Layer — Avoid making API calls entirely when possible
  2. Routing Layer — Send each request to the cheapest model that meets the quality bar
  3. Prompt Layer — Minimize tokens sent and received per request
  4. Batch Layer — Use asynchronous processing for non-real-time workloads

Applied together, these layers create compound savings. A request that hits the semantic cache costs zero. A request routed to a smaller model costs 10-50x less. A request with a compressed prompt costs 30-50% less. A request batched asynchronously costs 50% less. Stack them all up, and the math gets very compelling.
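To make the compounding concrete, here's a back-of-the-envelope sketch. All the rates below are hypothetical round numbers chosen for the arithmetic, not quotes from any provider's price list:

```python
# Illustrative compounding of the four layers on 100,000 requests/day.
# Every rate here is a made-up round number for demonstration purposes.
baseline_cost_per_request = 0.0045  # $ per request, no optimization

cache_hit_rate = 0.65       # Layer 1: 65% of requests never reach the LLM
routing_multiplier = 0.40   # Layer 2: cheaper models cut the average cost 60%
prompt_multiplier = 0.70    # Layer 3: compressed prompts cut tokens 30%
batch_share, batch_discount = 0.30, 0.50  # Layer 4: 30% of traffic batched at 50% off

requests = 100_000
baseline = requests * baseline_cost_per_request

# Layers 2-4 multiply per-request; layer 1 removes requests entirely
optimized_per_request = (
    baseline_cost_per_request
    * routing_multiplier
    * prompt_multiplier
    * (1 - batch_share * batch_discount)
)
optimized = requests * (1 - cache_hit_rate) * optimized_per_request

print(f"Baseline:  ${baseline:,.0f}/day")
print(f"Optimized: ${optimized:,.2f}/day")
print(f"Reduction: {1 - optimized / baseline:.0%}")
```

The exact percentages will differ for your workload, but the multiplicative structure is the point: each layer applies on top of what the previous layers left behind.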

Layer 1: Semantic Caching — Eliminating Redundant Calls

The highest-impact optimization is also the simplest conceptually: don't call the LLM if you've already answered a semantically equivalent question.

Traditional exact-match caching misses most opportunities because users ask the same thing in different ways. "What's your return policy?", "How do I return something?", and "Can I send this back?" are all the same question — but they share zero words in common. I've seen teams dismiss caching entirely because their exact-match cache hit rate was under 5%, not realizing that semantic caching would have caught 60%+ of those queries.

Semantic caching solves this by comparing the meaning of queries using vector embeddings. When a new query arrives, the system embeds it, searches for similar cached queries, and returns the cached response if the similarity score exceeds a threshold — all without touching the LLM.

How Semantic Caching Works

The architecture is straightforward:

  1. User query arrives
  2. Generate an embedding vector for the query
  3. Search the vector store for similar cached queries (cosine similarity > threshold)
  4. If a match is found, return the cached response (cache hit — cost: near zero)
  5. If no match, forward to the LLM, cache the query-response pair, return the response

Cache hit rates range from 60–85% in workloads with strong semantic repetition (customer support, documentation queries, FAQ bots), reducing API calls by up to 68.8%. For cached responses, latency drops from approximately 1.67 seconds to 0.052 seconds — a 96.9% reduction. Your users notice that kind of speed improvement.

Building a Semantic Cache with Redis

Redis has emerged as the dominant infrastructure for semantic caching in 2026, with its RedisVL library providing a purpose-built SemanticCache class. Here's a production-ready implementation:

from redisvl.extensions.llmcache import SemanticCache
from openai import OpenAI

# Initialize the semantic cache with Redis
cache = SemanticCache(
    name="llm_cache",
    redis_url="redis://localhost:6379",
    distance_threshold=0.15,  # Lower = stricter matching
    ttl=3600,  # Cache entries expire after 1 hour
)

client = OpenAI()

def cached_llm_call(prompt: str, system_prompt: str = "", model: str = "gpt-4o") -> str:
    """Make an LLM call with semantic caching."""
    
    # Check the semantic cache first
    cached = cache.check(prompt=prompt)
    if cached:
        print(f"Cache HIT — saved ~${estimate_cost(prompt, cached[0]['response'], model):.4f}")
        return cached[0]["response"]
    
    # Cache miss — call the LLM
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": prompt})
    
    response = client.chat.completions.create(
        model=model,
        messages=messages,
    )
    
    result = response.choices[0].message.content
    
    # Store in cache for future hits
    cache.store(prompt=prompt, response=result)
    
    return result


def estimate_cost(prompt: str, response: str, model: str) -> float:
    """Rough cost estimate based on character count."""
    input_tokens = len(prompt) / 4
    output_tokens = len(response) / 4
    
    rates = {
        "gpt-4o": (2.50, 10.00),
        "gpt-4o-mini": (0.15, 0.60),
        "claude-sonnet-4-5": (3.00, 15.00),
    }
    input_rate, output_rate = rates.get(model, (2.50, 10.00))
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

Redis LangCache: Fully Managed Semantic Caching

For teams that don't want to manage their own caching infrastructure (and let's be honest, that's most teams), Redis launched LangCache in late 2025 — a fully managed semantic caching service available on Redis Cloud. LangCache sits between your application and any LLM provider, automatically generating embeddings and searching for matching cached responses. Reported results include up to 15x faster responses and up to 70% reduction in token usage and API bills.

Cache Invalidation Strategies

Ah, cache invalidation — the classic computer science headache. Your strategy here depends on the nature of your data:

  • TTL-based expiry: Set a time-to-live appropriate for your domain. Product prices might need 1-hour TTL; general knowledge can cache for days.
  • Event-driven invalidation: When underlying data changes (product catalog update, policy change), invalidate related cache entries. This takes more work to set up but prevents stale answers.
  • Session isolation: In multi-tenant applications, ensure User A's cached responses don't leak to User B. Namespace your cache keys by user ID or tenant ID. This one is non-negotiable for any serious production system.
  • Confidence thresholds: Only cache responses above a quality threshold. If your LLM includes a confidence score, skip caching for low-confidence answers.
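The TTL, event-driven, and namespacing strategies above can be sketched together. This is a toy in-memory stand-in (exact-match, not semantic) just to show the key structure; in production the same ideas apply to your Redis-backed cache, e.g. one cache namespace per tenant plus a per-entry TTL:

```python
import time

class NamespacedTTLCache:
    """Toy in-memory cache illustrating TTL expiry, tenant isolation,
    and event-driven invalidation. Not a semantic cache — the lookup is
    exact-match — but the invalidation patterns carry over directly."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}  # (tenant_id, prompt) -> (response, stored_at)

    def get(self, tenant_id, prompt):
        entry = self._store.get((tenant_id, prompt))
        if entry is None:
            return None
        response, stored_at = entry
        if time.monotonic() - stored_at > self.ttl:  # TTL-based expiry
            del self._store[(tenant_id, prompt)]
            return None
        return response

    def set(self, tenant_id, prompt, response):
        # Keys are namespaced by tenant, so one user's cached answers
        # can never be served to another.
        self._store[(tenant_id, prompt)] = (response, time.monotonic())

    def invalidate_tenant(self, tenant_id):
        # Event-driven invalidation: drop a tenant's entries when their
        # underlying data (catalog, policy, etc.) changes.
        self._store = {k: v for k, v in self._store.items() if k[0] != tenant_id}

cache = NamespacedTTLCache(ttl_seconds=3600)
cache.set("tenant-a", "return policy?", "30 days, free returns.")
print(cache.get("tenant-a", "return policy?"))  # cache hit
print(cache.get("tenant-b", "return policy?"))  # None (tenant isolation)
cache.invalidate_tenant("tenant-a")
print(cache.get("tenant-a", "return policy?"))  # None (invalidated)
```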

Layer 2: Intelligent Model Routing — Right-Sizing Every Request

Not every query needs a frontier model. The question "What time does the store close?" doesn't require Claude Opus — Claude Haiku handles it perfectly at 50x lower cost. Intelligent model routing selects the cheapest model that meets the quality bar for each request, and in practice, this single technique can cut costs by 50-87%.

Routing vs. Cascading vs. Cascade Routing

Research from ETH Zurich (published at ICLR 2025) formalized three approaches to model selection:

  • Routing: A classifier predicts which single model is best for each query and sends it directly. Fast, single-hop, but requires a good classifier.
  • Cascading: Start with the cheapest model. If its response doesn't meet a quality threshold, escalate to the next model. Lower cost on average, but higher latency for hard queries.
  • Cascade Routing (unified): Combines both paradigms — route easy queries directly to small models, cascade uncertain queries through increasingly capable models. This approach consistently outperforms either strategy alone.

Google Research extended this further with Speculative Cascades, merging cascading with speculative decoding for even better speed-cost tradeoffs. The field is moving fast here.

Building a Practical Model Router

Here's a production-ready model router that classifies query complexity and routes accordingly. It's deliberately simple — you can always add more sophistication later, but I've found that even basic heuristics capture most of the value:

import re
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Complexity(Enum):
    SIMPLE = "simple"
    MODERATE = "moderate"
    COMPLEX = "complex"

@dataclass
class ModelConfig:
    name: str
    cost_per_1k_input: float
    cost_per_1k_output: float
    max_tokens: int

# Define your model tiers
MODELS = {
    Complexity.SIMPLE: ModelConfig("claude-haiku-4-5", 0.001, 0.005, 8192),
    Complexity.MODERATE: ModelConfig("claude-sonnet-4-5", 0.003, 0.015, 8192),
    Complexity.COMPLEX: ModelConfig("claude-opus-4-6", 0.015, 0.075, 8192),
}

# Complexity signals — extend these with your domain knowledge
COMPLEX_SIGNALS = [
    r"(?i)(analyze|compare|evaluate|synthesize|design|architect)",
    r"(?i)(step.by.step|detailed|comprehensive|in.depth)",
    r"(?i)(code review|debug|refactor|optimize)",
    r"(?i)(pros and cons|trade.?offs|advantages|disadvantages)",
]

SIMPLE_SIGNALS = [
    r"(?i)^(what is|who is|when did|where is|how many)",
    r"(?i)(yes or no|true or false)",
    r"(?i)(define|list|name|spell)",
    r"(?i)(translate|convert|format)",
]


def classify_complexity(query: str, context_tokens: int = 0) -> Complexity:
    """Classify query complexity based on linguistic signals and context size."""
    
    # Large context windows suggest complex tasks
    if context_tokens > 50_000:
        return Complexity.COMPLEX
    
    # Check for complexity signals
    complex_score = sum(1 for p in COMPLEX_SIGNALS if re.search(p, query))
    simple_score = sum(1 for p in SIMPLE_SIGNALS if re.search(p, query))
    
    # Multi-part questions are usually more complex
    question_marks = query.count("?")
    if question_marks > 2:
        complex_score += 2
    
    # Long queries tend to be more complex
    word_count = len(query.split())
    if word_count > 100:
        complex_score += 1
    elif word_count < 20:
        simple_score += 1
    
    if complex_score > simple_score:
        return Complexity.COMPLEX
    elif simple_score > complex_score:
        return Complexity.SIMPLE
    else:
        return Complexity.MODERATE


def route_request(query: str, context_tokens: int = 0) -> ModelConfig:
    """Route a request to the most cost-effective model."""
    complexity = classify_complexity(query, context_tokens)
    model = MODELS[complexity]
    print(f"Routing to {model.name} (complexity: {complexity.value})")
    return model

LiteLLM: The Production Gateway

For production deployments, LiteLLM has become the standard LLM gateway, providing a unified interface to 100+ models with built-in routing, load balancing, fallbacks, and cost tracking. It delivers 8ms P95 latency at 1,000 requests per second — which is more than enough for most production workloads.

Here's a LiteLLM configuration that implements tiered routing with automatic fallbacks:

# litellm_config.yaml
model_list:
  # Primary tier — fast and cheap
  - model_name: fast-tier
    litellm_params:
      model: anthropic/claude-haiku-4-5
      api_key: os.environ/ANTHROPIC_API_KEY
      rpm: 1000
      tpm: 100000
      order: 1

  # Standard tier — balanced
  - model_name: standard-tier
    litellm_params:
      model: anthropic/claude-sonnet-4-5
      api_key: os.environ/ANTHROPIC_API_KEY
      rpm: 500
      tpm: 200000
      order: 1

  # Premium tier — maximum capability
  - model_name: premium-tier
    litellm_params:
      model: anthropic/claude-opus-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
      rpm: 100
      tpm: 200000
      order: 1

  # Fallback to OpenAI if Anthropic is down
  - model_name: standard-tier
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      order: 2

litellm_settings:
  # Automatic fallbacks on failure
  fallbacks:
    - premium-tier: ["standard-tier"]
    - standard-tier: ["fast-tier"]
  
  # Handle context window overflow
  context_window_fallbacks:
    - fast-tier: ["standard-tier"]
  
  # Retry policy per error type
  retry_policy:
    RateLimitErrorRetries: 3
    TimeoutErrorRetries: 2
    InternalServerErrorRetries: 2

router_settings:
  routing_strategy: "usage-based-routing"
  enable_pre_call_checks: true
  redis_host: "localhost"
  redis_port: 6379

This configuration gives you provider-level redundancy (Anthropic primary, OpenAI fallback), automatic escalation when smaller models hit context limits, usage-based routing to spread load across deployments, and Redis-backed state sharing for multi-instance setups.

Layer 3: Prompt Optimization — Doing More with Fewer Tokens

Every token costs money. Prompt optimization reduces the tokens you send and receive without degrading output quality. The techniques range from simple text hygiene to sophisticated provider-level caching.

Provider-Level Prompt Caching

Both Anthropic and OpenAI offer prompt caching at the API level — and this is a fundamentally different mechanism from the semantic caching we discussed earlier. Provider-level prompt caching reuses the computational work of processing the same prompt prefix across requests, rather than reusing responses.

Anthropic's implementation is particularly powerful. When you mark portions of your prompt with cache_control, subsequent requests with the same prefix pay only 10% of the normal input token price for cached portions — a 90% discount. The cache write costs 25% extra (or 100% extra for 1-hour TTL), but you break even after just 2 requests. After that, it's pure savings.

import anthropic

client = anthropic.Anthropic()

# The system prompt and product catalog are cached — only paid in full once
response = client.messages.create(
    model="claude-sonnet-4-5-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a helpful customer support agent for Acme Corp...",
        },
        {
            "type": "text",
            "text": PRODUCT_CATALOG,  # 50,000 tokens of product data
            "cache_control": {"type": "ephemeral"},  # 5-minute TTL
        },
    ],
    messages=[
        {"role": "user", "content": "What's the price of Widget X?"}
    ],
)

# Check cache performance in the response
usage = response.usage
print(f"Input tokens: {usage.input_tokens}")
print(f"Cache creation tokens: {usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {usage.cache_read_input_tokens}")

# First call:  cache_creation = 50,000 tokens (1.25x cost)
# Second call: cache_read = 50,000 tokens (0.1x cost) — 90% savings

For a system processing 1,000 queries per hour against a 50,000-token product catalog, the savings are dramatic. Without caching, you pay for 50 million input tokens per hour. With prompt caching, you pay full price once and 10% thereafter — reducing input costs by roughly 89.9%.
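The arithmetic behind that 89.9% figure, using the multipliers described above (1.25x for the cache write, 0.1x for cache reads, $3/million for Sonnet input):

```python
# Hourly input-token cost for 1,000 queries over a 50,000-token cached prefix.
prefix_tokens = 50_000
queries = 1_000
input_rate = 3.00 / 1_000_000  # $/token, Claude Sonnet 4.5 input pricing

# Without caching: every query pays full price for the prefix
without_cache = queries * prefix_tokens * input_rate

# With caching: one write at 1.25x, then 999 reads at 0.1x
with_cache = (prefix_tokens * 1.25 + (queries - 1) * prefix_tokens * 0.10) * input_rate

print(f"Without caching: ${without_cache:,.2f}/hour")
print(f"With caching:    ${with_cache:,.2f}/hour")
print(f"Savings:         {1 - with_cache / without_cache:.1%}")
```

Note the savings approach the full 90% as query volume grows, because the one-time 1.25x write cost gets amortized across more and more 0.1x reads.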

Token Reduction Techniques

Beyond provider caching, several techniques reduce raw token count:

  • Prompt compression: Remove filler words, redundant instructions, and verbose examples. "Please provide a detailed and comprehensive analysis of the following text, making sure to cover all important aspects" becomes "Analyze this text. Cover key aspects." — same output quality, 70% fewer input tokens. Every word you cut gets multiplied across thousands of requests.
  • Structured output constraints: When you only need specific fields, use JSON mode or structured outputs to prevent the model from generating verbose prose. A classification task that returns {"category": "billing", "confidence": 0.95} uses far fewer output tokens than a paragraph explanation.
  • Context window management: In multi-turn conversations, summarize older turns rather than carrying the full history. Keep the last 3-5 turns verbatim and summarize everything before that into a concise context paragraph.
  • System prompt optimization: Your system prompt is sent with every request. Every word in it gets multiplied by your total request volume. A 500-token system prompt at 10,000 requests/day costs 5 million tokens/day just in system prompt overhead. Compress it ruthlessly — this is one place where brevity really pays off.
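The context window management technique above can be sketched in a few lines. The `trim_history` helper here is hypothetical, and the `summarize` callable is a stand-in; in practice you'd call a cheap model to produce the summary:

```python
def trim_history(messages, keep_last=4,
                 summarize=lambda msgs: "(earlier conversation summarized)"):
    """Keep the last `keep_last` turns verbatim; collapse everything older
    into a single summary message. `summarize` is a placeholder for a
    cheap-model call that condenses the older turns."""
    if len(messages) <= keep_last:
        return messages
    older, recent = messages[:-keep_last], messages[-keep_last:]
    summary = {"role": "system", "content": f"Context so far: {summarize(older)}"}
    return [summary] + recent

# A 10-turn conversation shrinks to one summary plus the 4 most recent turns
history = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
trimmed = trim_history(history, keep_last=4)
print(len(trimmed))  # 5
```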

Output Token Management

Remember: output tokens cost 3-10x more than input tokens. So controlling output length is disproportionately valuable:

# Anti-pattern: unbounded output
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this article"}],
    max_tokens=4096,  # Model may generate up to 4K tokens
)

# Better: constrained output with explicit length guidance
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "Respond concisely. Use bullet points. Max 3 sentences per point.",
        },
        {"role": "user", "content": "Summarize this article"},
    ],
    max_tokens=512,  # Hard cap at 512 tokens
)

Layer 4: Batch Processing — The 50% Discount You're Probably Not Using

Both OpenAI and Anthropic offer 50% discounts on all tokens processed through their batch APIs. The tradeoff is that results are delivered asynchronously within a 24-hour window (though most batches complete in under an hour). For any workload that doesn't require real-time responses, this is basically free money sitting on the table.

When to Use Batch Processing

Batch processing is ideal for:

  • Content generation pipelines: Blog posts, product descriptions, email campaigns
  • Data processing: Classification, extraction, summarization of large datasets
  • Evaluation pipelines: Running LLM-as-judge evaluations across test sets
  • Embedding generation: Vectorizing large document collections
  • Nightly reports: Generating daily analytics summaries, digest emails

If you're doing any of these in real-time right now, you're leaving money on the table.

Anthropic Message Batches

Anthropic's Message Batches API accepts up to 10,000 requests per batch. Here's how to use it:

import anthropic
import time

client = anthropic.Anthropic()

# Prepare batch requests
requests = []
for i, document in enumerate(documents):
    requests.append({
        "custom_id": f"doc-{i}",
        "params": {
            "model": "claude-sonnet-4-5-20250514",
            "max_tokens": 256,
            "messages": [
                {
                    "role": "user",
                    "content": f"Classify this support ticket into one of: "
                               f"billing, technical, account, other.\n\n{document}",
                }
            ],
        },
    })

# Submit the batch
batch = client.messages.batches.create(requests=requests)
print(f"Batch {batch.id} submitted with {len(requests)} requests")

# Poll for completion
while True:
    status = client.messages.batches.retrieve(batch.id)
    if status.processing_status == "ended":
        break
    print(f"Status: {status.processing_status} "
          f"({status.request_counts.succeeded} succeeded, "
          f"{status.request_counts.processing} still processing)")
    time.sleep(30)

# Retrieve results
results = []
for result in client.messages.batches.results(batch.id):
    results.append({
        "id": result.custom_id,
        "classification": result.result.message.content[0].text,
    })

print(f"Processed {len(results)} documents at 50% discount")

Here's something a lot of people overlook: the batch discount stacks with prompt caching. If your batch requests share a common prefix (system prompt + few-shot examples), you get the 50% batch discount and the 90% cache read discount on the cached portion — a combined savings of up to 95% compared to standard real-time pricing. That's not a typo.
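Under the assumption that the two discounts simply multiply (which is what the stacked figure above implies), the arithmetic on the cached portion looks like this:

```python
# Effective price multiplier on a cached prompt prefix sent via the batch API,
# assuming the 50% batch discount and the 90% cache-read discount compose.
batch_multiplier = 0.50       # 50% batch discount
cache_read_multiplier = 0.10  # cache reads cost 10% of the input rate

effective = batch_multiplier * cache_read_multiplier
print(f"Cached prefix in a batch costs {effective:.0%} of list price "
      f"({1 - effective:.0%} savings)")
```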

Putting It All Together: The Cost-Optimized LLM Pipeline

So, let's see how all four layers combine into a production pipeline. Each layer reduces load and cost for the layers beneath it:

import anthropic
from redisvl.extensions.llmcache import SemanticCache
from dataclasses import dataclass
from typing import Optional

@dataclass
class LLMResponse:
    content: str
    model_used: str
    cached: bool
    estimated_cost: float

class CostOptimizedPipeline:
    """Production LLM pipeline with four optimization layers."""
    
    def __init__(self):
        self.client = anthropic.Anthropic()
        self.cache = SemanticCache(
            name="production_cache",
            redis_url="redis://localhost:6379",
            distance_threshold=0.15,
            ttl=3600,
        )
        self.request_count = 0
        self.cache_hits = 0
        self.total_savings = 0.0
    
    def query(
        self,
        prompt: str,
        system_prompt: str = "",
        require_premium: bool = False,
        context_tokens: int = 0,
    ) -> LLMResponse:
        """Process a query through the full optimization pipeline."""
        
        # Layer 1: Semantic Cache
        cached = self.cache.check(prompt=prompt)
        if cached:
            self.cache_hits += 1
            return LLMResponse(
                content=cached[0]["response"],
                model_used="cache",
                cached=True,
                estimated_cost=0.0001,  # Embedding cost only
            )
        
        # Layer 2: Model Routing
        if require_premium:
            model = "claude-opus-4-6"
        else:
            complexity = classify_complexity(prompt, context_tokens)
            model_config = MODELS[complexity]
            model = model_config.name
        
        # Layer 3: Prompt Optimization (use provider caching for system prompt)
        system_messages = []
        if system_prompt:
            system_messages = [
                {
                    "type": "text",
                    "text": system_prompt,
                    "cache_control": {"type": "ephemeral"},
                }
            ]
        
        # Make the API call
        response = self.client.messages.create(
            model=model,
            max_tokens=1024,
            system=system_messages,
            messages=[{"role": "user", "content": prompt}],
        )
        
        result = response.content[0].text
        
        # Cache the response for future queries
        self.cache.store(prompt=prompt, response=result)
        
        self.request_count += 1
        
        return LLMResponse(
            content=result,
            model_used=model,
            cached=False,
            estimated_cost=self._estimate_cost(response.usage, model),
        )
    
    def _estimate_cost(self, usage, model: str) -> float:
        rates = {
            "claude-haiku-4-5": (0.001, 0.005),
            "claude-sonnet-4-5": (0.003, 0.015),
            "claude-opus-4-6": (0.015, 0.075),
        }
        input_rate, output_rate = rates.get(model, (0.003, 0.015))
        return (
            usage.input_tokens * input_rate / 1000
            + usage.output_tokens * output_rate / 1000
        )
    
    def stats(self) -> dict:
        total = self.request_count + self.cache_hits
        return {
            "total_queries": total,
            "cache_hits": self.cache_hits,
            "cache_hit_rate": self.cache_hits / max(total, 1),
            "api_calls_saved": self.cache_hits,
        }

Monitoring and Measuring Cost Optimization

You can't optimize what you don't measure. Every production LLM deployment needs cost observability — and I'd argue this should be one of the first things you set up, not the last.

Essential KPIs

  • Cost per query: Total spend divided by total queries. Track this per model, per endpoint, and per customer segment.
  • Token efficiency ratio: Useful output tokens divided by total tokens (input + output). A low ratio suggests bloated prompts or verbose outputs.
  • Cache hit rate: Percentage of queries served from cache. Target 60%+ for customer-facing workloads with repetitive patterns.
  • Model routing distribution: What percentage of queries go to each model tier. If 90% of queries are hitting your premium tier, your router needs tuning.
  • Cost per successful outcome: Not every LLM call produces a useful result. Track the cost per successful task completion, not just per API call. This is the metric that actually matters for your business.

Building a Cost Dashboard

If you're using LiteLLM as your gateway, cost tracking comes built-in. For custom setups, instrument your pipeline to emit structured cost events:

import json
import logging
from datetime import datetime, timezone

cost_logger = logging.getLogger("llm_costs")
cost_logger.setLevel(logging.INFO)

handler = logging.FileHandler("llm_costs.jsonl")
cost_logger.addHandler(handler)


def log_llm_cost(
    model: str,
    input_tokens: int,
    output_tokens: int,
    cached: bool,
    cache_read_tokens: int = 0,
    endpoint: str = "",
    user_id: str = "",
):
    """Log structured cost data for analysis."""
    
    rates = {
        "claude-haiku-4-5": (0.001, 0.005),
        "claude-sonnet-4-5": (0.003, 0.015),
        "claude-opus-4-6": (0.015, 0.075),
        "gpt-4o": (0.0025, 0.01),
        "gpt-4o-mini": (0.00015, 0.0006),
    }
    
    input_rate, output_rate = rates.get(model, (0.003, 0.015))
    
    # Cache reads cost 10% of normal input rate
    standard_input_cost = input_tokens * input_rate / 1000
    cache_savings = cache_read_tokens * (input_rate * 0.9) / 1000
    actual_cost = standard_input_cost - cache_savings + (output_tokens * output_rate / 1000)
    
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cache_read_tokens": cache_read_tokens,
        "cached_response": cached,
        "cost_usd": round(actual_cost, 6),
        "cache_savings_usd": round(cache_savings, 6),
        "endpoint": endpoint,
        "user_id": user_id,
    }
    
    cost_logger.info(json.dumps(event))

Feed these logs into your observability stack (Grafana, Datadog, or even a simple Pandas analysis) to spot optimization opportunities: endpoints with low cache hit rates, users consuming disproportionate resources, or routing decisions that could be improved.
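If you don't have a full observability stack yet, even a stdlib script over the JSONL log gets you useful per-model breakdowns. A minimal sketch (field names match the `log_llm_cost` events above; no Pandas required):

```python
import json
from collections import defaultdict

def summarize_costs(jsonl_path):
    """Aggregate cost events per model: call count, total spend, cache savings."""
    totals = defaultdict(lambda: {"calls": 0, "cost_usd": 0.0, "cache_savings_usd": 0.0})
    with open(jsonl_path) as f:
        for line in f:
            event = json.loads(line)
            bucket = totals[event["model"]]
            bucket["calls"] += 1
            bucket["cost_usd"] += event["cost_usd"]
            bucket["cache_savings_usd"] += event["cache_savings_usd"]
    return dict(totals)

# Demo with two synthetic events in the same format log_llm_cost emits
with open("llm_costs.jsonl", "w") as f:
    f.write(json.dumps({"model": "gpt-4o", "cost_usd": 0.012, "cache_savings_usd": 0.0}) + "\n")
    f.write(json.dumps({"model": "gpt-4o", "cost_usd": 0.008, "cache_savings_usd": 0.004}) + "\n")

summary = summarize_costs("llm_costs.jsonl")
print(summary["gpt-4o"])
```

Run this nightly (or load the same JSONL into Pandas) and you have a crude but honest cost dashboard on day one.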

Advanced Techniques for 2026

Speculative Decoding for Self-Hosted Models

If you're running self-hosted models, speculative decoding is worth looking into. It uses a small "draft" model to generate candidate tokens that the larger "target" model can verify in parallel. This doesn't reduce cost per token, but it increases throughput by 2-3x — meaning you need fewer GPUs to serve the same traffic. Google's Speculative Cascades paper shows this technique integrating naturally with model cascading for optimal cost-quality-latency tradeoffs.

Fine-Tuned Small Models as Routing Targets

This is a pattern I'm seeing more and more in production: fine-tune a small, cheap model on your specific domain, then route the majority of domain-specific queries to it. A fine-tuned Llama 3.1 8B model costs effectively nothing to run compared to Claude Opus, and for narrow, well-defined tasks (classification, extraction, formatting), it can match frontier model quality. Reserve the expensive models for truly open-ended or novel queries.

Dynamic Budget Allocation

Implement per-user or per-endpoint cost budgets that automatically adjust routing aggressiveness. When a user is under budget, route more queries to premium models for better quality. As they approach their limit, automatically shift to cheaper models. This provides a natural quality-cost gradient that keeps total spend predictable:

class BudgetAwareRouter:
    """Routes requests based on remaining budget."""
    
    def __init__(self, daily_budget_usd: float):
        self.daily_budget = daily_budget_usd
        self.spent_today = 0.0
    
    def get_model(self, complexity: Complexity) -> str:
        budget_remaining_pct = 1 - (self.spent_today / self.daily_budget)
        
        if budget_remaining_pct < 0.1:
            # Under 10% budget remaining — force cheapest model
            return "claude-haiku-4-5"
        elif budget_remaining_pct < 0.3:
            # Under 30% — cap at mid-tier
            if complexity == Complexity.COMPLEX:
                return "claude-sonnet-4-5"
            return "claude-haiku-4-5"
        else:
            # Plenty of budget — route normally
            return MODELS[complexity].name
    
    def record_spend(self, cost: float):
        self.spent_today += cost

Real-World Savings: What to Expect

Let's get concrete. Here's a realistic breakdown of savings for a typical production application processing 100,000 queries per day with a mix of simple and complex requests:

  • Baseline (no optimization): All queries to Claude Sonnet, no caching = ~$450/day
  • + Semantic caching (65% hit rate): 35,000 API calls instead of 100,000 = ~$158/day (65% reduction)
  • + Model routing (70% simple, 25% moderate, 5% complex): Weighted average cost drops = ~$85/day (46% further reduction)
  • + Prompt caching (50K token system prompt): 90% discount on repeated prefix = ~$62/day (27% further reduction)
  • + Batch processing for 30% of non-real-time workload: 50% discount on batch portion = ~$53/day (15% further reduction)

Total: from $450/day to $53/day — an 88% reduction. At scale, that's the difference between an AI product that's profitable and one that isn't. I've watched teams go from panicking about their API bills to comfortably scaling up their usage after implementing these layers.

Getting Started: Your First Week

Don't try to implement everything at once. Seriously. Here's a prioritized action plan that lets you capture the biggest wins first:

  1. Day 1-2: Instrument costs. Add token counting and cost logging to every LLM call. You need data before you can optimize — and you'll probably be surprised by what you find.
  2. Day 3: Enable provider prompt caching. If you're using Anthropic or OpenAI and have static system prompts, adding cache_control is a 5-minute change with immediate 90% savings on cached portions. This is the lowest-effort, highest-return optimization on this list.
  3. Day 4-5: Implement semantic caching. Deploy Redis, set up RedisVL's SemanticCache, and wrap your LLM calls. Start with conservative similarity thresholds and loosen as you validate quality.
  4. Day 6-7: Add model routing. Start with simple heuristic routing (keyword-based complexity classification). Monitor quality metrics to ensure cheaper models meet your bar. Refine the classifier over time.

Each step compounds on the previous one. By the end of the first week, most teams see 50-70% cost reduction — and you haven't even touched batch processing or advanced techniques yet.

Conclusion

LLM cost optimization isn't a one-time project — it's an ongoing discipline. As models get cheaper (and they will — today's $15/million-token model will be $1.50 in two years), the techniques in this guide remain relevant because usage scales faster than prices drop. The teams that build cost-aware infrastructure today will be the ones that can afford to deploy AI at the scale their users demand.

The key insight is that cost optimization and quality optimization aren't opposed — they're aligned. Caching serves faster, more consistent responses. Routing ensures each query gets the right model, not just the biggest one. Prompt optimization forces clarity in your instructions. These techniques make your AI applications better and cheaper.

Start with measurement, add caching, then layer in routing and batching. The compound savings are real, and they start on day one.

About the Author

Editorial Team: our team of expert writers and editors.
Our team of expert writers and editors.