RAG Chunking Strategies in 2026: Late Chunking, Contextual Retrieval, and Semantic Splitting

How to chunk documents for RAG in 2026: contextual retrieval, late chunking, semantic splitting, plus production Python code, benchmarks, and evaluation tips.


Updated: May 26, 2026

The best RAG chunking strategy in 2026 is a hybrid of contextual retrieval (prepending a short LLM-generated context to each chunk before embedding) and late chunking (embedding the whole document with a long-context model, then pooling token vectors into chunks). The evidence for each comes from a separate source. Anthropic's published benchmarks report that contextual retrieval cuts top-20 retrieval failures by 35–67% relative to embedding chunks without context, depending on whether BM25 and reranking are added. Jina AI's late-chunking paper reports retrieval gains over embedding each chunk in isolation. This guide shows how each technique works, when to use it, and how to combine them with semantic splitting in a production Python pipeline.

  • Fixed-size chunking is the wrong default in 2026. It cuts mid-sentence, breaks tables, and strips the context an embedding model needs to disambiguate references.
  • Contextual retrieval (Anthropic, 2024) prepends a 50–100 token document-aware summary to each chunk before embedding, reducing top-20 retrieval failures by ~35% on Anthropic's evals; combined with BM25 and reranking, by ~67%.
  • Late chunking (Jina AI, 2024) runs the entire document through a long-context embedding model first, then mean-pools token embeddings into chunk vectors so every chunk inherits document-wide context without any extra LLM calls.
  • Semantic splitting uses embedding-similarity drops between sentences to find natural breakpoints. It beats fixed sizes on prose, but loses to document-aware splitters for code and Markdown.
  • Optimal chunk size is task-specific: 200–400 tokens for QA over technical docs, 800–1200 for summarization, 50–200 for code search. Always evaluate with recall@k, not vibes.
  • Prompt caching makes contextual retrieval affordable at scale. The document is cached once and each chunk's request re-uses the cache; Anthropic's own estimate is about $1 per million document tokens.

Why chunking decides RAG quality

Every retrieval-augmented generation system makes one decision before it makes any others: how to slice source documents into the units that get embedded and stored. That single decision has a larger effect on end-to-end answer quality than the choice of embedding model, the vector database, or even the LLM, because no downstream component can recover information that the chunker discarded. A 2025 study from Databricks evaluating 12 chunking strategies across enterprise corpora found a 41-point spread in recall@10 between the worst and best splitters using the same embedding model.

Honestly, this is the part I see teams underestimate the most. I've shipped two production RAG systems where swapping the chunker (and nothing else) moved end-to-end accuracy by double-digit points. The reason chunking matters so much is that embedding models compress text into a single vector. If a chunk contains two unrelated topics, its vector lands halfway between them and matches neither query well. If a chunk lacks the subject (“the company” with no antecedent), the embedding model can't disambiguate it. And if a chunk splits a table across rows, the resulting fragments are useless to the LLM at generation time.
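The "lands halfway between them" effect is easy to see on toy vectors. The sketch below is a deliberate simplification: it assumes two unrelated topics sit on orthogonal axes and that a mixed chunk behaves like their average, which real embedding models only approximate. The names `topic_a`, `topic_b`, and `mixed_chunk` are made up for the illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy 2-D "embeddings": two unrelated topics on orthogonal axes (an assumption).
topic_a = [1.0, 0.0]
topic_b = [0.0, 1.0]

# A chunk that mixes both topics, modeled as the average of the two.
mixed_chunk = [(x + y) / 2 for x, y in zip(topic_a, topic_b)]

focused_score = cosine(topic_a, topic_a)    # query about A vs. a chunk only about A
mixed_score = cosine(topic_a, mixed_chunk)  # same query vs. the two-topic chunk
print(f"focused={focused_score:.3f} mixed={mixed_score:.3f}")
# prints: focused=1.000 mixed=0.707
```

In this toy setup the mixed chunk gives up about 30% of its similarity to either query, which is often enough to push it below a focused chunk from another document.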

Good chunking aims for three properties at once: coherence (one topic per chunk), self-containment (no dangling references), and recall granularity (small enough that the answer-bearing span dominates the chunk's vector). The techniques in this guide attack those three properties from different angles. Semantic splitting targets coherence by cutting at topic shifts. Late chunking targets self-containment by letting every chunk see the whole document. Contextual retrieval brute-forces self-containment with an LLM-written preamble. The right production system usually combines two of them.

What is the best chunk size for RAG?

There's no universal best chunk size, but a defensible default for most English technical content is 300 tokens with 50 tokens of overlap, measured against a recall@10 evaluation. That number falls out of two competing pressures. Embeddings get noisier as chunks grow because the vector averages more concepts. Below 100 tokens, vectors get noisier in the other direction because there isn't enough signal to anchor the meaning. The sweet spot for OpenAI's text-embedding-3-large and Cohere's embed-english-v3.0 on technical documentation lands at 250–400 tokens in benchmarks published by both vendors.

Task type shifts the optimum more than corpus type does. For extractive question answering (where the answer is a phrase in the source), smaller chunks of 150–250 tokens win because they concentrate the answer in the vector. For summarization or comparative reasoning over long passages, larger chunks of 800–1200 tokens win because the LLM needs surrounding context to write a useful answer. For code search, chunks should follow function or class boundaries regardless of token count, which usually means 50–200 tokens per chunk with no overlap.

Overlap matters less than people think. The classic recommendation of 10–20% overlap helps when the answer-bearing sentence straddles a boundary, but with semantic splitting (which cuts at topic shifts rather than fixed offsets) overlap can be reduced to zero with no measurable recall loss. The one case where high overlap (up to 50%) still pays is when chunks feed a reranker that scores cross-chunk relationships. See our guide to hybrid search RAG with cross-encoder reranking for the reranking side of that trade-off.
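For reference, the fixed-size-with-overlap baseline discussed above takes only a few lines. This is a minimal sketch, not a library API: `fixed_size_chunks` is a hypothetical helper, and the whitespace-style tokens stand in for whatever tokenizer your embedding model actually uses.

```python
def fixed_size_chunks(tokens, size=300, overlap=50):
    """Slide a window of `size` tokens, advancing by `size - overlap` each step."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

# Placeholder tokens; in practice, tokenize with your embedding model's tokenizer.
tokens = [f"t{i}" for i in range(700)]
chunks = fixed_size_chunks(tokens, size=300, overlap=50)
print([len(c) for c in chunks])
# prints: [300, 300, 200]
```

Each chunk starts with the last 50 tokens of the previous one, which is what rescues an answer sentence that straddles a boundary.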

Fixed-size vs semantic chunking: which is better?

Semantic chunking wins on prose, fixed-size wins on uniform records, and document-aware splitters beat both on structured content. The mechanics differ in where the cut happens. A fixed-size splitter walks the text and cuts every N tokens. A semantic splitter embeds each sentence, computes the cosine similarity between consecutive sentences, and cuts where similarity drops below a threshold (signaling a topic shift). LangChain's SemanticChunker and LlamaIndex's SemanticSplitterNodeParser both implement variations of the algorithm originally proposed by Greg Kamradt in 2023.

Here is a minimal semantic splitter using LangChain 0.3 and OpenAI embeddings:

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# breakpoint_threshold_type options: percentile, standard_deviation,
# interquartile, gradient. percentile=95 is the published default.
splitter = SemanticChunker(
    OpenAIEmbeddings(model="text-embedding-3-large"),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,
    min_chunk_size=200,
)

with open("docs/architecture.md", "r") as f:
    text = f.read()

chunks = splitter.create_documents([text])
print(f"Produced {len(chunks)} semantic chunks")
for i, c in enumerate(chunks[:3]):
    print(f"--- chunk {i} ({len(c.page_content)} chars) ---")
    print(c.page_content[:200])

Semantic chunking isn't free. Every sentence gets embedded once at indexing time to find the breakpoints, and the final chunks are embedded again for storage, which adds 2–5x to your ingestion bill versus fixed-size. The win shows up in recall on conversational or narrative content where paragraph boundaries don't align with topic boundaries. On well-structured technical documentation with headings, a heading-aware splitter usually beats both fixed and semantic chunking because Markdown headings already encode topic structure.
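The breakpoint logic itself is small enough to sketch without a framework. The version below is an illustration under stated assumptions, not LangChain's implementation: `toy_embed` is a bag-of-words placeholder for a real embedding model, `semantic_split` and `percentile` are hypothetical helper names, and the neighbor-window buffering that library versions apply before embedding is left out.

```python
import math
from collections import Counter

def toy_embed(sentence):
    # Placeholder embedding: bag-of-words counts. Swap in a real model.
    return Counter(sentence.lower().split())

def cosine_distance(a, b):
    dot = sum(a[k] * b.get(k, 0.0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 if na == 0 or nb == 0 else 1.0 - dot / (na * nb)

def percentile(values, pct):
    # Linearly interpolated percentile of a non-empty list.
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)

def semantic_split(sentences, embed=toy_embed, pct=95):
    """Cut wherever the distance between neighbors exceeds the pct-th percentile."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    vecs = [embed(s) for s in sentences]
    dists = [cosine_distance(vecs[i], vecs[i + 1]) for i in range(len(vecs) - 1)]
    threshold = percentile(dists, pct)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], dists):
        if dist > threshold:  # similarity dropped sharply: treat as a topic shift
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

sentences = [
    "The cache stores embeddings on disk.",
    "The cache evicts old embeddings first.",
    "Invoices are emailed every month.",
    "Invoices list every charge.",
]
chunks = semantic_split(sentences)
print(chunks)
```

On this input the only distance above the 95th-percentile threshold is the jump from the cache sentences to the invoice sentences, so the splitter returns two chunks. Because the threshold is relative to the document's own distances, a document with no real topic shift still gets cut at its largest jump unless all distances are equal.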

How does late chunking work?

Late chunking inverts the usual order. Instead of splitting the document and then embedding each chunk, you embed the entire document with a long-context model and split the resulting token-level embeddings into chunk vectors afterward. The technique was introduced by Jina AI in 2024 (see the Jina AI late chunking announcement) and works with any embedding model that exposes token-level outputs. jina-embeddings-v3 supports it directly through its hosted API; open-weight models such as nomic-embed-text-v1.5 and bge-m3 can in principle be late-chunked when you run them locally and pool the token-level outputs yourself.

The mechanical difference is small but the semantic difference is large. In traditional chunking, the embedding for “The model uses 4-bit quantization” has no information about which model. With late chunking, that sentence's tokens were processed alongside the rest of the document, so its vector already encodes “Llama 3.1 70B” (or whatever was mentioned three paragraphs earlier). You get coreference resolution and document-wide context for free, without making any extra LLM calls. I hit this exact pain point shipping an internal product docs bot last spring: traditional chunking kept returning snippets about “the limit” without any way to tell which limit, and swapping to late chunking fixed the top-3 recall in an afternoon.
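The pooling step is the whole trick, so it is worth seeing in isolation. This is a minimal sketch with made-up numbers: `token_vectors` stands in for the token-level output of a long-context encoder run over the full document, and `pool_chunks` is a hypothetical helper name.

```python
def pool_chunks(token_vectors, spans):
    """Mean-pool token vectors over [start, end) token spans, one vector per chunk."""
    pooled = []
    for start, end in spans:
        window = token_vectors[start:end]
        if not window:
            raise ValueError(f"empty span: {(start, end)}")
        dim = len(window[0])
        pooled.append([sum(vec[d] for vec in window) / len(window)
                       for d in range(dim)])
    return pooled

# Stand-in for encoder output over the WHOLE document (4 tokens, 2 dimensions).
# In a real model, attention has already mixed document context into each vector.
token_vectors = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
spans = [(0, 2), (2, 4)]  # chunk boundaries expressed as token offsets
chunk_vectors = pool_chunks(token_vectors, spans)
print(chunk_vectors)
# prints: [[2.0, 0.0], [0.0, 3.0]]
```

The chunk boundaries can come from any splitter; the only requirement is that they are expressed in the same token offsets the encoder used.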

Here's a working late chunking implementation with jina-embeddings-v3:

import requests
import numpy as np

API_URL = "https://api.jina.ai/v1/embeddings"
HEADERS = {"Authorization": "Bearer YOUR_JINA_KEY",
           "Content-Type": "application/json"}

def late_chunk_embed(chunks: list[str]) -> list[dict]:
    # Pass the chunks of ONE document, in reading order.
    # As we understand the Jina API, with late_chunking=True the inputs are
    # concatenated, encoded once as a single text, and one pooled vector is
    # returned per input item. The combined length must fit in the model's
    # context window.
    payload = {
        "model": "jina-embeddings-v3",
        "input": chunks,
        "late_chunking": True,
        "task": "retrieval.passage",
    }
    r = requests.post(API_URL, json=payload, headers=HEADERS, timeout=60)
    r.raise_for_status()
    # Each item in "data" carries its position in the input list.
    data = sorted(r.json()["data"], key=lambda item: item["index"])
    return [
        {"text": text, "embedding": np.array(item["embedding"])}
        for text, item in zip(chunks, data)
    ]

with open("docs/architecture.md", encoding="utf-8") as f:
    doc = f.read()

# Any splitter works here; paragraph splitting keeps the example short.
paragraphs = [p for p in doc.split("\n\n") if p.strip()]
chunks = late_chunk_embed(paragraphs)
print(f"{len(chunks)} late-chunked vectors, dim={len(chunks[0]['embedding'])}")

The catch is the context window. jina-embeddings-v3 handles 8,192 tokens; longer documents must be split into overlapping mega-chunks first (typically 6,000 tokens with 500 of overlap) and late-chunked within each. nomic-embed-text-v1.5 and BAAI/bge-m3 have the same 8,192-token limit. None yet handle a million tokens, so late chunking isn't a replacement for hierarchical retrieval on book-length sources.

What is contextual retrieval and when should you use it?

Contextual retrieval is Anthropic's 2024 technique for fixing the “dangling reference” problem. An LLM writes a 50–100 token preamble for each chunk that places it in document context. The preamble is prepended to the chunk text before embedding, so the resulting vector encodes both the chunk's local meaning and its global role. Anthropic's contextual retrieval post reports a 35% reduction in top-20 retrieval failures using contextual embeddings alone, 49% when contextual BM25 is added, and 67% when a Cohere reranker is added on top.

So, when do you actually need it? Use contextual retrieval when your chunks contain references that can't be resolved locally: pronouns, definite articles (“the API”, “the customer”), table rows extracted from larger tables, code snippets pulled out of files, or financial figures whose units appear in the document header. Skip it when chunks are already self-contained, like FAQ entries, product cards, or wiki paragraphs with their own subject.

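With prompt caching, Anthropic estimates the one-time cost of generating contexts at about $1 per million document tokens using Claude Haiku, under its stated assumptions of 8K-token documents and 800-token chunks. The document is cached once per ingestion run, and each chunk re-uses that cache. Without caching, every chunk's request pays full input price for the whole document, so the bill grows with the number of chunks per document; cached reads are billed at roughly a tenth of the normal input rate. See our deep dive on prompt caching for cost reduction for the caching mechanics.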

import anthropic

client = anthropic.Anthropic()

# The document goes in its own cached content block (see below), so the
# per-chunk prompt carries only the chunk and the instruction.
CONTEXT_PROMPT = (
    "Here is a chunk we want to situate within the whole document:\n"
    "<chunk>\n{chunk}\n</chunk>\n\n"
    "Give a short (50-100 token) context that situates this chunk "
    "within the document for improving search retrieval. Answer with "
    "only the context, no preamble."
)

def contextualize_chunk(document: str, chunk: str) -> str:
    # Generate a context preamble using Claude Haiku with prompt caching.
    msg = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": [
                # Cache the document so subsequent chunks re-use it.
                # Documents shorter than the model's minimum cacheable
                # length are processed normally, just without the discount.
                {"type": "text",
                 "text": f"<document>\n{document}\n</document>",
                 "cache_control": {"type": "ephemeral"}},
                {"type": "text",
                 "text": CONTEXT_PROMPT.format(chunk=chunk)},
            ],
        }],
    )
    return msg.content[0].text.strip()

def embed_with_context(document: str, chunks: list[str]) -> list[str]:
    # Prepend an LLM-written context to each chunk before embedding.
    enriched = []
    for chunk in chunks:
        ctx = contextualize_chunk(document, chunk)
        enriched.append(f"{ctx}\n\n{chunk}")
    return enriched  # feed these to your embedding model
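To sanity-check the cost claim against your own corpus, the arithmetic can be written down directly. This is a back-of-the-envelope sketch: `contextualization_cost` is a hypothetical helper, the prices in `PRICES` are placeholders rather than a quote from any price sheet, and the model assumes the first call for a document writes the cache while every later call reads it.

```python
def contextualization_cost(doc_tokens, chunk_tokens, instruction_tokens,
                           context_tokens, prices, use_cache=True):
    """Estimated USD cost of writing a context for every chunk of one document.

    `prices` holds USD per million tokens for: input, output,
    cache_write, cache_read.
    """
    n_chunks = max(1, -(-doc_tokens // chunk_tokens))  # ceiling division
    if use_cache:
        # First call writes the document to the cache; the rest read it back.
        doc_cost = doc_tokens * (prices["cache_write"]
                                 + (n_chunks - 1) * prices["cache_read"])
    else:
        # Every call pays full input price for the whole document.
        doc_cost = n_chunks * doc_tokens * prices["input"]
    per_chunk = ((chunk_tokens + instruction_tokens) * prices["input"]
                 + context_tokens * prices["output"])
    return (doc_cost + n_chunks * per_chunk) / 1_000_000

# Placeholder prices (USD per million tokens), NOT a quote. Check current pricing.
PRICES = {"input": 0.25, "output": 1.25, "cache_write": 0.30, "cache_read": 0.03}

cached = contextualization_cost(8_000, 800, 50, 100, PRICES)
uncached = contextualization_cost(8_000, 800, 50, 100, PRICES, use_cache=False)
per_million_doc_tokens = cached / 8_000 * 1_000_000
print(f"cached=${cached:.6f} uncached=${uncached:.6f} "
      f"per-million-doc-tokens=${per_million_doc_tokens:.2f}")
```

With these placeholder inputs the cached run comes out close to $1 per million document tokens, and the uncached run roughly three times that. The gap widens as documents get longer relative to chunk size, because the repeated document read is the part that caching discounts.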

Contextual retrieval and late chunking aren't mutually exclusive. Late chunking handles in-document coreference cheaply, but it only changes the dense vector. Contextual retrieval writes the missing context (which document, which section, which time period) into the chunk text itself, so keyword indexes such as BM25 and the LLM at generation time benefit too. A solid 2026 pipeline often uses both.

Document-aware splitting for code, Markdown, and tables

Code, tables, and structured documents lose most of their meaning when split by character count. For these formats, the chunker should respect the document's native structure. Markdown should split at heading boundaries. Code should split at function or class boundaries. HTML tables should be chunked row-wise with the header row prepended to every chunk. JSON should be flattened to key-paths.

LangChain's RecursiveCharacterTextSplitter handles this by accepting a priority-ordered list of separators (by default paragraph breaks, then newlines, then spaces, then individual characters). For Markdown and code, use the language-specific variant, which swaps in separators such as headings or def and class keywords:

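# LangChain 0.3 ships the splitters in the langchain-text-splitters package.
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter, Language
)

# Note: chunk_size and chunk_overlap count characters here, not tokens,
# unless you pass a token-based length_function.
md_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN,
    chunk_size=400,
    chunk_overlap=50,
)
py_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=300,
    chunk_overlap=0,  # code rarely benefits from overlap
)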

md_chunks = md_splitter.create_documents([open("README.md").read()])
py_chunks = py_splitter.create_documents([open("server.py").read()])
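If you would rather cut strictly at headings and keep the heading trail with each section, a framework-free version is short. The sketch below is illustrative only: `split_markdown_by_heading` is a hypothetical helper, it handles ATX-style `#` headings but not Setext underlines, and it treats any line starting with a triple backtick as a code-fence toggle.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # code-fence marker, built this way to keep the example readable

def split_markdown_by_heading(text):
    """Return (heading_path, body) pairs, one per heading-delimited section."""
    sections, path, body = [], [], []
    in_fence = False

    def flush():
        content = "\n".join(body).strip()
        if content:
            sections.append((" > ".join(path), content))
        body.clear()

    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # '#' inside code fences is not a heading
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]      # drop sibling and deeper headings
            path.append(match.group(2))
        else:
            body.append(line)
    flush()
    return sections

doc = "\n".join([
    "# Guide", "Intro.",
    "## Setup", "Install it.", FENCE, "# not a heading", FENCE,
    "## Usage", "Run it.",
])
sections = split_markdown_by_heading(doc)
for heading_path, content in sections:
    print(heading_path, "|", len(content), "chars")
```

Prepending the heading path (for example "Guide > Setup") to each section before embedding is a cheap way to give chunks some of the self-containment that contextual retrieval buys with an LLM call.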

For tables specifically, the highest-recall pattern is to extract each row, prepend the column headers and the table caption (if any), and treat the result as one chunk. That turns a 50-row financial table into 50 self-contained record-style chunks. Tools like unstructured.io, marker, and Mistral OCR all expose structured output that makes this transformation straightforward, and pairing it with the LLM data extraction patterns we covered with Instructor and Pydantic lets you validate the row schema during ingestion.
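The row-wise pattern can be sketched with nothing but the standard library. Treat this as a starting point under stated assumptions: `split_tables_rowwise` and the `_TableRows` parser are hypothetical names, the input is expected to contain a single HTML table whose first row is the header, and merged cells (rowspan or colspan) are not handled.

```python
from html.parser import HTMLParser

class _TableRows(HTMLParser):
    """Collect the caption and the rows (lists of cell strings) of a table."""
    def __init__(self):
        super().__init__()
        self.caption = ""
        self.rows = []
        self._row = None
        self._cell = None
        self._in_caption = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
        elif tag == "caption":
            self._in_caption = True

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)
        elif self._in_caption:
            self.caption += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse internal whitespace so cells become single-line strings.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row, self._cell = None, None
        elif tag == "caption":
            self._in_caption = False

def split_tables_rowwise(html):
    """One chunk per body row, each prefixed with the caption and column headers."""
    parser = _TableRows()
    parser.feed(html)
    if len(parser.rows) < 2:
        return []  # no header + body rows found
    header, *body = parser.rows
    caption = parser.caption.strip()
    chunks = []
    for row in body:
        fields = "; ".join(f"{h}: {v}" for h, v in zip(header, row))
        chunks.append(f"{caption} | {fields}" if caption else fields)
    return chunks

html = """
<table><caption>Q3 revenue</caption>
<tr><th>Region</th><th>Revenue</th></tr>
<tr><td>EMEA</td><td>4.2M</td></tr>
<tr><td>APAC</td><td>3.1M</td></tr></table>
"""
row_chunks = split_tables_rowwise(html)
print(row_chunks)
# prints: ['Q3 revenue | Region: EMEA; Revenue: 4.2M', 'Q3 revenue | Region: APAC; Revenue: 3.1M']
```

Each chunk now answers "what was APAC revenue in Q3?" on its own, without the retriever needing to pull the header row separately.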

How to evaluate a chunking strategy

You can't tune chunking by eyeballing chunks. Build a held-out evaluation set of 50–200 query / golden-answer pairs and measure two things for each chunking strategy: recall@k (does the chunk containing the golden answer appear in the top k results?) and answer quality (does an LLM answering from the retrieved chunks produce the golden answer?). The first isolates retrieval quality; the second tells you whether retrieval quality translates into end-user value.

The RAGAS library remains the most-cited 2026 framework for this, but Phoenix and TruLens are widely used as well. A minimal recall@k evaluator in raw NumPy looks like this:

import numpy as np

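def recall_at_k(query_vecs, chunk_vecs, gold_indices, k=10):
    # For each query, did the gold chunk appear in the top-k results?
    # gold_indices[i] is the row of chunk_vecs that answers query i.
    sims = query_vecs @ chunk_vecs.T  # cosine if vectors are L2-normalized
    k = min(k, sims.shape[1])         # guard against corpora smaller than k
    # argpartition moves the k highest-scoring chunks (unordered) to the front
    topk = np.argpartition(-sims, k - 1, axis=1)[:, :k]
    hits = [int(g in topk[i]) for i, g in enumerate(gold_indices)]
    return sum(hits) / len(hits)

# Run the same query set through each chunking strategy and compare.
# qv and the *_vecs / *_gold arrays are placeholders for your own data.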
fixed_recall = recall_at_k(qv, fixed_chunks_vecs, fixed_gold, k=10)
semantic_recall = recall_at_k(qv, semantic_chunks_vecs, semantic_gold, k=10)
late_recall = recall_at_k(qv, late_chunks_vecs, late_gold, k=10)
print(f"fixed={fixed_recall:.3f} semantic={semantic_recall:.3f} "
      f"late={late_recall:.3f}")

Two pitfalls to avoid. First, the gold-chunk index changes when you change the chunker, so you have to re-annotate the golden chunks per strategy. A query whose answer was in chunk 47 with fixed-size splitting might be in chunk 23 under semantic splitting. Automate this by storing golden spans (character offsets in the original document) and computing the gold chunk index dynamically. Second, recall@10 hides catastrophic failures at recall@1, so track both.
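Computing the gold chunk from stored character spans is a one-liner once every chunk carries its offsets. This is a minimal sketch with hypothetical helper names (`gold_chunk_indices`, `hit_at_k`); it counts any overlap with the golden span as a hit, which is a design choice you may want to tighten to full containment.

```python
def gold_chunk_indices(chunk_spans, gold_span):
    """Indices of chunks whose [start, end) character range overlaps the gold span."""
    g_start, g_end = gold_span
    return [i for i, (start, end) in enumerate(chunk_spans)
            if start < g_end and g_start < end]

def hit_at_k(ranked_chunk_ids, chunk_spans, gold_span, k):
    """True if any of the top-k retrieved chunks overlaps the gold span."""
    gold = set(gold_chunk_indices(chunk_spans, gold_span))
    return any(cid in gold for cid in ranked_chunk_ids[:k])

gold_span = (120, 160)  # character offsets of the answer in the source document

fixed_spans = [(0, 100), (100, 200), (200, 300)]  # one chunker's output
semantic_spans = [(0, 130), (130, 300)]           # another chunker, same document

print(gold_chunk_indices(fixed_spans, gold_span))     # prints: [1]
print(gold_chunk_indices(semantic_spans, gold_span))  # prints: [0, 1]
print(hit_at_k([2, 1, 0], fixed_spans, gold_span, k=1))  # prints: False
print(hit_at_k([2, 1, 0], fixed_spans, gold_span, k=2))  # prints: True
```

The same golden span works for every chunker, so the evaluation set is annotated once and re-used across strategies.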

Putting it together: a production chunking pipeline

A defensible 2026 production pipeline routes documents through different chunkers based on content type, then optionally enriches each chunk with contextual retrieval. The high-level flow:

  1. Classify the document by MIME type and content heuristics: code, Markdown, prose, tabular, PDF.
  2. Route to the appropriate splitter: language-aware for code, heading-aware for Markdown, row-wise for tables, semantic for prose, hierarchical for PDFs.
  3. Late-chunk any document under the embedding model's context limit (8K–32K tokens depending on the model).
  4. Contextualize chunks for closed-domain content where references are likely to be ambiguous, using prompt caching to keep costs flat.
  5. Embed with a model that matches your latency and recall budget. text-embedding-3-large, Cohere's embed-english-v3.0, jina-embeddings-v3, and nomic-embed-text-v1.5 are all defensible 2026 picks.
  6. Store the chunk vector, the chunk text, the original document id, character offsets, and the generated context (so you can re-embed without re-calling the LLM).
  7. Pair chunking with hybrid search and reranking. The chunker only controls what's findable; the retriever controls what gets found.

A sketch of the routing layer:

from dataclasses import dataclass
from typing import Callable

@dataclass
class ChunkingRoute:
    matches: Callable[[str, str], bool]   # (filename, content) -> bool
    splitter: Callable[[str], list[str]]

routes = [
    ChunkingRoute(
        matches=lambda f, _: f.endswith(".py"),
        splitter=lambda t: [c.page_content
                            for c in py_splitter.create_documents([t])],
    ),
    ChunkingRoute(
        matches=lambda f, _: f.endswith((".md", ".mdx")),
        splitter=lambda t: [c.page_content
                            for c in md_splitter.create_documents([t])],
    ),
    ChunkingRoute(
        # "<table" also matches tags that carry attributes
        matches=lambda _, c: "<table" in c.lower(),
        splitter=split_tables_rowwise,           # custom; you supply this
    ),
]

def route_chunks(filename: str, content: str) -> list[str]:
    for r in routes:
        if r.matches(filename, content):
            return r.splitter(content)
    # default: semantic splitter for everything else
    # (`splitter` is the SemanticChunker instance built earlier)
    return [c.page_content
            for c in splitter.create_documents([content])]
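Step 6 of the flow above (storing offsets and the generated context next to each chunk) can be sketched as a small record type. The names here are hypothetical (`ChunkRecord`, `build_records`), and the locator assumes each chunk appears verbatim in the source document, which holds for splitters that slice text without rewriting it.

```python
from dataclasses import dataclass, field

@dataclass
class ChunkRecord:
    doc_id: str
    start: int          # character offset of the chunk in the source document
    end: int            # exclusive end offset
    text: str
    context: str = ""   # LLM-written preamble, kept so re-embedding needs no LLM call
    embedding: list[float] = field(default_factory=list)

    @property
    def embed_text(self) -> str:
        """Text that is actually sent to the embedding model."""
        return f"{self.context}\n\n{self.text}" if self.context else self.text

def build_records(doc_id, document, chunks, contexts=None):
    """Locate each chunk in the source text and wrap it in a ChunkRecord."""
    contexts = contexts or [""] * len(chunks)
    records, cursor = [], 0
    for chunk, context in zip(chunks, contexts):
        start = document.find(chunk, cursor)
        if start == -1:
            raise ValueError("chunk text not found verbatim after previous chunk")
        records.append(ChunkRecord(doc_id, start, start + len(chunk), chunk, context))
        cursor = start + 1  # step by one so overlapping chunks are still found
    return records

document = "Alpha section. Beta section. Gamma section."
records = build_records(
    "doc-1", document,
    ["Alpha section.", "Beta section.", "Gamma section."],
    contexts=["", "From the setup guide.", ""],
)
print([(r.start, r.end) for r in records])
# prints: [(0, 14), (15, 28), (29, 43)]
```

Keeping the offsets also makes the span-based evaluation described earlier possible without re-running the splitter.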

Once retrieval is working, the next failure mode is usually answer quality on multi-document questions, which is where techniques like agentic RAG with corrective retrieval and self-reflection start paying off. Chunking gets you in the game; agentic patterns close the last 10% of quality. For an overview of how the built-in splitters compare, the LangChain text splitter concepts page is the canonical reference.

Frequently Asked Questions

Does late chunking replace contextual retrieval?

No. Late chunking solves in-document context (coreference, definite references within the same document) by letting every chunk see the full surrounding text during embedding. Contextual retrieval solves meta-context (which document, which section, which time period) that a chunk alone doesn't contain. The two techniques compose well: late-chunk first, then prepend an LLM-written context.

How much overlap should chunks have?

With a semantic or document-aware splitter, zero overlap is usually fine because cuts happen at natural boundaries. With fixed-size chunking, 10–20% overlap (e.g., 50 tokens on a 300-token chunk) is the commonly recommended range. Higher overlap helps only when a downstream reranker is there to sort out the resulting near-duplicate chunks.

Is contextual retrieval expensive at scale?

With prompt caching, Anthropic's estimate is roughly $1 per million document tokens using Claude Haiku, because the document is cached once and re-used for every chunk-context generation. Without caching, each chunk's request pays full input price for the whole document, which multiplies the cost and adds latency that can block real-time ingestion pipelines.

What chunk size should I use for code search?

Split at function or class boundaries rather than by token count. Most production code search systems end up with chunks in the 50–200 token range with no overlap. Prepend the file path and surrounding class signature so the embedding can disambiguate identically-named functions across files.
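For Python sources, the standard library's ast module is enough to sketch this. The helper below is an illustration, not a production splitter: `python_code_chunks` is a hypothetical name, it requires Python 3.8+ for `end_lineno`, it skips module-level statements and decorators, and it emits each method as its own chunk tagged with the owning class.

```python
import ast

def python_code_chunks(source, file_path):
    """One chunk per top-level function or method, prefixed with path and class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    func_types = (ast.FunctionDef, ast.AsyncFunctionDef)

    def emit(node, owner=None):
        body = "\n".join(lines[node.lineno - 1:node.end_lineno])
        header = f"# file: {file_path}"
        if owner:
            header += f"\n# class: {owner}"
        chunks.append(f"{header}\n{body}")

    for node in tree.body:
        if isinstance(node, func_types):
            emit(node)
        elif isinstance(node, ast.ClassDef):
            methods = [n for n in node.body if isinstance(n, func_types)]
            if not methods:
                emit(node)  # data-only class: keep it whole
            for method in methods:
                emit(method, owner=node.name)
    return chunks

source = '''import os

def load(path):
    return open(path).read()

class Store:
    def load(self, key):
        return key
'''
code_chunks = python_code_chunks(source, "app/io.py")
for chunk in code_chunks:
    print(chunk)
    print("---")
```

The two `load` functions end up in separate chunks whose headers differ, which is what lets the embedding tell them apart.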

Can I use late chunking with OpenAI embeddings?

Not natively. OpenAI's embedding API doesn't expose token-level vectors, so the pooling step required for late chunking can't be done client-side. Use jina-embeddings-v3 through its API with the late_chunking=True parameter, or run an open-weight model such as nomic-embed-text-v1.5 or BAAI/bge-m3 locally and pool the token vectors yourself.

Editorial Team