Is your RAG agent struggling with accuracy or making things up? The culprit might be the 'lost context problem,' where chunking separates vital information from its original meaning. This article dives into why standard RAG falls short and introduces two powerful techniques—Late Chunking and Contextual Retrieval—to overcome this challenge. If you're building or refining RAG systems and want more reliable, context-aware AI responses, read on to learn how these methods work and how they can be implemented.
The Achilles' Heel of Standard RAG: The Lost Context Problem
If you've built a Retrieval-Augmented Generation (RAG) agent, you've likely faced the frustration of inaccurate answers or outright hallucinations. Often, this isn't the fault of the Large Language Model (LLM) itself, but rather a fundamental issue in how information is prepared: the lost context problem.
Let's illustrate with a simple example using a Wikipedia article about Berlin:
Imagine we split the article into sentences (chunks):
- "Berlin is the capital of Germany."
- "Its population is X." (Referring to Berlin)
- "The city is one of the states of Germany." (Referring to Berlin)
In standard RAG, each chunk is often processed independently. They're turned into vector embeddings and stored. The problem? Only the first chunk explicitly mentions "Berlin." The second and third chunks rely on the context of the first chunk to be understood.
When processed in isolation, these subsequent chunks lose their connection to "Berlin." During retrieval, a query for "Berlin" might easily find the first chunk but miss the others, as they lack the keyword and their embeddings don't reflect the necessary context.
This leads to:
- Incomplete Answers: The RAG system doesn't retrieve all relevant information.
- Inaccurate Answers/Hallucinations: Irrelevant chunks might score higher and be fed to the LLM, leading it astray.
Quick Refresher: RAG and Chunking
- RAG: Enhances LLM responses by first retrieving relevant information snippets (chunks) from a knowledge base. This provides the LLM with specific, timely data.
- Chunking: The process of splitting documents into manageable segments for RAG. Strategies vary (sentences, paragraphs, fixed size, overlap), and the best choice depends on the data.
Technique 1: Late Chunking - Embedding with Context First
Introduced by Jina AI in 2024, Late Chunking flips the standard RAG process on its head by leveraging long-context embedding models.
- Standard RAG: Chunks document ➔ Embeds chunks (Context often lost)
- Late Chunking: Embeds document (preserving context) ➔ Chunks document text ➔ Associates embeddings with text chunks
How Late Chunking Works
- Load & Embed Document: Load the entire document (or a large section) into a long-context embedding model. The model generates embeddings for all tokens simultaneously, capturing the overall context.
- Chunk Document Text: Apply your preferred chunking strategy (e.g., sentences, paragraphs) to the text only.
- Associate Embeddings: Map the token embeddings generated in step 1 to their corresponding text chunks from step 2.
- Pool Embeddings: Combine the token embeddings for each chunk into a single representative vector (e.g., through averaging/pooling).
- Store Embeddings: Store these context-aware chunk embeddings in the vector database.
Because the initial embeddings captured the full context, chunks like "Its population is X" retain their connection to "Berlin" in their vector representation.
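To make the chunking, association, and pooling steps concrete, here is a minimal JavaScript sketch. It assumes you already have per-token embeddings for the full document and know each chunk's token span; the function names and toy vectors are purely illustrative, not part of any library:

```javascript
// Minimal sketch of the late chunking pooling step (illustrative names and data).
// Assumes token-level embeddings for the WHOLE document are already available,
// along with the token span each text chunk covers.

function meanPool(vectors) {
  // Average a list of token embeddings into one chunk embedding.
  const dim = vectors[0].length;
  const pooled = new Array(dim).fill(0);
  for (const vec of vectors) {
    for (let i = 0; i < dim; i++) pooled[i] += vec[i];
  }
  return pooled.map((v) => v / vectors.length);
}

function lateChunkEmbeddings(tokenEmbeddings, chunkSpans) {
  // tokenEmbeddings: one vector per token of the full document
  // chunkSpans: [{ text, startToken, endToken }] produced by your text chunker
  return chunkSpans.map((span) => ({
    text: span.text,
    embedding: meanPool(tokenEmbeddings.slice(span.startToken, span.endToken)),
  }));
}

// Toy example: 3-dimensional token embeddings for a 6-token "document".
const tokenEmbeddings = [
  [0.1, 0.2, 0.3], [0.2, 0.1, 0.4], [0.0, 0.3, 0.1],
  [0.4, 0.4, 0.2], [0.3, 0.2, 0.1], [0.2, 0.2, 0.2],
];
const chunkSpans = [
  { text: "Berlin is the capital of Germany.", startToken: 0, endToken: 3 },
  { text: "Its population is X.", startToken: 3, endToken: 6 },
];
console.log(lateChunkEmbeddings(tokenEmbeddings, chunkSpans));
```

When the embedding provider performs late chunking server-side (as Jina AI's API can), this pooling happens for you and you simply receive one context-aware vector per chunk.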
Key Requirement: Long-context embedding models are essential. Check the MTEB Embedding Leaderboard on Hugging Face for options such as Mistral- and Qwen-based models (context windows up to 32k tokens). The example uses Jina AI's jina-embeddings-v3 model (~8k tokens).
N8N Implementation Notes
Implementing late chunking in N8N currently requires some custom work, as native nodes don't fully support the required models or parameters (such as Jina's `late_chunking` flag).
The described workflow involves:
- Fetching and extracting text (e.g., from a PDF).
- Handling Very Large Documents: If the document exceeds the embedding model's context limit (e.g., Jina V3's ~8k tokens), generate a summary of the entire document first. This summary is later prepended to large chunks before embedding to retain some overall context.
- Large Chunking (Code Node): Split the document into large chunks fitting the embedding model's limit (e.g., 28,000 characters for Jina V3). Custom JavaScript is needed as N8N's text splitters are tied to vector store nodes.
- Loop & Granular Chunking (Code Node): Within each large chunk, apply finer chunking (e.g., 1000 characters, 200 overlap). Again, custom JS is used (see the sketch after this list).
- Aggregate & Add Summary: Collect granular chunks from the large chunk. Prepend the overall document summary if created.
- Embed via API (HTTP Request Node): Send the list of granular chunks together to the Jina AI embedding API, enabling `late_chunking` and setting the `task` to `retrieval.passage`.
- Format & Upsert (Code Node, HTTP Request Node): Prepare the data (embeddings, text, IDs) for the vector store (e.g., Qdrant) and upsert via its API.
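As a reference point for the two chunking Code nodes above, here is a rough sketch of the custom JavaScript involved. It assumes the extracted text arrives on the first input item as `text` and uses the example sizes from the workflow; adjust field names and sizes to your own setup:

```javascript
// N8N Code node sketch ("Run Once for All Items"): two-level chunking for late chunking.
// Assumes the extracted document text is available as `text` on the first input item.

const text = $input.first().json.text || "";

const LARGE_CHUNK_SIZE = 28000; // fits Jina V3's ~8k-token window (example value)
const CHUNK_SIZE = 1000;        // granular chunk size (example value)
const CHUNK_OVERLAP = 200;      // overlap must stay smaller than CHUNK_SIZE

function splitWithOverlap(input, size, overlap) {
  const chunks = [];
  for (let start = 0; start < input.length; start += size - overlap) {
    chunks.push(input.slice(start, start + size));
    if (start + size >= input.length) break;
  }
  return chunks;
}

// Step 1: large chunks sized to the embedding model's context limit.
const largeChunks = splitWithOverlap(text, LARGE_CHUNK_SIZE, 0);

// Step 2: granular chunks within each large chunk, emitted one item per large chunk
// so a later HTTP Request node can send each group to the embedding API together.
return largeChunks.map((large, index) => ({
  json: {
    largeChunkIndex: index,
    granularChunks: splitWithOverlap(large, CHUNK_SIZE, CHUNK_OVERLAP),
  },
}));
```

The HTTP Request node then posts each item's `granularChunks` array to the embedding endpoint in a single call, so the chunks within a large chunk are embedded with shared context.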
Testing Insights: In tests with F1 regulations, late chunking provided accurate answers and retrieved more detailed information compared to a basic RAG setup, suggesting improved chunk relevance.
Evaluation: Jina AI's benchmarks show late chunking significantly improves retrieval, especially for longer documents where more context can be lost.
Technique 2: Contextual Retrieval - Adding LLM-Generated Context
Introduced by Anthropic, Contextual Retrieval uses the power of long-context LLMs (not embedding models) to explicitly add context to each chunk before embedding.
How Contextual Retrieval Works
- Load & Chunk Document: Split the document into chunks using your preferred strategy.
- Generate Contextual Descriptions (LLM): For each chunk:
- Send the chunk and the original full document to a long-context LLM.
- Prompt the LLM to generate a brief (e.g., one-sentence) description explaining how this chunk relates to the whole document.
- Combine Text: Prepend the LLM's description to the original chunk text.
- Embed Combined Text: Use a standard embedding model to embed this description-enhanced chunk.
- Store Embedding: Store the resulting context-rich embedding.
Using the Berlin example, for "Its population is X," the LLM might generate: "This chunk provides population data for Berlin." The text embedded would be: "This chunk provides population data for Berlin. Its population is X."
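Here is a minimal JavaScript sketch of the description-and-combine steps. The prompt paraphrases the one Anthropic suggests in its contextual retrieval write-up, and `callLLM` is a placeholder for whichever LLM client you use (Gemini, Claude, etc.):

```javascript
// Sketch of contextual retrieval: build the contextualization prompt per chunk,
// then prepend the LLM's description to the chunk text before embedding.
// `callLLM` is a placeholder for your own LLM call.

function buildContextPrompt(fullDocument, chunk) {
  // Prompt modeled on Anthropic's suggested contextual retrieval prompt:
  // situate the chunk within the whole document and answer with the context only.
  return [
    "<document>",
    fullDocument,
    "</document>",
    "Here is the chunk we want to situate within the whole document:",
    "<chunk>",
    chunk,
    "</chunk>",
    "Give a short, succinct context to situate this chunk within the overall",
    "document for the purposes of improving search retrieval of the chunk.",
    "Answer only with the succinct context and nothing else.",
  ].join("\n");
}

async function contextualizeChunk(fullDocument, chunk, callLLM) {
  const description = await callLLM(buildContextPrompt(fullDocument, chunk));
  // Prepend the description so the embedding carries the document-level context.
  return `${description.trim()}\n\n${chunk}`;
}
```

For the Berlin example, the combined text produced here is exactly the kind of description-plus-chunk string quoted above, which is then passed to a standard embedding model.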
The Power of Context Caching
Calling an LLM for every chunk, potentially sending a large document each time, sounds slow and expensive. Context Caching (or prompt caching), offered by models like Google's Gemini and Anthropic's Claude, makes this feasible.
How Caching Helps: The large document is sent and cached once. Subsequent calls only need to send the new chunk and prompt, referencing the cached document ID. This drastically reduces processing time and cost.
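Here is a rough sketch of what that looks like against the Gemini REST API, based on the caching documentation linked at the end of this article. The model version, TTL, and key handling are assumptions to adapt to your setup, and caching only applies above the model's minimum token count:

```javascript
// Sketch of Gemini context caching via the REST API (field names and model version
// should be checked against the current caching docs).
const BASE = "https://generativelanguage.googleapis.com/v1beta";
const API_KEY = process.env.GEMINI_API_KEY;      // assumption: key supplied via env var
const MODEL = "models/gemini-1.5-flash-001";     // caching requires a pinned model version

async function cacheDocument(documentText) {
  // One-time call: store the large document and get back a cache name/ID.
  const res = await fetch(`${BASE}/cachedContents?key=${API_KEY}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: MODEL,
      contents: [{ role: "user", parts: [{ text: documentText }] }],
      ttl: "3600s", // keep the cache alive for an hour
    }),
  });
  const data = await res.json();
  return data.name; // e.g., "cachedContents/abc123"
}

async function describeChunk(cacheName, chunkText) {
  // Per-chunk call: only the prompt and chunk are sent; the document is referenced via the cache.
  const res = await fetch(`${BASE}/${MODEL}:generateContent?key=${API_KEY}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      cachedContent: cacheName,
      contents: [{
        role: "user",
        parts: [{ text: `Situate this chunk within the cached document in one sentence:\n${chunkText}` }],
      }],
    }),
  });
  const data = await res.json();
  return data.candidates?.[0]?.content?.parts?.[0]?.text ?? "";
}
```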
N8N Implementation Notes
This also requires custom N8N setup, leveraging an LLM with caching (e.g., Gemini 1.5 Flash).
The workflow involves:
- Fetching and extracting text.
- Estimating Tokens & Caching (Code Node, HTTP Request Node): Check whether the document is large enough for caching (e.g., >32k tokens for Gemini 1.5 Flash). If so, call the Gemini `cachedContents` API and store the returned cache ID.
- Granular Chunking (Code Node): Split the document text into chunks.
- Loop & Batching (Loop Over Items Node): Process chunks in batches (e.g., 25) with delays (e.g., 30s) to avoid hitting LLM API rate limits, especially tokens-per-minute; see the batching sketch after this list.
- Create Prompt (Code Node): Construct the prompt for the LLM, including instructions (Anthropic's suggested prompt works well), the chunk text, and the cached document reference.
- Generate Description (Gemini Node / HTTP Request Node): Call the LLM (e.g., Gemini 1.5 Flash) with the prompt.
- Combine Texts (Code Node): Prepend the generated description to the chunk text.
- Embed & Upsert (Qdrant Vector Store Node): Use a standard embedding model (e.g., OpenAI Ada) to embed the combined text. Configure the node's text splitter with a very large chunk size so the combined text isn't re-chunked, then upsert to the vector store.
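For reference, the batching-and-delay logic from the loop step boils down to something like the following. In N8N it is typically a Loop Over Items node plus a Wait node, but plain JavaScript makes the idea explicit; `describeChunk` stands in for the per-chunk LLM call:

```javascript
// Sketch of the batching + delay logic used to stay under LLM rate limits.
// Batch size and delay are the example values from the workflow (25 chunks, ~30s).

const BATCH_SIZE = 25;
const DELAY_MS = 30_000;

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function processInBatches(chunks, describeChunk) {
  const results = [];
  for (let i = 0; i < chunks.length; i += BATCH_SIZE) {
    const batch = chunks.slice(i, i + BATCH_SIZE);
    // Describe all chunks in this batch (the per-chunk LLM call from above).
    results.push(...(await Promise.all(batch.map(describeChunk))));
    // Pause between batches so tokens-per-minute limits aren't exceeded.
    if (i + BATCH_SIZE < chunks.length) await sleep(DELAY_MS);
  }
  return results;
}
```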
Challenges & Considerations:
- Rate Limits: Hitting LLM API limits (especially token limits, which often include cached tokens) is a major hurdle. Batching, delays, and retry logic are essential.
- Ingestion Time: This method is much slower than standard RAG or late chunking (e.g., 27 minutes for a 180-page PDF).
- Cost: While caching reduces costs significantly compared to no caching, it's still more expensive than embedding-only methods. The example F1 document cost ~$1.30 to process with Gemini 1.5 Flash. This may be prohibitive for massive datasets but justifiable for high-value information where accuracy is paramount.
Testing Insights: Contextual retrieval produced the most detailed and thorough answers in the F1 tests. The LLM-generated descriptions effectively grounded technical terms, likely leading to superior retrieval relevance.
Evaluation: Anthropic's benchmarks showed contextual retrieval significantly reduced retrieval failures (a 35% reduction in the retrieval failure rate, and 67% when combined with a reranker such as Cohere's).
Conclusion: Choosing the Right Technique
Both Late Chunking and Contextual Retrieval offer compelling solutions to the lost context problem in RAG, significantly boosting retrieval accuracy.
- Late Chunking:
- Pros: Faster ingestion, lower cost than Contextual Retrieval, and document-level context is preserved directly in the chunk embeddings.
- Cons: Requires specific long-context embedding models, needs custom implementation in tools like N8N currently.
- Contextual Retrieval:
- Pros: Excellent context embedding via LLM descriptions, uses standard embedding models.
- Cons: Much slower ingestion, higher cost (even with caching), prone to LLM rate limiting issues, requires long-context LLMs with caching.
The best approach depends on your specific needs:
- Consider Late Chunking if you have access to suitable embedding models and prioritize speed and lower cost.
- Consider Contextual Retrieval for smaller, high-value datasets where maximum accuracy justifies the higher ingestion time and cost, provided you can manage rate limits.
Ultimately, testing these techniques with your own data and use case is crucial for determining the optimal path forward.
Related Resources
- Contextual Retrieval (Anthropic): https://www.anthropic.com/news/contextual-retrieval
- Jina AI Embeddings: https://jina.ai/embeddings/
- Jina AI Late Chunking Article: https://jina.ai/news/late-chunking-in-long-context-embedding-models/
- Gemini API Context Caching: https://ai.google.dev/gemini-api/docs/caching
- Embedding Model Leaderboard: https://huggingface.co/spaces/mteb/leaderboard