Engineering
September 20, 2024

Advanced RAG with Document Summarization

Mohammed Rafiq, Co-Founder and CTO

Retrieval-Augmented Generation (RAG) has become a key technique in building applications powered by large language models (LLMs), enabling these models to retrieve domain-specific data from external sources. However, as the document collection grows, the challenge of ensuring comprehensive retrieval across all relevant documents becomes critical. 

Ragie has implemented an advanced RAG pipeline that incorporates document summarization to enhance retrieval relevancy and increase the number of documents involved in the result set. This post provides a technical breakdown of how Ragie has designed this system to overcome the limitations of single-step retrieval in traditional RAG setups.

The Limitation of Traditional RAG Systems

In a conventional RAG pipeline, the retrieval process typically follows these steps (a minimal code sketch follows the list):

  • Chunking: Documents are split into smaller, manageable chunks so that each query can be matched against more granular data.
  • Embedding: Each chunk is vectorized using an embedding model such as OpenAI’s `text-embedding-3-large` to capture semantic meaning.
  • Indexing: The chunk embeddings are stored in a vector database, like Pinecone, for fast retrieval.
  • Retrieval: At query time, the query is vectorized and compared with the stored chunk embeddings to retrieve the top-k matching chunks based on vector similarity.
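To make these steps concrete, here is a minimal sketch of such a single-step pipeline using the OpenAI and Pinecone Python clients mentioned above. The index name, metadata fields, and top-k value are illustrative assumptions, not Ragie's actual configuration.

```python
# Minimal single-step RAG retrieval. Assumes a Pinecone index named "chunks"
# already populated with chunk embeddings and a "text" metadata field.
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
chunk_index = pc.Index("chunks")

def retrieve(query: str, top_k: int = 8) -> list[str]:
    # Embed the query with the same model used at indexing time.
    query_vector = openai_client.embeddings.create(
        model="text-embedding-3-large",
        input=query,
    ).data[0].embedding

    # Single nearest-neighbor search over every chunk in the corpus.
    results = chunk_index.query(
        vector=query_vector, top_k=top_k, include_metadata=True
    )
    return [match.metadata["text"] for match in results.matches]
```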

While this approach works well for smaller datasets, it introduces bias as the dataset grows: the top-k results often come from one or very few documents, missing relevant content spread across the rest of the corpus. This imbalance can limit the model’s ability to provide comprehensive responses, especially when relevant information is distributed across multiple documents.

Ragie’s Two-Step Retrieval with Document Summarization

To address these limitations, Ragie has implemented a two-step retrieval process that utilizes document summarization to improve retrieval relevancy and document coverage. The system involves both a Summary Index and a Chunk Index, enabling a more structured approach to retrieving relevant information.

Document Summarization

The first innovation is the automatic summarization of documents. Ragie uses the Gemini 1.5 Flash model for summarization due to its ability to handle large context windows—up to 1 million tokens. The summarization process condenses each document into a single chunk, typically about one-tenth the length of the original document, while preserving the core information. 
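As a rough sketch, the summarization call with the `google-generativeai` SDK might look like the following. The prompt wording and the one-tenth compression target in the prompt are illustrative assumptions, not Ragie's production prompt.

```python
# Sketch of the summarization step with Gemini 1.5 Flash. The prompt is an
# illustrative assumption; Ragie's production prompt is not public.
import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

def summarize_document(document_text: str) -> str:
    prompt = (
        "Summarize the following document to roughly one-tenth of its "
        "length, preserving key facts, entities, and terminology:\n\n"
        + document_text
    )
    # Gemini 1.5 Flash accepts up to ~1M tokens of input context,
    # so most documents fit in a single call.
    response = model.generate_content(prompt)
    return response.text
```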

These document summaries are stored in a dedicated Summary Index, where each summary is associated with its original document. This allows Ragie to perform a quick, high-level search across the entire dataset based on document relevance, rather than directly diving into the chunks.

Embedding and Indexing

Once summarized, these condensed versions are embedded using a higher-dimensional embedding model. Ragie uses OpenAI’s `text-embedding-3-large` model at its full 3072 dimensions to capture the full semantic meaning of each document’s key information. The Summary Index stores these embeddings alongside the document metadata, allowing for efficient lookup.

Meanwhile, the Chunk Index continues to store the chunk-level embeddings, vectorized with the same `text-embedding-3-large` model shortened to 1536 dimensions (the model supports reduced-width vectors via its `dimensions` parameter). Each chunk is stored with metadata, such as its document ID and other relevant attributes, facilitating the second layer of retrieval.
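Because both indexes use the same model at different widths, the indexing side reduces to a single embedding helper. A sketch, with index names and metadata layout assumed:

```python
# Sketch: one summary vector per document at 3072 dims, many chunk vectors
# per document at 1536 dims. Index names and metadata fields are assumed.
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
summary_index = pc.Index("summaries")  # 3072-dimensional index
chunk_index = pc.Index("chunks")       # 1536-dimensional index

def embed(text: str, dimensions: int) -> list[float]:
    # text-embedding-3-large supports shortened vectors via `dimensions`.
    return openai_client.embeddings.create(
        model="text-embedding-3-large", input=text, dimensions=dimensions
    ).data[0].embedding

def index_document(doc_id: str, summary: str, chunks: list[str]) -> None:
    # One summary vector, keyed so it maps back to its source document.
    summary_index.upsert([(doc_id, embed(summary, 3072), {"document_id": doc_id})])
    # One vector per chunk, each carrying its parent document ID as metadata.
    chunk_index.upsert([
        (f"{doc_id}-{i}", embed(chunk, 1536), {"document_id": doc_id, "text": chunk})
        for i, chunk in enumerate(chunks)
    ])
```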

Two-Tiered Retrieval Process

Ragie’s system performs a two-tiered search for each query, ensuring comprehensive coverage across documents (a code sketch follows the list):

  • Document-Level Retrieval: The first step uses the Summary Index to find the top-k most relevant documents based on cosine similarity between the query and the document summaries. The system retrieves document IDs corresponding to the most relevant summaries, reducing the search space to the documents that matter most for the query.
  • Chunk-Level Retrieval: Once the relevant documents are identified, Ragie performs a second search, this time within the Chunk Index, using the document IDs from step 1. For each document, the most relevant chunks are retrieved. To control the breadth of retrieval, the system allows developers to configure the `max_chunks_per_document` parameter, limiting the number of chunks returned per document and thus ensuring that the results span a wider range of documents.
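Reusing the `embed` helper and index handles from the indexing sketch above, the two tiers might be wired together as follows. Pinecone's `$in` metadata filter is a real feature; the per-document cap shown here is an illustrative client-side interpretation of `max_chunks_per_document`, not Ragie's actual implementation.

```python
# Sketch of the two-tiered query, reusing `embed`, `summary_index`, and
# `chunk_index` from the indexing sketch above.
def two_tier_retrieve(query: str, top_k_docs: int = 5,
                      max_chunks_per_document: int = 3) -> list[str]:
    # Tier 1: document-level retrieval against the Summary Index.
    doc_hits = summary_index.query(
        vector=embed(query, 3072), top_k=top_k_docs, include_metadata=True
    )
    doc_ids = [m.metadata["document_id"] for m in doc_hits.matches]

    # Tier 2: chunk-level retrieval, restricted to those documents.
    chunk_hits = chunk_index.query(
        vector=embed(query, 1536),
        top_k=top_k_docs * max_chunks_per_document,
        filter={"document_id": {"$in": doc_ids}},
        include_metadata=True,
    )

    # Cap chunks per document so the result set spans more documents.
    kept: list[str] = []
    per_doc: dict[str, int] = {}
    for m in chunk_hits.matches:  # matches arrive sorted by similarity
        doc = m.metadata["document_id"]
        if per_doc.get(doc, 0) < max_chunks_per_document:
            per_doc[doc] = per_doc.get(doc, 0) + 1
            kept.append(m.metadata["text"])
    return kept
```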

This two-tiered approach ensures that the final top-k results are not only the most relevant chunks but also represent a broader set of documents, addressing the issue of chunking bias that’s common in traditional RAG systems. 

[Figure: Ragie's Two-Step Retrieval Process]

Ranking and Final Output

Once both the document- and chunk-level retrievals are complete, Ragie ranks all the chunks from the second step by cosine similarity to the original query. As a last step, we use an LLM re-ranker to rank the final chunk result set. This ensures that the most semantically relevant information is returned while drawing from a more diversified set of documents.
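A minimal sketch of this final stage appears below, reusing the `embed` helper from earlier. The `llm_rerank` function is a hypothetical stub: the post does not specify which re-ranking model Ragie uses, so only the cosine-similarity pass is concrete.

```python
# Sketch of the final ranking stage. `llm_rerank` is a hypothetical stub;
# the re-ranking model Ragie actually uses is not described here.
import numpy as np

def cosine(a: list[float], b: list[float]) -> float:
    u, v = np.asarray(a), np.asarray(b)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def llm_rerank(query: str, chunks: list[dict]) -> list[dict]:
    # Placeholder: in practice this would prompt an LLM to score each
    # candidate chunk against the query and reorder accordingly.
    return chunks

def rank_results(query: str, chunks: list[dict]) -> list[dict]:
    # First pass: order all retrieved chunks by cosine similarity to the
    # query embedding (each chunk dict carries its stored embedding).
    q = embed(query, 1536)
    chunks.sort(key=lambda c: cosine(q, c["embedding"]), reverse=True)
    # Second pass: let the LLM re-ranker produce the final ordering.
    return llm_rerank(query, chunks)
```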

Conclusion

By implementing document summarization and a two-step retrieval process, Ragie has enhanced the way Retrieval-Augmented Generation works in practice. This approach mitigates the bias toward individual documents often seen in single-step RAG systems, improving retrieval relevancy and ensuring that the information comes from a wider set of documents.

For developers looking to build advanced RAG-based systems, this method provides a more accurate and efficient way to retrieve relevant information, especially in large-scale document sets. With Ragie’s summary-based indexing, the retrieval process becomes not only more performant but also more comprehensive.