Engineering
October 22, 2024

How Ragie Outperformed the FinanceBench Test

Mohammed Rafiq, Co-Founder and CTO

In this article, we’ll walk you through how Ragie ingested the more than 50,000 pages in the FinanceBench dataset (360 PDF files, each roughly 150-250 pages long) in just 4 hours, roughly 12,500 pages per hour, and outperformed the benchmarks in key areas, most notably the Shared Store configuration, where we beat the benchmark by 42%.

For those unfamiliar, FinanceBench is a rigorous benchmark designed to evaluate RAG systems on real-world financial documents, such as 10-K filings and earnings reports from public companies. These documents are dense, often spanning hundreds of pages, and mix structured data like tables and charts with unstructured text, making them challenging for RAG systems to ingest, retrieve from, and answer accurately.

In the FinanceBench test, RAG systems are tasked with answering real-world financial questions by retrieving relevant information from a dataset of 360 PDFs. The retrieved chunks are fed into a large language model (LLM) to generate the final answer. This test pushes RAG systems to their limits, requiring accurate retrieval across a vast dataset and precise generation from complex financial data.
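To make that loop concrete, here is a minimal, self-contained sketch of the retrieve-then-generate flow the test exercises. The keyword-overlap retriever and the two-chunk corpus are toy stand-ins (a real system runs vector search over the full 360-PDF index), and the prompt shape is our illustration, not the benchmark’s actual harness; the assembled prompt would be sent to an LLM such as GPT-4o.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    page: int
    text: str

# Toy corpus standing in for the indexed 360-PDF dataset (contents invented
# for illustration, except the AMD fact quoted later in this post).
CORPUS = [
    Chunk("AMD_2022_10K", 1, "One customer accounted for 16% of consolidated net revenue."),
    Chunk("AMD_2022_10K", 2, "Revenue is disaggregated by reportable segment in the notes."),
]

def retrieve(question: str, top_k: int) -> list[Chunk]:
    """Keyword-overlap stand-in for the vector search a real RAG system runs."""
    terms = set(question.lower().split())
    scored = [(len(terms & set(c.text.lower().split())), c) for c in CORPUS]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored[:top_k] if score > 0]

def build_prompt(question: str, chunks: list[Chunk]) -> str:
    """Assemble the retrieved chunks into the prompt sent to the LLM."""
    context = "\n\n".join(f"[{c.doc_id} p.{c.page}] {c.text}" for c in chunks)
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, refuse.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

question = "Did AMD report customer concentration in FY22?"
print(build_prompt(question, retrieve(question, top_k=8)))
```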

The Complexity of Document Ingestion in FinanceBench

Ingesting complex financial documents at scale is a critical challenge in the FinanceBench test. These filings contain crucial financial information, legal jargon, and multi-modal content, and they require advanced ingestion capabilities to ensure accurate retrieval.

  • Document Size and Format Complexity: Financial datasets consist of structured tables and unstructured text, requiring a robust ingestion pipeline capable of parsing and processing both data types. 
  • Handling Large Documents: A single 10-K filing often exceeds 150 pages, so a RAG system must efficiently manage thousands of pages while ensuring that ingestion speed does not compromise accuracy (a tough capability to build). 

How We Evaluated Ragie Using the FinanceBench Test

The RAG system was tasked with answering 150 complex real-world financial questions. This rigorous evaluation process was pivotal in understanding how effectively Ragie could retrieve and generate answers compared to the gold answers set by human annotators. 

Each entry features a question (e.g., "Did AMD report customer concentration in FY22?"), the corresponding answer (e.g., “Yes, one customer accounted for 16% of consolidated net revenue”), and an evidence string that provides the necessary information to verify the accuracy of the answer, along with the relevant document's page number. 
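For illustration, such an entry might be represented like this; the field names are paraphrased for readability and are not necessarily the dataset’s exact schema.

```python
# Illustrative FinanceBench-style entry (field names paraphrased).
entry = {
    "question": "Did AMD report customer concentration in FY22?",
    "gold_answer": "Yes, one customer accounted for 16% of consolidated net revenue",
    "evidence_text": "One customer accounted for 16% of our consolidated net revenue...",
    "doc_name": "AMD_2022_10K",  # source document (hypothetical file name)
    "evidence_page": None,       # page number of the evidence in the PDF (omitted here)
}
```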

Grading Criteria (a minimal scoring sketch follows this list):
  1. Accuracy: Responses that matched the gold answers.
  2. Refusals: Cases where the LLM declined to answer, reducing the likelihood of hallucinations.
  3. Inaccurate Responses: Cases where incorrect answers were generated.
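In code, the grading reduces to bucketing each response and reporting the three rates. The judging step itself is abstracted away here: this sketch assumes each response has already been compared against the gold answer and labeled.

```python
from collections import Counter

def grade(labels: list[str]) -> dict[str, float]:
    """Report the share of correct / refusal / incorrect responses.
    Assumes each response was already judged against the gold answer
    and labeled (e.g., by a human or LLM judge)."""
    buckets = Counter(labels)
    total = len(labels)
    return {k: buckets[k] / total for k in ("correct", "refusal", "incorrect")}

# Example using the 150-question Single-Store run with rerank on,
# reported later in this post: 50% correct, 37% refusals, 13% incorrect.
labels = ["correct"] * 75 + ["refusal"] * 56 + ["incorrect"] * 19
print(grade(labels))  # {'correct': 0.5, 'refusal': 0.373..., 'incorrect': 0.126...}
```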

Ragie’s Performance vs. FinanceBench Benchmarks

We evaluated Ragie across two configurations:

Single-Store Retrieval: In this setup, the vector database contains chunks from a single document, and retrieval is limited to that document. Despite being simpler, this setup still presents challenges when dealing with large, complex financial filings.

We matched the benchmark for Single-Store retrieval, achieving 51% accuracy using the setup below:

top_k=32, no rerank

Shared Store Retrieval: In this more complex setup, the vector database contains chunks from all 360 documents, requiring retrieval across the entire dataset. Ragie achieved 27% accuracy against the benchmark’s 19%, a 42% relative improvement ((27 − 19) / 19 ≈ 0.42), using this setup:

top_k=8, no rerank
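As a sketch of how these two configurations map onto an actual retrieval call: the endpoint and parameter names below are assumptions based on Ragie’s public REST API, so verify them against the current API docs before relying on this.

```python
import requests

# Assumed endpoint and parameter names; check Ragie's API docs for the
# authoritative schema.
RAGIE_API = "https://api.ragie.ai/retrievals"
API_KEY = "YOUR_API_KEY"

def retrieve(query: str, top_k: int, rerank: bool) -> dict:
    resp = requests.post(
        RAGIE_API,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"query": query, "top_k": top_k, "rerank": rerank},
    )
    resp.raise_for_status()
    return resp.json()

# Single-Store benchmark configuration: top_k=32, no rerank.
single_store = retrieve("Did AMD report customer concentration in FY22?", 32, False)

# Shared Store benchmark configuration: top_k=8, no rerank.
shared_store = retrieve("Did AMD report customer concentration in FY22?", 8, False)
```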

Shared Store retrieval is the more challenging task: because retrieval happens across all 360 documents simultaneously, ensuring relevance and precision becomes significantly more difficult. The RAG system must manage content from many different sources and maintain high retrieval accuracy despite the much larger scope of data.

Key Insights:

  • In a second Single-Store run with top_k=8, we tested with rerank on and off:
    - Without rerank: 50% correct, 32% refusals, and 18% incorrect answers.
    - With rerank: still 50% correct, but refusals increased to 37% and incorrect answers dropped to 13%.
    - Conclusion: Reranking cut incorrect answers from 18% to 13% (a roughly 28% relative reduction), effectively reducing hallucinations at the cost of more refusals.
  • There was no significant difference between GPT-4o and GPT-4 Turbo’s performance during this test.

Why Ragie Outperforms: The Technical Advantages

  • Advanced Ingestion Process: Ragie's advanced extraction in hi_res mode captures the full content of each PDF using the multi-step extraction process described below (and sketched in code after this list):
    - Text Extraction: First, we efficiently extract text from PDFs during ingestion to retain the core information.
    - Tables and Figures: For more complex elements like tables and images, we use advanced optical character recognition (OCR) techniques to extract structured data accurately.
    - LLM Vision Models: Ragie also uses LLM vision models to generate descriptions for images, charts, and other non-text elements. This adds a semantic layer to the extraction process, making the ingested data richer and more contextually relevant.
  • Hybrid Search: We use hybrid search by default, combining semantic search (for understanding context) with keyword-based retrieval (for capturing exact terms). This dual approach delivers both precision and recall: exact financial jargon in the FinanceBench dataset is captured by the keyword side while the semantic side supplies context, significantly improving the relevance of retrievals (see the fusion sketch after this list).
  • Scalable Architecture: While many RAG systems experience performance degradation as dataset size increases, Ragie’s architecture maintains high performance even with 50,000+ pages. Ragie also uses a summary index for hierarchical and hybrid hierarchical search; this enhances chunk retrieval by processing chunks in layers and preserving context, so the chunks retrieved for generation remain highly relevant.
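To make the multi-step extraction concrete, here is a minimal sketch of that pipeline shape. The pypdf text-extraction call is real, but the OCR and vision steps are placeholder stubs (ocr_tables and caption_images are hypothetical names); Ragie’s production hi_res pipeline is proprietary and far more sophisticated.

```python
from dataclasses import dataclass, field
from pypdf import PdfReader  # pip install pypdf

@dataclass
class PageRecord:
    page: int
    text: str
    tables: list[str] = field(default_factory=list)
    image_captions: list[str] = field(default_factory=list)

def ocr_tables(page) -> list[str]:
    """Placeholder for the OCR step that recovers structured tables;
    a real pipeline would run a layout-aware OCR engine here."""
    return []

def caption_images(page) -> list[str]:
    """Placeholder for the vision step; a real pipeline would send page
    images to an LLM vision model and store the returned descriptions."""
    return []

def ingest(pdf_path: str) -> list[PageRecord]:
    records = []
    for i, page in enumerate(PdfReader(pdf_path).pages):
        records.append(PageRecord(
            page=i + 1,
            text=page.extract_text() or "",       # step 1: plain text extraction
            tables=ocr_tables(page),              # step 2: OCR for tables/figures
            image_captions=caption_images(page),  # step 3: vision-model descriptions
        ))
    return records
```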
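For the hybrid search point, here is a toy illustration of one common way to fuse semantic and keyword result lists: reciprocal rank fusion. This post doesn’t specify Ragie’s actual fusion method, so treat this as a generic technique rather than Ragie’s implementation.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked result lists (e.g., one from vector search, one from
    keyword search) with reciprocal rank fusion. k dampens the influence
    of low-ranked results."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["chunk_12", "chunk_7", "chunk_3"]   # from embedding similarity
keyword  = ["chunk_7", "chunk_42", "chunk_12"]  # from BM25 / exact-term match
print(reciprocal_rank_fusion([semantic, keyword]))
# chunk_7 and chunk_12 rise to the top because both retrievers agree on them.
```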

Conclusion

Before making a Build vs Buy decision, developers must consider a range of performance metrics, including scalability, ingestion efficiency, and retrieval accuracy. In this rigorous test against FinanceBench, Ragie demonstrated its ability to handle large-scale, complex financial documents with exceptional speed and precision, outperforming the Shared Store accuracy benchmark by 42%.

If you’d like to see how Ragie can handle your own large-scale or multi-modal documents, you can try Ragie’s Free Developer Plan. 

Feel free to reach out to us at support@ragie.ai if you're interested in running the FinanceBench test yourself.