Engineering
December 6, 2024

Evaluating Ragie Against Real Legal Documents Using LegalBench-RAG

Mohammed Rafiq, Co-Founder and CTO

To ensure Ragie delivers strong performance in real-world applications, we regularly benchmark it against complex datasets across various domains to see how it would perform in a production environment. In this article, we’ll walk you through a focused evaluation of our RAG pipeline using LegalBench-RAG, a benchmark designed to evaluate the retrieval component of RAG systems in the legal domain.

LegalBench-RAG focuses on precision (how much of the retrieved text is relevant) and recall (how much of the relevant text is retrieved) to measure how effectively RAG systems handle the complexities of legal texts.
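As a minimal sketch (not the benchmark's actual scoring code), here is how precision@k and recall@k are typically computed for a single query when retrieval returns a ranked list of items:

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Precision@k and recall@k for one query.

    retrieved: ranked list of retrieved item IDs
    relevant:  set of ground-truth (relevant) item IDs
    """
    top_k = retrieved[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 2 of the top-4 results are relevant, out of 3 relevant items overall.
print(precision_recall_at_k(["a", "b", "c", "d"], {"a", "c", "e"}, k=4))
# -> (0.5, 0.6666...)
```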

LegalBench-RAG includes four categories of legal documents:

  1. NDA-related documents (ContractNLI)
  2. Private contracts (Contract Understanding Atticus Dataset, CUAD)
  3. M&A agreements of public companies (Merger Agreement Understanding Dataset, MAUD)
  4. Privacy policies of consumer apps (PrivacyQA)

This benchmarking exercise provides valuable insights into Ragie’s effectiveness in handling legal queries, where accuracy, contextual understanding, and relevance are essential. Through this evaluation, we also demonstrate how Ragie offers both precision and recall that surpass industry standards.

Our Methodology for Evaluating Ragie Against LegalBench-RAG

While we evaluated the same metrics, we had to measure them slightly differently because of Ragie's chunking approach, which more closely matches how production RAG systems behave. This difference also highlights the importance of an effective chunking strategy in real-world scenarios.
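Concretely, because Ragie returns chunks rather than the benchmark's exact annotated snippets, one chunk-aware way to score retrieval is over character spans: what fraction of the retrieved text overlaps a gold snippet, and what fraction of the gold snippets is covered by retrieved text. The sketch below is illustrative, not our exact evaluation harness:

```python
def span_overlap(a_start, a_end, b_start, b_end):
    """Length of overlap between two half-open character spans."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))

def char_precision_recall(chunks, gold_spans):
    """Character-level precision/recall for one query.

    chunks:     list of (start, end) spans of retrieved chunks
    gold_spans: list of (start, end) ground-truth snippet spans
    Assumes the spans within each list do not overlap one another.
    """
    retrieved_len = sum(e - s for s, e in chunks)
    gold_len = sum(e - s for s, e in gold_spans)
    overlap = sum(span_overlap(cs, ce, gs, ge)
                  for cs, ce in chunks for gs, ge in gold_spans)
    precision = overlap / retrieved_len if retrieved_len else 0.0
    recall = overlap / gold_len if gold_len else 0.0
    return precision, recall
```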

The Results: Ragie’s Performance vs. LegalBench Benchmarks

We evaluated Ragie's precision and recall across three configurations (hybrid retrieval, hybrid retrieval with re-ranking, and hierarchical retrieval) at k values from 1 to 64, using a recursive text splitter for chunking.
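A recursive text splitter works by trying progressively finer separators (paragraphs, then lines, then sentences, then words) until each chunk fits a size budget. Here is a minimal sketch using LangChain's RecursiveCharacterTextSplitter; the chunk size, overlap, and file name are illustrative assumptions, not our exact settings:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Illustrative settings, not necessarily those used in this evaluation.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # max characters per chunk
    chunk_overlap=100,  # characters shared by adjacent chunks
    separators=["\n\n", "\n", ". ", " ", ""],  # coarse to fine
)

with open("nda.txt") as f:  # hypothetical input document
    chunks = splitter.split_text(f.read())
```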

Benchmarking with Hybrid retrieval:

With Hybrid retrieval (semantic + keyword) enabled, which is the default behavior in Ragie, the table below shows Ragie’s precision across different k values:

| Precision @ k (Benchmark / Ragie) | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|---|
| PrivacyQA | 14.38 / 62.3 | 13.55 / 46.3 | 12.34 / 40.4 | 9.03 / 32.8 | 6.06 / 26.5 | 4.17 / 21.2 | 2.81 / 16.5 |
| ContractNLI | 6.63 / 68 | 5.29 / 54.1 | 3.89 / 50.2 | 2.81 / 44.5 | 1.98 / 39 | 1.29 / 34.7 | 0.9 / 32.9 |
| MAUD | 2.65 / 58.2 | 1.77 / 48.9 | 1.96 / 45.2 | 1.4 / 42.2 | 1.39 / 38.3 | 1.15 / 35 | 0.82 / 32 |
| CUAD | 1.97 / 26.2 | 4.03 / 22.4 | 4.83 / 22.4 | 4.2 / 19.1 | 2.94 / 15.6 | 1.99 / 12.5 | 1.25 / 9.8 |
| ALL | 6.41 / 53.675 | 6.16 / 42.925 | 5.76 / 39.55 | 4.36 / 34.65 | 3.09 / 29.85 | 2.15 / 25.85 | 1.45 / 22.8 |
| Ragie Outperformed By (%) | +737% | +597% | +587% | +695% | +866% | +1102% | +1472% |

Ragie’s recall across different k values with Hybrid retrieval enabled:

| Recall @ k (Benchmark / Ragie) | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|---|
| PrivacyQA | 8.85 / 40.5 | 15.21 / 45.5 | 27.92 / 63.2 | 42.37 / 79.7 | 55.12 / 92.9 | 71.19 / 98.7 | 84.19 / 99.1 |
| ContractNLI | 7.63 / 53.9 | 11.33 / 62.4 | 17.34 / 81.1 | 24.99 / 93.0 | 35.8 / 96.1 | 46.57 / 96.9 | 61.72 / 99.4 |
| MAUD | 1.65 / 28.5 | 2.09 / 32.4 | 4.59 / 39.9 | 6.18 / 50.3 | 12.93 / 64.0 | 21.04 / 77.4 | 28.28 / 84.6 |
| CUAD | 1.62 / 16.7 | 8.11 / 26.7 | 17.72 / 42.6 | 31.68 / 63.6 | 44.38 / 80.4 | 60.04 / 90.3 | 74.7 / 97.1 |
| ALL | 4.94 / 34.9 | 9.19 / 41.75 | 16.9 / 56.7 | 26.3 / 71.65 | 37.06 / 83.35 | 49.71 / 90.825 | 62.22 / 95.05 |
| Ragie Outperformed By (%) | +606% | +354% | +236% | +172% | +125% | +83% | +53% |
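For reference, a retrieval request with hybrid retrieval (Ragie's default) might look like the sketch below. The endpoint and field names mirror the settings discussed in this post but are assumptions on our part; consult Ragie's API documentation for the exact schema.

```python
import requests

response = requests.post(
    "https://api.ragie.ai/retrievals",  # assumed endpoint; see Ragie's API docs
    headers={"Authorization": "Bearer <RAGIE_API_KEY>"},
    json={
        "query": "What is the term of the non-disclosure obligation?",
        "top_k": 8,       # the k value being evaluated
        "rerank": False,  # hybrid retrieval only; re-ranking off
    },
)
chunks = response.json()["scored_chunks"]  # assumed response field
```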

Benchmarking with Hybrid retrieval and Re-ranking:

With Hybrid retrieval and re-ranking enabled, the table below shows Ragie’s precision across different k values:

| Precision @ k (Benchmark / Ragie) | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|---|
| PrivacyQA | 13.94 / 53.6 | 15.91 / 55.4 | 13.32 / 55.7 | 9.57 / 55.7 | 6.88 / 52.7 | 4.68 / 49.7 | 3.28 / 50.7 |
| ContractNLI | 5.08 / 57.2 | 5.59 / 66.7 | 5.04 / 63.9 | 3.67 / 58.2 | 2.52 / 50.1 | 1.75 / 46.3 | 1.17 / 45.3 |
| MAUD | 1.94 / 28.8 | 2.63 / 36.5 | 2.05 / 45.5 | 1.77 / 54.2 | 1.79 / 59.7 | 1.55 / 60.6 | 1.12 / 60.9 |
| CUAD | 3.53 / 27.3 | 4.18 / 33.7 | 6.18 / 42.1 | 5.06 / 45.6 | 3.93 / 48.0 | 2.74 / 50.1 | 1.66 / 47.1 |
| ALL | 6.13 / 41.7 | 7.1 / 48.1 | 6.6 / 51.8 | 5.0 / 53.4 | 3.8 / 52.6 | 2.7 / 51.7 | 1.8 / 51.0 |
| Ragie Outperformed By (%) | +580% | +577% | +685% | +968% | +1284% | +1815% | +2733% |
| vs Ragie Hybrid Retrieval (%) | -22% | +12% | +31% | +54% | +76% | +100% | +124% |

With Hybrid retrieval and re-ranking enabled, the table below shows Ragie’s recall across different k values:

| Recall @ k (Benchmark / Ragie) | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|---|
| PrivacyQA | 7.32 / 33.1 | 16.13 / 47.4 | 25.65 / 61.0 | 35.6 / 72.4 | 51.87 / 79.3 | 64.98 / 79.61 | 79.3 / 80.1 |
| ContractNLI | 4.91 / 49.4 | 9.33 / 68.1 | 16.09 / 79.8 | 25.83 / 84.0 | 35.04 / 86.3 | 46.9 / 88.5 | 62.97 / 86.6 |
| MAUD | 0.52 / 19.4 | 2.48 / 24.4 | 4.39 / 42.1 | 7.24 / 42.1 | 14.03 / 50.7 | 22.6 / 53.6 | 31.46 / 56.5 |
| CUAD | 3.17 / 19.3 | 7.33 / 30.1 | 18.26 / 46.0 | 28.67 / 53.5 | 42.5 / 58.0 | 55.66 / 61.7 | 70.19 / 59.4 |
| ALL | 3.98 / 30.3 | 8.8 / 42.5 | 16.1 / 54.1 | 24.3 / 63.0 | 35.9 / 68.5 | 47.5 / 70.8 | 61.1 / 70.7 |
| Ragie Outperformed By (%) | +661% | +383% | +236% | +159% | +91% | +49% | +16% |
| vs Ragie Hybrid Retrieval (%) | -13% | +2% | -5% | -12% | -18% | -22% | -26% |
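In the request sketch shown earlier, enabling re-ranking would be a single flag (again, the field name is an assumption, not a confirmed schema):

```python
payload = {
    "query": "What is the term of the non-disclosure obligation?",
    "top_k": 8,
    "rerank": True,  # trade some recall for higher precision
}
```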

Benchmarking with Hierarchical retrieval:

With Hierarchical retrieval enabled, the table below shows Ragie’s precision across different k values (max_chunks_per_doc = top_k):

| Precision @ k (Ragie, Hierarchical) | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|---|
| PrivacyQA | 62.3 | 53.8 | 46.8 | 39.5 | 35 | 31.7 | 31.2 |
| ContractNLI | 68 | 62.1 | 50.5 | 42.5 | 40.4 | 40.1 | 40.1 |
| MAUD | 59.2 | 55.4 | 48.9 | 43.7 | 38.9 | 35.1 | 31.9 |
| CUAD | 26.2 | 28.6 | 28.2 | 22.1 | 16.7 | 13.3 | 12 |
| ALL | 53.9 | 50 | 43.5 | 37 | 32.8 | 30.1 | 28.8 |
| vs Ragie Hybrid Retrieval (%) | +0.4% | +16.5% | +10% | +6.8% | +9.9% | +16.4% | +26.3% |

Ragie’s recall across different k values (max_chunks_per_doc = top_k) with Hierarchical retrieval enabled:

| Recall @ k (Ragie, Hierarchical) | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|---|
| PrivacyQA | 40.5 | 52.7 | 69.3 | 81.8 | 93.4 | 95.4 | 95.6 |
| ContractNLI | 53.6 | 70.1 | 70.8 | 72.5 | 72.3 | 72.8 | 72.8 |
| MAUD | 29.3 | 38.9 | 48.1 | 61 | 72.9 | 80.2 | 87.6 |
| CUAD | 16.7 | 32.4 | 55.7 | 75.3 | 85.4 | 94.2 | 98.8 |
| ALL | 35 | 48.5 | 61 | 72.7 | 81 | 85.7 | 88.7 |
| vs Ragie Hybrid Retrieval (%) | +0.3% | +16.2% | +7.6% | +1.5% | -2.8% | -5.6% | -6.7% |
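The hierarchical runs above pin max_chunks_per_doc to the same value as top_k. In the same request sketch, that configuration might look like this (the field names follow the parameter named in this post and are assumptions about the schema):

```python
payload = {
    "query": "Which sections of the privacy policy cover data sharing?",
    "top_k": 8,
    "max_chunks_per_doc": 8,  # evaluated with max_chunks_per_doc = top_k
}
```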

Key Insights from Ragie's Benchmark Analysis

  1. Production-level Precision and Recall:
    Ragie significantly outperforms the benchmark on both precision and recall, so developers can rely on it to retrieve accurate, relevant results consistently in production.
  2. High Recall Accuracy:
    With recall reaching 99.4% (ContractNLI at k = 64), Ragie retrieves nearly all relevant information, leaving minimal room for missed results. This makes it a highly reliable system for complex data queries.
  3. Efficiency at Lower k Values:
    Even at low k values, Ragie’s precision and recall are significantly better than the benchmarks. This allows developers to use smaller context windows, making production systems faster and more cost-effective without sacrificing performance.
  4. Precision Improves with Re-ranking:
    With re-ranking enabled, Ragie’s precision at k = 64 improves by 124% over hybrid retrieval alone, at the cost of some recall. This makes re-ranking the right choice for scenarios where precision matters more than recall.
  5. Hierarchical Retrieval Boosts Precision:
    Enabling hierarchical retrieval improves precision further, by up to 26% over hybrid retrieval (at k = 64).
  6. Ragie is Highly Customizable:
    Our default settings strike the best balance between precision and recall for most apps, but developers can choose among hybrid, re-ranking, and hierarchical retrieval to match their needs.

Conclusion

Ragie’s performance on LegalBench-RAG underscores its readiness for production use in legal and other domain-specific applications. These results didn’t entirely surprise us, because we’ve seen Ragie deliver impressive outcomes in real-world legal use cases. For example, one of our customers, Ellis (an immigration law firm), increased their legal drafting speed by 10x using Ragie’s RAG system.