Opinion
April 14, 2025

Ragie on “RAG is Dead”: What the Critics Are Getting Wrong… Again

Bob Remeika, Co-founder & CEO

The release of Llama 4 Scout, with its 10 million token context window capable of holding over 13,000 pages[1] of text, has reignited the “RAG is dead” debate online… again. This debate makes a regular appearance every time a new long context window LLM is released, so this time I thought I would dive deep and explain why I don’t think “RAG is dead,” and why we at Ragie believe it will remain relevant even as context windows continue to get longer.

It’s true that RAG is no longer required for some use cases that were once limited by tiny context windows. But it’s still a useful tool for applications that need to access large amounts of data, in different modalities, at scale, and it’s still necessary once you factor in the latency, cost and accuracy issues introduced by running inference on long context windows.

Let’s dive into this by looking at two RAG use cases: one where RAG is no longer required and one where it’s still needed. Then we’ll break down the latency, cost and accuracy tradeoffs you can expect when using long context windows.

Use case: Chat with a document

When GPT-3.5 was released, its context window was 4,096 tokens, which translated to about 3,000 words or 5 pages[1] of text. If you wanted to chat with a large PDF, chances are you needed to use RAG to do anything useful.
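
As a quick sanity check on those page figures (and the 13,000-page figure above), here’s the arithmetic behind them, using the 750-tokens-per-page estimate from the footnote:

```python
# Rough page estimates from context window sizes, per Ragie's page
# calculation (~3,000 characters, or about 750 tokens, per page).
TOKENS_PER_PAGE = 750

gpt35_context = 4_096              # tokens
llama4_scout_context = 10_000_000  # tokens

print(gpt35_context / TOKENS_PER_PAGE)         # ~5.5 pages
print(llama4_scout_context / TOKENS_PER_PAGE)  # ~13,333 pages
```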

Since then, context windows have expanded dramatically, with LLMs like Gemini 2.0 Flash (2M tokens) and Llama 4 Scout (10M tokens) leading the way.

With long context window LLMs, you can now pass the entire document into your context window without splitting it into chunks and selecting a subset. This wasn’t possible when context windows were tiny.  So for this use case, RAG is no longer required (although it might still be useful).
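
For illustration, here’s a minimal sketch of what that looks like using an OpenAI-compatible chat API; the model name and file path are placeholders, not a recommendation:

```python
# Minimal sketch: chat with an entire document by placing its full text in the
# context window. Assumes an OpenAI-compatible chat API; the model name and
# file path are placeholders.
from openai import OpenAI

client = OpenAI()

with open("contract.txt") as f:  # pre-extracted text of the PDF
    document_text = f.read()

response = client.chat.completions.create(
    model="long-context-model",  # placeholder for any long context window model
    messages=[
        {"role": "system", "content": "Answer questions using only this document:\n\n" + document_text},
        {"role": "user", "content": "What are the termination clauses?"},
    ],
)
print(response.choices[0].message.content)
```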

One caveat here is that dumping a large number of tokens into the context window can be slow, pricey, and give the LLM unrelated context that leads to hallucinations. So even if your corpus of text is relatively small, you still might want to consider RAG for more targeted context.

Now let’s look at a case where RAG is still needed.

Use case: Chat with your knowledgebase

Most real-world knowledgebases contain megabytes of data on the low end and gigabytes or terabytes on the high end. Building a chatbot over a knowledgebase that size isn’t possible using the context window alone; there is simply too much data.

I sampled Ragie customers that have uploaded at least 100 files, and all of them have knowledgebases that exceed 10M tokens, by almost 10x. The total token size of the largest 10% of knowledgebases in Ragie exceeds this limit by over 65x, and the largest 1% exceeds it by nearly 1000x.

The main difference between this use case and the last is scale, and that’s why RAG is still required for many real-world applications. If you want to build chatbots, report builders, agents, or any other kind of AI application over a large amount of data, you still need RAG.

Latency, cost and accuracy

Long context windows are great because they make it possible to give LLMs more data and more relevant context than before, but they aren’t a replacement for RAG. Before anyone starts stuffing 10M tokens into their context window, they should consider how this affects latency, cost and accuracy.

Latency

Inference on long context windows is slow and requires a lot of memory. As of mid-February 2025, users were reporting inference latency of over 30 seconds with Gemini 2.0 Flash on 360K tokens, and up to a minute at 600K tokens.

By contrast, RAG generations that retrieve context from Ragie over a corpus of 20,000+ documents (more than 3M vectors), combining results from multiple indexes, can produce completions using OpenAI and Anthropic models in about 1 second. Your mileage may vary depending on network latency and other variables, but RAG is much faster overall, with the retrieval step typically completing within 300-600ms.
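
If you want to see where the time goes in your own stack, a rough sketch like this works; retrieve_chunks() is a hypothetical stand-in for your retrieval layer (for example, a call to Ragie’s retrieval API), and the query and model name are placeholders:

```python
# Sketch: time the retrieval and generation steps separately.
# retrieve_chunks() is a hypothetical stand-in for your retrieval layer;
# the model name and query are placeholders.
import time
from openai import OpenAI

client = OpenAI()

def retrieve_chunks(query: str) -> list[str]:
    """Hypothetical: return the top-k relevant chunks for the query."""
    return ["<relevant chunk 1>", "<relevant chunk 2>"]  # placeholder chunks

query = "What is our refund policy for enterprise customers?"

t0 = time.perf_counter()
chunks = retrieve_chunks(query)  # typically ~300-600ms per the numbers above
t1 = time.perf_counter()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[
        {"role": "system", "content": "Answer using only this context:\n\n" + "\n\n".join(chunks)},
        {"role": "user", "content": query},
    ],
)
t2 = time.perf_counter()

print(f"retrieval: {t1 - t0:.2f}s, generation: {t2 - t1:.2f}s")
```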

I tried a few different tests to see what kind of inference performance you can expect with a full 10M token context window on Llama 4 Scout, but I haven’t been able to run anything successfully yet. My first attempt was to run a completion using Groq, but their API blocked my large request payloads (40MB). I also tried running my own inference tests on an H100 with 80GB of VRAM, but both attempts, at 1M and 10M tokens, failed by running out of memory. Completing the test will require specialized hardware with a massive amount of VRAM, and that sounds pretty expensive, which is a great segue into cost.

Cost

Processing millions of input tokens is significantly more expensive than using RAG to narrow your context windows.

Let’s do some quick math: at $0.11 per million input tokens, you’re looking at $1.10 per inference on Llama 4 Scout with 10M input tokens. Over 100 inferences with a full context window, that’s $110 on input tokens alone; but if we preprocess our context with RAG, we can dramatically decrease the inference cost.

Using the RAG approach with a top_k of 10 and an expanded chunk size of 1,000 tokens, we’re looking at roughly 10,000 input tokens per completion, at a cost of $0.0011 each. 100 inferences using the RAG approach will cost you $0.11 in input tokens.
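
Here’s that back-of-the-envelope arithmetic spelled out, using the same pricing and token-count assumptions as above:

```python
# Back-of-the-envelope cost comparison using the figures above.
PRICE_PER_M_INPUT_TOKENS = 0.11   # USD, Llama 4 Scout input pricing assumed above
N_INFERENCES = 100

# Full context window: 10M input tokens per inference
full_context_tokens = 10_000_000
full_cost = N_INFERENCES * full_context_tokens / 1_000_000 * PRICE_PER_M_INPUT_TOKENS

# RAG: top_k=10 chunks at ~1,000 tokens each => ~10,000 input tokens per inference
rag_tokens = 10 * 1_000
rag_cost = N_INFERENCES * rag_tokens / 1_000_000 * PRICE_PER_M_INPUT_TOKENS

print(f"full context: ${full_cost:.2f}")  # $110.00
print(f"RAG:          ${rag_cost:.2f}")   # $0.11
print(f"ratio:        {full_cost / rag_cost:.0f}x cheaper with RAG")  # 1000x
```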

Using RAG instead of a full 10M token context window is 1000x cheaper.

Accuracy

LLMs work best when they’re given only the data relevant to a completion. Irrelevant context can cause hallucinations, so stuffing 10M tokens into a context window might not yield the results you expect.

The Llama 4 Scout release did include Needle in a Haystack (NiH) results that looked very promising, but it’s not clear that NiH performance is a good indicator of completion accuracy; it’s only an indicator of successful recall.

Contrast this with a RAG approach, where only relevant chunks are retrieved and placed in the context window (and can be further refined with re-rankers). You’re more likely to get accurate completions because the LLM is only given data that is relevant to the final generation.
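
As a sketch of that retrieve-then-re-rank shape, here’s a minimal example; retrieve_chunks() and rerank_chunks() are hypothetical stand-ins for your vector search and re-ranker, and the scores and threshold are purely illustrative:

```python
# Sketch: retrieve candidates, re-rank them, and keep only the most relevant
# chunks before generation. Both helpers are hypothetical stand-ins; the
# scoring and threshold are illustrative, not a recommendation.

def retrieve_chunks(query: str, top_k: int = 50) -> list[str]:
    """Hypothetical: vector search returning candidate chunks."""
    return [f"<candidate chunk {i}>" for i in range(top_k)]

def rerank_chunks(query: str, chunks: list[str]) -> list[tuple[str, float]]:
    """Hypothetical: cross-encoder style re-ranker returning (chunk, score) pairs."""
    return [(chunk, 1.0 / (rank + 1)) for rank, chunk in enumerate(chunks)]

query = "What changed in the 2024 pricing terms?"
candidates = retrieve_chunks(query, top_k=50)
scored = rerank_chunks(query, candidates)

# Keep only chunks the re-ranker scores as relevant, capped at 10.
context = [chunk for chunk, score in scored if score > 0.1][:10]
print(len(context), "chunks passed to the LLM")
```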

Final Thoughts and Predictions

Context windows will continue to get longer, but the size of knowledgebases and the data available to AI applications will continue to outpace the size of the context window (just like the size of hard drives outpaces the size of RAM in a typical system).

Applications will take advantage of long context windows by providing more relevant context to LLMs, but a retrieval step will still be needed, as it is impractical (and in many cases impossible) to stuff all of your data into the context window for latency, cost and accuracy reasons.

So although long context windows are where we’re headed, and there will be continued innovation in long context models, they won’t replace RAG. If anything, they’ll “augment” it.

Sign up today and try Ragie for free

References:

  1. This estimate uses Ragie’s page calculation, which is based on 3,000 characters or about 750 tokens per page.