Reasoning Degradation in LLMs with Long Context Windows: New Benchmarks

GPT-4 Turbo has a context window of 128,000 tokens, while Gemini 1.5 Pro boasts a staggering 2 million. Exciting as these numbers are, the reality is somewhat different. You may have observed, as I have, that the quality of reasoning by LLMs tends to falter with lengthy inputs, a phenomenon that current evaluations fail to adequately capture.

The prevalent benchmark for assessing LLMs’ handling of extensive context windows is the “Needle in a Haystack” test, which is primarily focused on locating specific pieces of information within a text. This is quite distant from real-world applications.

After repeatedly encountering this reasoning degradation with large context windows in practice, I decided to explore it more systematically and measure the decline in performance.

I created four tests. The first, called ‘Find the Origin,’ examines the LLM’s capacity to trace the source of a connection across a network of vertices, varying the number of irrelevant vertices in the system as well as the distance and order between the origin and destination vertices. Example prompt:

Several words below are interconnected. For example:

"X" is connected to "Y"
"Y" is connected to "Z"

In this scenario, the origin of "Z" is "X". We can visualize these connections as vertices and edges, like this:
"X" --> "Y" --> "Z"

Using this logic, consider the following list of connections, where each word is simply the name of a vertex with no other semantic meaning:

"precious" is connected to "rejected"
"rejected" is connected to "appropriations"
"workbench" is connected to "dad"
"dad" is connected to "admire"
"bushes" is connected to "tight"
"tight" is connected to "partner"

Your task is to find the origin of "admire". Work carefully, step by step. Your final answer must be in this format: FINAL ANSWER: YOUR ANSWER

We can increase the size of the context window by introducing more vertices while maintaining the difficulty level.

The difficulty level (complexity) of the Find the Origin test may depend on:

A) Number of connections. In the example above, each origin vertex reaches a destination with only two connections.

B) Distance between connections. In the example above, the destination vertex "admire" appears immediately below the origin vertex "workbench". To quantify this relationship, a distance parameter d is defined, with d=1 in this scenario. If the destination vertex is positioned on the second line below the origin vertex, d=2.

Example of Find the Origin with d=2:


"precious" is connected to "rejected"
"workbench" is connected to "dad"
"rejected" is connected to "appropriations"
"dad" is connected to "admire"

C) Order between connections. When the destination vertex is below the origin vertex, as in the first example, it results in a positive d value. Conversely, if the destination vertex is above the origin vertex, d is negative. Example of Find the Origin with d=-1:


"rejected" is connected to "appropriations"
"precious" is connected to "rejected"
"dad" is connected to "admire"
"workbench" is connected to "dad"
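
To make these parameters concrete, here is a minimal sketch of a prompt generator along these lines. This is my own illustrative reconstruction, not the actual test code (which is in the GitHub repository linked below); the parameter names `n_distractors`, `d`, and the filler vertex names are assumptions.

```python
import random

def find_the_origin_prompt(origin, middle, destination, n_distractors=10, d=1, seed=0):
    """Build a Find the Origin task body: one 2-hop chain (origin -> middle -> destination)
    hidden among unrelated connections, with the two relevant lines placed |d| lines apart.
    Illustrative sketch only; the real test uses English words as filler vertices and
    prepends the explanatory preamble shown earlier, which is omitted here for brevity."""
    assert 1 <= abs(d) <= n_distractors
    rng = random.Random(seed)
    words = [f"word{i}" for i in range(2 * n_distractors)]  # filler vertex names
    lines = [f'"{words[2*i]}" is connected to "{words[2*i+1]}"' for i in range(n_distractors)]

    # The two relevant connections; negative d puts the destination line above the origin line.
    first = f'"{origin}" is connected to "{middle}"'
    second = f'"{middle}" is connected to "{destination}"'
    if d < 0:
        first, second = second, first

    # Insert them |d| lines apart at a random position among the distractors.
    pos = rng.randrange(0, len(lines) - abs(d) + 1)
    lines.insert(pos, first)
    lines.insert(pos + abs(d), second)

    task = (f'\n\nYour task is to find the origin of "{destination}". '
            'Work carefully, step by step. Your final answer must be in this format: '
            'FINAL ANSWER: YOUR ANSWER')
    return "\n".join(lines) + task

print(find_the_origin_prompt("workbench", "dad", "admire", n_distractors=8, d=2))
```

Growing the context window then just means raising `n_distractors` while keeping the chain length and d fixed.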

When the test parameters are varied, performance is neither consistent nor predictable. The GPT-4 Turbo model achieved the highest consistency, but it still showed significant degradation as the context window (n_tokens) and the distance between relevant pieces of information (parameter d) increased.

Here are the outcomes for d=15, with context windows ranging from 287 to 10,029 tokens:

For the reverse sequence of information (d=-15):

Here are the isolated performances of various models at different d values:




Noteworthy observations from these trials include the suboptimal performance of the GPT-4o model across all configurations, the instability of the Gemini models when increasing the magnitude of the parameter d, and the asymmetry of the Sonnet 3.5 model concerning the sign (positive or negative) of parameter d.

I have made all the code and documentation for this test available on GitHub for those interested: https://github.com/natanaelwf/LLMTest_FindTheOrigin


To further illustrate the decline in reasoning as the context window expands, I employed three additional tests:

  • Highlight Inefficient Code: I placed two highly problematic and inefficient Python functions amidst otherwise ordinary code and asked the LLM to evaluate the whole code for inefficiencies, thereby gauging its capability for meticulous analysis (an illustrative sketch of this kind of planted function follows this list).

  • Decrypting Cryptography from a Clue: At the tail end of a text, I inserted an encrypted message, while subtly embedding a clue for its decryption at the beginning. This was done without providing further instructions or explicit prompts, challenging the LLM to engage in reasoning driven by curiosity.

  • Unlock $100.00: In the midst of a text, I placed a statement promising $100.00 contingent upon a specific response to the text. This scenario was crafted without additional context or any explicit request to the LLM, aiming to evaluate the model’s aptitude for discovery-based reasoning.
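
For the first test, the planted functions might look something like the following. These are illustrative examples of mine, not the exact functions used in the paper: a quadratic-time duplicate check and an exponential recursive Fibonacci, both of which a careful reviewer should flag immediately.

```python
def find_duplicates(items):
    # O(n^2): compares every pair instead of using a set (O(n)).
    duplicates = []
    for i in range(len(items)):
        for j in range(len(items)):
            if i != j and items[i] == items[j] and items[i] not in duplicates:
                duplicates.append(items[i])
    return duplicates

def fibonacci(n):
    # Exponential time: recomputes the same subproblems instead of memoizing or iterating.
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)
```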

All the details of these tests are documented in this paper: Challenging LLMs Beyond Information Retrieval: Reasoning Degradation with Long Context Windows

I find it particularly relevant to explore how intelligent a model is when provided with large inputs. The popular benchmarks, including those used in competitions like lmsys, do not capture this aspect.

11 Likes

That’s intriguing. I wonder why Sonnet 3.5 and Gemini perform better with negative d values?

Thanks for sharing what you found.

1 Like

This is something that warrants further investigation, but I believe it may be linked to how models are influenced by positional encoding.

1 Like

How about GPT-4o-latest? Has there been any improvement in these benchmarks? I haven’t particularly noticed much difference so far working with short prompts. It would be good to know how it behaves with long prompts.

1 Like

It’s still on my to-do list, but I’m hoping to get to it soon and will definitely post an update here once I do.

The thing to understand is that these models are not capable of actually reasoning. They can fake reasoning to a degree by memorizing massive amounts of data with a wide distribution of patterns. So while it may look like they’re reasoning, they’re really just matching the task and data they’re shown to patterns they’ve been trained on. If you look at your results in that light, it’s a bit easier to see what’s going on.

For cases where the distance spans too many steps, that just means they haven’t been trained on patterns with that many steps. The pattern needed to answer the question falls outside the model’s distribution.

The reason token distance matters is that the more tokens you put between the individual steps of the pattern, the more likely something is to break the pattern.

We run a multi-needle-in-a-haystack test that randomly distributes 20 unique passwords across a corpus of varying lengths, and we see the same issues. Our task is easier than yours, but the results are the same: the models are generally good at retrieving all the passwords when they’re clustered close together, but as the passwords spread out, all of these models start to lose track of them.
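
A multi-needle setup like this is straightforward to reproduce. Here is a rough sketch of how such a harness could be built; it is my own construction for illustration, not the poster's actual test code, and the naming and 0-10 scoring scale are assumptions.

```python
import random

def build_multi_needle_corpus(filler_paragraphs, n_needles=20, seed=0):
    """Scatter n_needles password facts across filler text; return the corpus
    and the ground-truth passwords. Illustrative sketch only."""
    rng = random.Random(seed)
    passwords = {f"user{i}": f"pw-{rng.randrange(10**6):06d}" for i in range(n_needles)}
    needles = [f"The password for {user} is {pw}." for user, pw in passwords.items()]
    blocks = list(filler_paragraphs)
    for needle in needles:
        blocks.insert(rng.randrange(len(blocks) + 1), needle)
    return "\n\n".join(blocks), passwords

def score(model_answer, passwords):
    # Fraction of the passwords that appear in the model's answer, on a 0-10 scale.
    found = sum(1 for pw in passwords.values() if pw in model_answer)
    return 10 * found / len(passwords)
```

Varying the filler length and how tightly the needles cluster gives the spread-versus-clustered comparison described above.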

Here’s what we see for GPT-4o. I don’t have the heat maps handy for Claude 3.5 Sonnet and gpt-4o-mini but they’re similar. Claude is a little better (it scores like 8.3) and mini is worse (like a 6.1 or so)

4 Likes

It’s worth noting that we have an algorithm which generally solves the token distance problem. Here’s the same 20 password multi-needle test using our distributed reasoning engine:

Our algorithm solves this lost-in-the-middle problem by creating a chain of thought that physically moves the information needed to generate a response closer together, distance-wise. I would expect our engine to do fairly well on your test, in that it should remove the effects of distance, but again, these models can’t reason, so if the pattern needed to answer the question falls outside the model’s distribution there’s nothing you can do.

5 Likes

I find it interesting how this generalization capability is associated with the context window. There are moments when the loss occurs very early. In my test ‘Decrypting Cryptography from a Clue,’ I introduced an encrypted message and a clue. This is the prompt, at 214 tokens:

GPT-4 Turbo was able to grasp everything and solve the puzzle:

After increasing the prompt from 214 to 354 tokens by adding more passages from the book, the model still recognizes the presence of an encrypted message and a clue in the text but is unable to reason its way to a solution:

When the amount of text from the book is increased to 6,000 tokens, the model becomes blind and doesn’t even perceive that there is an encrypted message and a clue in the text. The investigative capability is completely lost.

This serious problem is barely addressed. OpenAI and other companies have announced increases in the context window, ignoring the fact that LLMs are extremely dumb with long inputs.

3 Likes

I’m sure they would improve the situation if they could. This is likely just another aspect of LLMs that we aren’t completely sure why it happens. Even in our case, the fact that we improve the model’s reasoning with long inputs is an unexpected side effect of our reasoning algorithm. I designed the algorithm to let me create an infinite-length context window. The fact that it also helps avoid “lost in the middle” issues was a happy accident.

1 Like

Interesting. Do you have any paper or a GitHub related to this algorithm?

Unfortunately we’re building a business around this algorithm so it’s currently secret sauce. I’m sure we will release more details over time but I’m not sure when.

2 Likes

If you want something now to play around with, check out the recently open sourced GraphRAG from Microsoft.

https://microsoft.github.io/graphrag/

They form a knowledge graph from your large data set using the LLM, and then perform inference across the graph as well.

Out of the box, you give it your OpenAI API key, and it does the rest for you.

Maybe @stevenic or others can shed some light on this, I’d be interested in how automated knowledge graph generation and inference compares to these other deeper inference techniques that go beyond vanilla RAG.

So essentially you throw long context out the window (pun intended), and form a graph out of the large context instead, and do some deeper stuff on the graph directly … and lots of smaller context API calls.

2 Likes

I view it like this… if you want the best chance at an accurate answer you need to show the model the entirety of the corpus you’re reasoning over. Anything that removes information increases the likelihood of an inaccurate answer. RAG and all its variants (including GraphRAG) remove information from the corpus which increases the chance the model won’t see the information needed to answer a question or a complete task.

My personal focus has been around developing techniques that make it practical to show the model the entirety of a corpus. This includes techniques for virtualizing the context window and reducing lost in the middle issues. Per token costs are falling like crazy so my focus is squarely on just reasoning over everything as that guarantees the best possible answer.

In my view, vector databases and semantic similarity algorithms are point-in-time solutions that will soon be less needed.

3 Likes

I agree with the notion that you often need a view of the full context to complete certain tasks or to provide a more holistic answer.

I do think though that there could still be hybrid solutions that involve semantic similarity.

Say you have a corpus of text and the content required to answer a question really sits in just three different sections; then ideally you want to avoid having to show the model the entire context. So I am hoping to find new and smart ways to leverage semantic similarity to systematically segment the corpus at the point of ingestion and narrow down the parts that you eventually have to run past the model. I am not saying I have found a replicable approach for this just yet, but it is something I am actively looking at in the context of one of my solutions.

I’ve been experimenting a lot with semantic segmentation but the challenge is certainly to create an approach that you could replicate for any sort of question or analysis.
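
As a starting point, the narrowing step can be as simple as embedding each section at ingestion time and keeping only the top-scoring ones for a given question. A minimal sketch, assuming the OpenAI embeddings API; the section boundaries, model choice, and k are placeholders, and this is not a claim about the poster's actual approach:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts, model="text-embedding-3-small"):
    # Embed a list of strings; returns one vector per input.
    resp = client.embeddings.create(model=model, input=texts)
    return np.array([d.embedding for d in resp.data])

def top_sections(question, sections, k=3):
    """Return the k sections most semantically similar to the question."""
    section_vecs = embed(sections)
    q_vec = embed([question])[0]
    # Cosine similarity between the question and each section.
    sims = section_vecs @ q_vec / (
        np.linalg.norm(section_vecs, axis=1) * np.linalg.norm(q_vec)
    )
    ranked = np.argsort(sims)[::-1][:k]
    return [sections[i] for i in ranked]
```

The hard part, as noted above, is segmenting the corpus so that the "three sections" actually line up with semantically coherent units for any question.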

2 Likes

Great work, thanks a lot.

IMHO the tests may not be optimally adapted for use with LLMs, as personally I think the tasks rely on LLM capacities outside of their primary purpose (language operations vs. reasoning and reflection). But these tests are already way better than most of what’s available out there to address this issue.

But hell, yeah, the bigger the input in the window, the harder it is to keep the same level of focus on the task. I don’t see how to bypass this other than by simplifying the task and splitting it into sub-tasks/workflows.

1 Like

There’s definitely plenty of uses for search still… both semantic and more traditional keyword search. You’re not going to reason over the entire web every time you want to ask the model a question…

Here’s a quick screen grab of the CLI version of our engine running a feature we call smart search, which is essentially our version of SearchGPT. We do a web search, and then, to cut the search space, we show the model the SERP and let it pick the pages it wants to read for the answer. In this case the SERP we show the model has 10 pages but it only chose 3. We then reason over those 3 pages for an answer. Here that came to about 17k tokens, but it could have been 1.7M tokens; we would have reasoned over as many pages as the model wanted.
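
For readers who want to try the general pattern (this is not their engine, which is private), the flow is roughly: show the model the SERP snippets, let it pick result indices, fetch only those pages, then answer over them. A rough sketch, assuming the OpenAI chat API and a hypothetical `fetch_page` helper that downloads and cleans a page's text:

```python
from openai import OpenAI

client = OpenAI()

def smart_search_answer(question, serp_results, fetch_page, model="gpt-4o"):
    """serp_results: list of dicts with 'title', 'url', 'snippet'."""
    serp_text = "\n".join(
        f"[{i}] {r['title']} - {r['url']}\n    {r['snippet']}"
        for i, r in enumerate(serp_results)
    )
    # Step 1: let the model cut the search space by picking which results to read.
    pick = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"Question: {question}\n\nSearch results:\n{serp_text}\n\n"
                       "Reply with only the comma-separated indices of the results "
                       "worth reading to answer the question.",
        }],
    ).choices[0].message.content
    indices = [int(tok) for tok in pick.replace(",", " ").split()
               if tok.isdigit() and int(tok) < len(serp_results)]

    # Step 2: reason over the full text of the chosen pages only.
    pages = "\n\n".join(fetch_page(serp_results[i]["url"]) for i in indices)
    answer = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"Using only the pages below, answer: {question}\n\n{pages}",
        }],
    ).choices[0].message.content
    return answer
```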

4 Likes

Does this web query depend on search engines like Google/Bing?

It’s using Bing currently but would work with any search engine that can generate a text snippet (pretty much all of them.)

This situation seems somewhat different from the general problem. When performing a web search, each page is independent, with attributes like Title, H1, description, and so on. This makes it relatively straightforward to execute the query in two steps: first, by identifying the most promising content, and then by processing these selected pieces separately to form a response. I developed a system using Selenium to do this on Google, working through several stages to analyze the top 40 search results.
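
For reference, the SERP-collection stage of such a pipeline can be sketched with Selenium roughly like this. This is illustrative only, not my production system: the CSS selectors are placeholders that change as Google updates its markup, and real use needs paging, waits, and consent-dialog handling.

```python
from urllib.parse import quote_plus

from selenium import webdriver
from selenium.webdriver.common.by import By

def collect_results(query, max_results=40):
    """Collect title/URL pairs from a Google results page (sketch only)."""
    driver = webdriver.Chrome()
    try:
        driver.get(f"https://www.google.com/search?q={quote_plus(query)}&num={max_results}")
        results = []
        for block in driver.find_elements(By.CSS_SELECTOR, "div.g"):  # placeholder selector
            try:
                title = block.find_element(By.CSS_SELECTOR, "h3").text
                url = block.find_element(By.CSS_SELECTOR, "a").get_attribute("href")
                results.append({"title": title, "url": url})
            except Exception:
                continue  # skip blocks that don't match the expected structure
        return results[:max_results]
    finally:
        driver.quit()
```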

However, the challenge of analyzing a large document is different when the content isn’t pre-structured into independent sections. Do you have any other examples?