GPT-4 has a context window of 128,000 tokens, while Gemini boasts a staggering 2 million. Although these numbers are exciting, the reality is somewhat different. You may have observed, as I have, that the quality of reasoning by LLMs tends to falter with lengthy inputs—a phenomenon that current evaluations fail to adequately capture.
The prevalent benchmark for assessing LLMs’ handling of extensive context windows is the “Needle in a Haystack” test, which is primarily focused on locating specific pieces of information within a text. This is quite distant from real-world applications.
After experiencing this degradation of reasoning with increasing context window size in practice, I decided to investigate further and measure the decline in performance.
I created four tests. The first, called ‘Find the Origin’, examines the LLM’s capacity to trace the source of a connection across a network of vertices, varying the number of irrelevant vertices in the system as well as the distance and order between the origin and destination vertices. Example prompt:
Several words below are interconnected. For example:

"X" is connected to "Y"
"Y" is connected to "Z"

In this scenario, the origin of "Z" is "X". We can visualize these connections as vertices and edges, like this:

"X" –> "Y" –> "Z"

Using this logic, consider the following list of connections, where each word is simply the name of a vertex with no other semantic meaning:

"precious" is connected to "rejected"
"rejected" is connected to "appropriations"
"workbench" is connected to "dad"
"dad" is connected to "admire"
"bushes" is connected to "tight"
"tight" is connected to "partner"

Your task is to find the origin of "admire". Work carefully, step by step. Your final answer must be in this format: FINAL ANSWER: YOUR ANSWER
We can increase the size of the context window by introducing more vertices while maintaining the difficulty level.
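To show how irrelevant vertices can be added, here is a minimal sketch of a filler generator; the word pool and the make_filler_pairs name are my own assumptions for illustration, not the repository's actual code:

```python
import random

# Hypothetical pool of filler words; in practice the pool must not overlap
# with the words used in the relevant origin -> destination chain.
WORD_POOL = [
    "lantern", "gravel", "mosaic", "harbor", "thimble", "orchid",
    "pretzel", "canyon", "velvet", "trombone", "glacier", "saddle",
]

def make_filler_pairs(n_pairs: int, rng: random.Random) -> list[str]:
    """Generate n_pairs irrelevant '"A" is connected to "B"' lines."""
    lines = []
    for _ in range(n_pairs):
        a, b = rng.sample(WORD_POOL, 2)   # two distinct filler words
        lines.append(f'"{a}" is connected to "{b}"')
    return lines

# Adding more pairs grows the prompt (and the token count) without
# changing the difficulty of the underlying task.
print("\n".join(make_filler_pairs(5, random.Random(0))))
```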
The difficulty level (complexity) of the Find the Origin test may depend on:

A) Number of connections. In the example above, each origin vertex reaches a destination with only two connections.

B) Distance between connections. In the example above, the destination vertex "admire" is immediately below the origin vertex "workbench". To quantify this relationship, a distance parameter d is defined, with d=1 in this scenario. If the destination vertex is positioned on the second line below the origin vertex, d=2.
Example of Find the Origin with d=2:
…
"precious" is connected to "rejected"
"workbench" is connected to "dad"
"rejected" is connected to "appropriations"
"dad" is connected to "admire"
…
C) Order between connections. When the destination vertex is below the origin vertex, as in the first example, it results in a positive d value. Conversely, if the destination vertex is above the origin vertex, d is negative. Example of Find the Origin with d=-1:
…
"rejected" is connected to "appropriations"
"precious" is connected to "rejected"
"dad" is connected to "admire"
"workbench" is connected to "dad"
…
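Building on that, the sketch below shows one way the relevant chain could be placed at a signed offset d among the filler lines. It reuses the hypothetical make_filler_pairs helper from the earlier sketch and is only an illustration of the parameterization described above, not the repository's implementation:

```python
import random

def build_prompt(origin: str, middle: str, target: str,
                 n_filler: int, d: int, seed: int = 0) -> str:
    """Assemble a Find-the-Origin style list of connections.

    '"origin" is connected to "middle"' and '"middle" is connected to "target"'
    are placed so that the second line sits d lines below the first (d > 0)
    or |d| lines above it (d < 0); every other line is an irrelevant pair.
    Assumes d != 0 and n_filler > 2 * abs(d) so the offset always fits.
    """
    rng = random.Random(seed)
    lines = make_filler_pairs(n_filler, rng)  # hypothetical helper from the earlier sketch
    first = f'"{origin}" is connected to "{middle}"'
    second = f'"{middle}" is connected to "{target}"'

    pos = rng.randrange(abs(d), len(lines) - abs(d))  # leave room for the offset
    if d > 0:
        lines.insert(pos, first)
        lines.insert(pos + d, second)   # destination line d rows below the origin line
    else:
        lines.insert(pos, second)
        lines.insert(pos - d, first)    # destination line |d| rows above the origin line

    # The explanatory header from the example prompt would be prepended here.
    task = (f'Your task is to find the origin of "{target}". Work carefully, step by step. '
            "Your final answer must be in this format: FINAL ANSWER: YOUR ANSWER")
    return "\n".join(lines) + "\n\n" + task
```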
When the test parameters are varied, performance is neither consistent nor predictable. The GPT-4 Turbo model achieved the highest consistency, but it still showed significant degradation as the context window (n_tokens) and the distance between relevant pieces of information (parameter d) increased.
Here are the outcomes for d=15, with context windows ranging from 287 to 10,029 tokens:
For the reverse sequence of information (d=-15):
Here are the isolated performances of various models at different d values:
Noteworthy observations from these trials include the suboptimal performance of the GPT-4o model across all configurations, the instability of the Gemini models as the magnitude of parameter d increases, and the asymmetric behavior of the Sonnet 3.5 model with respect to the sign (positive or negative) of parameter d.
I have made all the code and documentation for this test available on GitHub for those interested: https://github.com/natanaelwf/LLMTest_FindTheOrigin
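For reference, a scoring loop for this kind of test might look like the following sketch; query_model is a placeholder for whichever model API is being evaluated, and this is not the code from the repository:

```python
import re

def extract_answer(response: str) -> str | None:
    """Pull the word after 'FINAL ANSWER:' out of a model response."""
    match = re.search(r'FINAL ANSWER:\s*"?([\w-]+)"?', response, re.IGNORECASE)
    return match.group(1).lower() if match else None

def score_run(query_model, prompts_and_origins) -> float:
    """Return accuracy over a list of (prompt, expected_origin) pairs."""
    correct = 0
    for prompt, expected in prompts_and_origins:
        answer = extract_answer(query_model(prompt))   # query_model is a placeholder
        correct += int(answer == expected.lower())
    return correct / len(prompts_and_origins)
```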
To further illustrate the decline in reasoning as the context window expands, I employed three additional tests:
- Highlight Inefficient Code: I placed two highly problematic and inefficient Python functions amid a larger body of code and asked the LLM to evaluate the whole code for inefficiencies, thereby gauging its capability for meticulous analysis (a sketch of this kind of function appears after this list).

- Decrypting Cryptography from a Clue: At the tail end of a text, I inserted an encrypted message, while subtly embedding a clue for its decryption at the beginning. This was done without providing further instructions or explicit prompts, challenging the LLM to engage in reasoning driven by curiosity.

- Unlock $100.00: In the midst of a text, I placed a statement promising $100.00 contingent upon a specific response to the text. This scenario was crafted without additional context or any extra request to the LLM, aiming to evaluate the model's aptitude for discovery-based reasoning.
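As an illustration of the first test (my own example, not one of the two functions actually used), a "highly inefficient" function might look like this: a deduplication routine with a quadratic membership check and repeated string concatenation, where a set and str.join would do the same job far more cheaply.

```python
def summarize_unique_words(words: list[str]) -> str:
    """Return a comma-separated list of unique words (intentionally inefficient)."""
    unique = []
    for word in words:
        if word not in unique:            # O(n) membership test inside an O(n) loop -> O(n^2)
            unique.append(word)
    summary = ""
    for word in unique:
        summary = summary + word + ", "   # repeated concatenation copies the string each time
    return summary.rstrip(", ")
```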
All the details of these tests are documented in this paper: Challenging LLMs Beyond Information Retrieval: Reasoning Degradation with Long Context Windows
I find it particularly relevant to explore how intelligent a model is when provided with large inputs. The popular benchmarks, including those used in competitions like lmsys, do not capture this aspect.