Reasoning Degradation in LLMs with Long Context Windows: New Benchmarks

Indeed, it is difficult to create a test involving large context windows that resembles real-world problems. However, the “Highlight Inefficient Code” test that I describe in the paper is a practical and common example from the programming field.

*By the way, I tried to submit the paper to arXiv to promote more activity in the creation of these benchmarks, but since I am not affiliated with any institution, I need to be “endorsed” by someone on the platform. This is the standard form I received:

Natanael Fraga requests your endorsement to submit an article to the cs.AI section of arXiv. To tell us that you would (or would not) like to endorse this person, please visit the following URL:

https://arxiv.org/auth/endorse?x=7WIRSX

We don’t expect you to read the paper in detail, or verify that the work is correct, but you should check that the paper is appropriate for the subject area.

If anyone from this community would be willing to endorse me, I would be grateful.

One thing that’s been stirring around my brain today is that what you are doing sounds a lot like creating a knowledge graph: you may not be recording it for future RAG reference, but you are building some sort of graph with all the agentic feedback going on.

So a GraphRAG approach might be a snapshot of your process, frozen in time, perhaps suboptimal compared to a “live” version, but still a good approximation of the huge amount of AI chewing done in your case to create an answer.

Also, a hyper-parameter of these graphs is actually the prompts used. So each graph produced can have a certain lens/projection depending on the prompts used to create the knowledge graph.

I’m wondering whether, if you were to create a graph out of your stuff using, say, the Leiden algorithm, you would get a snapshot of your process that you could rewind and traverse to gather more information, without refocusing the agents from scratch.

Of course this snapshot is taken from whatever lens you are carving with the prompts. So you can have different flavors of graphs to traverse, depending on the prompts used to create them.

The main reason this might be useful is that after you do this linking (graph/nodes/edges) once, you get much faster follow-on latencies. So it’s more of a pre-processing step that gets you close to a full agentic session answer, without all that wait time and resource expenditure.
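If it helps to picture it, here is a rough sketch of that kind of snapshot: agent-derived links between documents fed into Leiden community detection via python-igraph and leidenalg. The edges and names below are made up, not your pipeline.

```python
import igraph as ig
import leidenalg  # pip install python-igraph leidenalg

# Hypothetical edges an agentic run might have inferred between documents.
agent_edges = [
    ("paper_001", "paper_017", "cites_same_method"),
    ("paper_001", "paper_042", "contradicts"),
    ("paper_017", "paper_042", "extends"),
]

# Build an undirected graph from the (source, target) pairs.
g = ig.Graph.TupleList([(src, dst) for src, dst, _ in agent_edges], directed=False)

# Leiden community detection: each community is a cluster of documents the
# agents found related, i.e. a frozen snapshot you can traverse later
# without re-running the agents.
partition = leidenalg.find_partition(g, leidenalg.ModularityVertexPartition)
for community_id, members in enumerate(partition):
    print(community_id, [g.vs[i]["name"] for i in members])
```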

That’s just my initial impression of these LLM-built knowledge graphs compared to what you are doing.

2 Likes

A couple of things… I know graphs are powerful, but I’m not a graph guy, so I don’t think about the problem that way. I spent many years working on the Windows Shell team (I designed the list-view virtualization algorithm for File Explorer), so my tendency is to always start from the file system and work my way up from there, complexity-wise. TL;DR: I’m currently just doing everything as files on disk.

The corpus sizes I want to work with are massive… potentially terabytes or even petabytes in size. My goal is to be able to reason over every scientific paper ever written. There’s no way that much data would ever fit into memory and even a graph database could get unwieldy.

I spent today having my engine read 830 machine learning papers. It took 6 hours, 6.3k gpt-4o requests, 27 million input tokens, and 2.5 million output tokens (91% compression). Total cost… $15 🙂

The next reasoning steps will get smaller and faster as I move data through the funnel, but it’s still big, so I currently just dump out a bunch of files for the output.

These 830 papers are a toy sized corpus for me to vet my ideas with. Once the system is doing what I want I’m going to download all of arxiv.org (3 TB) and throw my engine at every paper that’s been published to their site.

Even with this first pass, there are some interesting observations the model is making about the papers. The next passes will start to connect dots between papers and identify potential gaps that I could train an AI Scientist agent on.

3 Likes

Cool project!

But don’t give up on graphs!

In your files-on-disk situation, your graph would be metadata that links different files together; these links are the edges. You can have multiple edges going from one file to another, and your edges can even have a direction, so file A points to file B under some condition. Note: graphs don’t have to live in memory!

Once you start linking the files together with this metadata, you have a graph! I don’t think the overhead is insane, maybe 10% for all the metadata compared to all your content?

Your agents can even create these relations. And you can have multiple lenses (edges/file-connections from different agents) on the same foundational data…

I think what is overwhelming is examining the combinatorics of all possible edges that exist, but this will never happen since your agents are inferring relations in real time, and don’t have infinite time/bandwidth to examine the entire space.

So all I am talking about here is adding a metadata layer to your existing data. Then you can let the LLM leverage this in the future to make further progress.

You can think of this metadata as past traces or “viewpoints” of previous agents. It would be interesting to see how different LLMs view the same data, or how prompts shape the views within an LLM.
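Concretely, a single edge could be nothing more than a small JSON record appended to a file that sits next to the content. The paths and field names below are only illustrative, not a fixed schema.

```python
import json
from pathlib import Path

# One edge = one small JSON record. All field names here are illustrative.
edge = {
    "src": "papers/paper_0412.txt",
    "dst": "papers/paper_0057.txt",
    "relation": "builds_on",        # directed: src -> dst
    "lens": "method-similarity",    # which prompt/agent produced this view
    "created_by": "linker_agent_v1",
}

# Append the edge to an on-disk metadata file (JSON Lines, one edge per line).
edges_path = Path("metadata/edges.jsonl")
edges_path.parent.mkdir(parents=True, exist_ok=True)
with edges_path.open("a", encoding="utf-8") as f:
    f.write(json.dumps(edge) + "\n")
```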

2 Likes

Curt, can you give an example of what you think would be in the file on disk versus in the graph metadata? It doesn’t have to be related to my papers project. I just want to visualize what you’re suggesting.

For the papers project, part of the lens/projection classifies each paper into one of 20 categories. All of my lenses have a similar structure so I wrote a bit of code that lets me do simple filtering on the properties within a projection. I can basically use code to scope a query to only the files in a specific category.

I’m just wondering if those are the types of edges you’re thinking would be in the graph. I had thought about promoting those properties out to a SQLite file, but it’s simple enough to just filter them in place at query time.
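For context, the in-place filtering I’m describing is roughly the sketch below. The file layout and field names are made up for illustration; my real code differs.

```python
import json
from pathlib import Path

# Hypothetical layout: every paper "foo.txt" has a sibling "foo.projection.json"
# holding the lens/projection properties, including a "category" field.
def files_in_category(root: str, category: str):
    for proj_path in Path(root).glob("*.projection.json"):
        props = json.loads(proj_path.read_text(encoding="utf-8"))
        if props.get("category") == category:
            # Map the projection file back to the paper it describes.
            yield proj_path.parent / proj_path.name.replace(".projection.json", ".txt")

# Scope a query to only the files in one of the 20 categories.
for paper in files_in_category("corpus/ml_papers", "reinforcement-learning"):
    print(paper)
```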

1 Like

This is exactly the reason I came to these forums. I have been attempting to use GPT as a writing assistant, and have tried several models with the same results. After an hour or so of talking with the GPT, it begins to forget critical data or will confuse the links previously established.

In the case of a writing assistant, things like the list of characters in the book and how they relate to one another will get corrupted, or just outright forgotten. This makes using GPT as a real tool impossible. There must be a way to flag the connections that I deem important, such that the GPT can periodically go back and refresh what we have discussed.

There needs to be a way to save this critical data that forms the basis of the work on a medium that I control, so either my local device or a cloud store I control. This data needs to be saved in a logical manner, so that if I ask the GPT to make changes to a character, it can refresh the saved info on characters, even if that means forgetting things about places temporarily due to limited short-term memory.

This all needs to happen seamlessly, so the user is not constantly trying to force the GPT to remember this data. If I am forced to save the chat to a file manually, and then try to feed important bits back in just to refresh a character list, thereby causing the GPT to forget all the information on places, then the “tool” is useless and nothing more than a toy.

It is critical that the GPT understand, either automatically or via user input, that it needs to refresh some part of short-term memory, and do so with minimal user intervention. This could be a prearranged command to refresh a certain branch of the data tree, or even just a plain voice/text prompt that indicates something is missing, causing the GPT to evaluate which branch of the data tree to refresh and possibly even which branch it may need to purge if short-term memory is an issue.
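Even something as simple as the sketch below, with the data split into branches saved on a medium I control, would cover most of what I’m describing. The names and layout are purely illustrative.

```python
import json
from pathlib import Path

# Minimal sketch of a user-controlled "story memory" on disk, split into
# branches (characters, places, ...) that can be refreshed independently.
MEMORY_DIR = Path("story_memory")

def save_branch(branch: str, data: dict) -> None:
    """Write one branch of the data tree to its own file."""
    MEMORY_DIR.mkdir(exist_ok=True)
    (MEMORY_DIR / f"{branch}.json").write_text(
        json.dumps(data, indent=2), encoding="utf-8"
    )

def load_branch(branch: str) -> dict:
    """Read one branch back so it can be re-injected into the prompt."""
    path = MEMORY_DIR / f"{branch}.json"
    return json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}

# e.g. save_branch("characters", {"Mara": {"age": 34, "sister_of": "Ilsa"}})
# Before a revision pass about a character, re-inject only
# load_branch("characters"), leaving "places" on disk until it is needed.
```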

I know this is not really about your post, so sorry for dumping this wall of text, but your post hit the nail on the head for what I as a writer am experiencing. I just hope someone with the knowledge and abilities to do something with this maybe reads this post.

When extracting content from papers, one of the biggest challenges is table formatting, which often isn’t extracted correctly. I’ve tested numerous open-source and paid libraries, but none have been good enough.

Have you developed a specialized method for this extraction, or are you using an off-the-shelf solution?

2 Likes

An example relation (edge in the graph) is what you have already alluded to, which is taking a document, or chunk of text, and classifying it into one of your 20 categories.

Another one is an edge that goes from a document to a summary of the document. This edge can also contain metadata on the model used to make the summary and its configuration, including the prompts used. This way you can backtrack and, say, “delete” these edges in the case where you later find out your model did a bad job.

In general, edges go from documents, or whatever chunks make up your nodes, to other chunks they are logically related to (with the relations derived by your agents, or some other algorithm). So documents to documents, documents to summaries, documents to a certain “lens perspective”, documents to classifications, etc.

As for in-place (on disk) vs. a database, in your situation, with terabytes of data, I have had better success creating metadata files that contain each type of edge relation. These files are also on disk. For example, for your 20 categories, you have a metadata file that lists each document identifier along with the category it corresponds to. Then you just read in this small file, filter to whatever category you want, and stream all files of that category into your next pipeline stage from disk.
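As a sketch of that metadata file, assuming a small CSV on disk and one text file per document (adapt the names to your layout):

```python
import csv
from pathlib import Path

# Hypothetical per-edge-type metadata file: one row per document, mapping it to
# its category, plus which model/prompt assigned it so bad edges can be dropped.
CATEGORY_FILE = Path("metadata/category_edges.csv")  # columns: doc_id, category, model, prompt_id

def stream_category(category: str, corpus_dir: str = "corpus"):
    """Read the small metadata file, then yield the documents in one category."""
    with CATEGORY_FILE.open(newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row["category"] == category:
                yield Path(corpus_dir) / f"{row['doc_id']}.txt"

# Feed only one category into the next pipeline stage.
for path in stream_category("reinforcement-learning"):
    text = path.read_text(encoding="utf-8")
    # ... next stage ...
```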

You could do it in a database too, but it might be overkill, and maybe not canonical in your situation where everything is on disk; it’s up to you.

1 Like

I’m ignoring tables, formulas, and even images for the moment. Lots of room for improvement there. Fortunately, a lot of the conclusions readers are expected to draw from things like images and figures can also be found in the text. There’s a lot of duplication in papers, and it’s rare that an author will present data in a table or image without also pointing out, later in the text, the key observation they want the reader to make… there are outliers, of course.

The other thing to note is that at the level I’m currently working at, I’m trying to get the model to simply draw inspiration from hundreds of papers, so the fine-grained details of individual tables aren’t important.

The LLM has been trained on every paper on the planet already, so I’m actually just exploring how I can steer it. My theory is that by showing it a smaller subset of really good ideas, you can get it to build on those ideas in a more focused way than it can without the added grounding.

If you think about it, any idea that’s expressible via language is theoretically expressible by an LLM. It’s just a matter of getting the model to generate the right sequence of tokens. The way you get the model to go down unexplored paths, from a token-generation perspective, is via chain of thought, so what I’m in essence trying to do is get the model to construct novel chains of thought using scientific papers as inspiration.
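As a toy illustration of the kind of steering I mean (not my actual engine; the summaries and prompt below are made up), using the standard OpenAI Python client:

```python
from openai import OpenAI  # assumes the official openai Python package

client = OpenAI()

# Hypothetical: a handful of condensed paper summaries chosen as grounding.
summaries = [
    "Paper A: sparse attention matches dense attention at a quarter of the cost...",
    "Paper B: curriculum ordering of training data improves sample efficiency...",
    "Paper C: ...",
]

# Ground the model on the curated subset, then ask for a chain of thought
# that builds on those ideas rather than the whole pretraining distribution.
prompt = (
    "Here are condensed summaries of several papers:\n\n"
    + "\n".join(f"- {s}" for s in summaries)
    + "\n\nThinking step by step, combine ideas from these papers into one "
      "novel research direction that none of them proposes on its own."
)

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
```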

2 Likes

I agree. However, the problem remains for me because LLMs struggle to reason effectively with lengthy inputs. Even when we reduce the token count by using very concise summaries and focusing only on key information, the amount of data needed for proper reasoning still poses a challenge. LLMs often fail to focus on just a few concepts within each prompt. In other words, even a short prompt (300 tokens) packed with a lot of information can easily confuse the LLM. I’m hopeful that GPT-5 will bring a significant improvement in this area.

1 Like

Can you DM me some examples of short densely packed prompts that the LLM struggles with? I’d just like to better understand what’s happening.

These models have a natural tendency to want to summarize their answers. The entire mechanism by which they work is more of a compression engine than anything. They can’t actually reason, so if they’re not transforming something from the input into the output, there’s a reason for it.

They often drop information, but it’s usually because of their desire to summarize.

1 Like

The prompt below has only 200 tokens:

A group of people is taking part in a physical education study. Various data were collected about these individuals, and by analyzing their ages, it was possible to establish the following relationships:

John is older than Mary. John is younger than Tom. Sarah is younger than Tom. Bob is older than Mary. Joseph is older than Kate. Olivia is older than Amelia. Joseph is younger than William. Bob is younger than Daniel. Amelia is younger than Emma. Kate is older than Sarah. Joseph is older than Mary. William is older than Mary. Sophia is older than Kate. Emma is younger than Tom. William is younger than Tom. Sophia is younger than John. Olivia is younger than William.

The study aims to measure the effects of an exercise on older individuals, and there is limited space available, so only the oldest person from this list can participate at this moment. You need to select the oldest person. If it is not possible to determine, just list the candidates for the oldest person. Work carefully.

GPT-4o-latest, GPT-4 Turbo, and Gemini-1.5-pro-exp-0827 all incorrectly state that Tom is the oldest.

*The only one that got it right was Sonnet 3.5, saying that the candidates are Tom and Daniel.

Note that this is a common example of condensing information. A large file could be reduced to these 200 tokens of essential information, but the LLM would struggle to reason about it due to its high density.
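For reference, the relations can be checked mechanically; a tiny script that lists everyone who is never stated to be younger than someone confirms that Tom and Daniel are the only candidates:

```python
# Each pair is (older, younger), transcribed from the statements above.
relations = [
    ("John", "Mary"), ("Tom", "John"), ("Tom", "Sarah"), ("Bob", "Mary"),
    ("Joseph", "Kate"), ("Olivia", "Amelia"), ("William", "Joseph"),
    ("Daniel", "Bob"), ("Emma", "Amelia"), ("Kate", "Sarah"),
    ("Joseph", "Mary"), ("William", "Mary"), ("Sophia", "Kate"),
    ("Tom", "Emma"), ("Tom", "William"), ("John", "Sophia"),
    ("William", "Olivia"),
]

people = {name for pair in relations for name in pair}
# Anyone stated to be younger than someone cannot be the oldest; chains of
# such statements add no upper bounds beyond the direct ones.
has_someone_older = {younger for _, younger in relations}
print(sorted(people - has_someone_older))  # ['Daniel', 'Tom']
```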