Validating Middle of Context in GPT-4-128K

Several empirical studies of how language models use long input contexts have found that models often struggle to use information located in the middle of long inputs, and that performance degrades as the input context grows longer.

The paper Lost in the Middle: How Language Models Use Long Contexts (arXiv:2307.03172) argues that in LLMs with large contexts, the “… performance is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts.” By the end of the paper, they “conclude with a practical case study of open-domain question answering, finding that the performance of language model readers saturates far before retriever recall. Our results and analysis provide a better understanding of how language models use their input context and provides new evaluation protocols for future long-context models.”

We have also seen experiments that test GPT-4-Turbo 128K’s ability to retrieve a fact hidden at different positions in a document and at different context sizes, an experiment known as ‘Needle In A Haystack’, where a piece of information is injected into a document at a random location.

In the experiments I have seen, which can be found in posts on X, the ‘middle of context’ problem seemed to be confirmed: the model could not retrieve the information when the context exceeded 60K tokens and the needle was placed at a depth between 50% and 70% of the document.

My initial issue with this test is that, while it is well known that the attention mechanism blurs (averages) the context somewhat, when you hide one small sentence at a depth between 30% and 70% of a 70K to 120K token context, it should still be very possible for the model to retrieve it accurately.

We decided to execute this experiment ourselves. For that, we started with a hypothesis.

HYPOTHESIS:

If we hide two needles instead of one, maybe we get pinched. In other words: if we make the signal stronger, the model will be able to retrieve it.

With this hypothesis we created the required code to perform the experiments.

EXPERIMENT 1:

Process

One simple way to increase the strength of the needle’s signal in the middle of the haystack is to insert it twice. Other options would be to insert a supporting fact next to the first one, or to insert the needle or supporting facts at several depths.

Note: Our code and experiments are heavily based on the code found in the GitHub repository gkamradt/LLMTest_NeedleInAHaystack (simple retrieval from LLMs at various context lengths to measure accuracy).

  • Step 1: Use Paul Graham essays as ‘background’ tokens, similar to one of the experiments mentioned above. Concatenated, these documents yield well over 120K tokens.
  • Step 2: Place the following statement at various depths: “The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day.”
  • Step 3: Using the OpenAI API, ask GPT-4 to answer the question “What is the most fun thing to do in San Francisco?” using only the context provided.
  • Step 4: Evaluate the GPT-4-Turbo-128K answer with another model (also GPT-4), again through the OpenAI API.

The above process was repeated for every combination of context size, from 60K to 120K tokens in increments of 10K, and needle depth, from 20% to 80% in increments of 10%. This resulted in a total of 49 experiments. A minimal sketch of this loop is shown below.
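For readers who want to reproduce the loop, here is a minimal sketch of how the haystack and needle can be assembled, assuming tiktoken is used to count tokens. The function names, and the back-to-back placement of the duplicated needle, are illustrative assumptions rather than the exact code in our notebook.

```python
import tiktoken

NEEDLE = ("The best thing to do in San Francisco is eat a sandwich "
          "and sit in Dolores Park on a sunny day.")

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer family used by GPT-4 models


def build_haystack(essays, context_size):
    """Concatenate the Paul Graham essays and truncate to `context_size` tokens."""
    tokens = enc.encode("\n\n".join(essays))
    return tokens[:context_size]


def insert_needle(haystack_tokens, depth_pct, copies=1):
    """Insert the needle at `depth_pct` percent of the context.

    copies=2 reproduces the 'two needles' variant of Experiment 1; placing the
    copies back to back is just one possible way to duplicate the signal.
    """
    needle_tokens = enc.encode(" " + NEEDLE) * copies
    cut = int(len(haystack_tokens) * depth_pct / 100)
    return enc.decode(haystack_tokens[:cut] + needle_tokens + haystack_tokens[cut:])


# 7 context sizes x 7 depths = 49 runs
context_sizes = range(60_000, 130_000, 10_000)  # 60K ... 120K tokens
depths = range(20, 90, 10)                      # 20% ... 80% depth
```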

Main differences from the code we used as a base:

  1. The original code uses LangChain to run the retrieval and evaluation prompts, while I call the GPT-4 API directly (a sketch of this direct call appears after this list).
  2. My notebook focuses on the area that proved weakest for GPT-4 in the experiments above, namely contexts from 60K to 120K tokens and depths from 20% to 80%, and excludes the ‘good’ areas.
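As a rough illustration of point 1, the direct calls might look like the sketch below. It assumes the openai>=1.0 Python client and the gpt-4-1106-preview (GPT-4-Turbo 128K) model name; the prompts are paraphrased and not the exact ones from our notebook.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

QUESTION = "What is the most fun thing to do in San Francisco?"


def retrieve(context_with_needle):
    """Step 3: ask GPT-4-Turbo to answer using only the provided context."""
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Answer the question using only the context provided."},
            {"role": "user",
             "content": f"Context:\n{context_with_needle}\n\nQuestion: {QUESTION}"},
        ],
    )
    return response.choices[0].message.content


def evaluate(answer):
    """Step 4: ask GPT-4 to judge whether the answer contains the needle."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "user",
             "content": ("Reference fact: eat a sandwich and sit in Dolores Park "
                         "on a sunny day.\n"
                         f"Candidate answer: {answer}\n"
                         "Does the candidate answer contain the reference fact? "
                         "Reply PASS or FAIL.")},
        ],
    )
    return response.choices[0].message.content
```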
RESULTS EXPERIMENT 1

In this experiment we injected the needle twice. With a simple duplication of the needle, we were able to get 100% accuracy in the retrieval.

Having achieved a perfect score with the ‘2 needles’ strategy, we went back to the differences between the other experiments and ours. Besides injecting 2 needles instead of 1, the other difference was that the previous experiments were run through a library, not through the OpenAI GPT-4 API directly.

EXPERIMENT 2

We decided to run the experiment with a single needle, using the GPT-4 API directly, as with the first experiment.

After setting up the script to inject just one needle and leaving the rest unchanged, we executed this second experiment.

RESULTS EXPERIMENT 2

The result with 1 needle injection: 100% retrieval

Both experiments found the needle 100% of the time within the ranges of context size and depth tested.

Conclusion

The hypothesis and experiments aim to test the ability of large language models (LLMs) like GPT-4 to retrieve specific information embedded within a large context. The experiments are designed to challenge the assertion that LLMs struggle to access relevant information located in the middle of long contexts, as suggested by the paper “Lost in the Middle: How Language Models Use Long Contexts.”

This approach involves embedding a specific statement (“The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day.”) at various depths within a large text body composed of Paul Graham essays. We then ask GPT-4 to locate this statement, or “needle,” within the “haystack” of background text, testing contexts ranging from 60K to 120K tokens and depths from 20% to 80%.

Key points from the experiments:

Two Needles Strategy: By duplicating the target statement within the text, we found that GPT-4 could retrieve the information with 100% accuracy, suggesting that reinforcing the signal (i.e., the target information) enhances the model’s retrieval capability.

Direct Use of GPT-4 API: Unlike other experiments that used LangChain, we directly employed the GPT-4 API. This might have influenced the results, as our method yielded a 100% retrieval rate even with a single instance of the target statement.

Contrast with Previous Experiments: Our results differ significantly from earlier experiments done by other researchers, where the model struggled with retrieving information in large contexts when it was located in the middle. This discrepancy suggests that the methodology, including factors like the use of LangChain or the nature of the embedded statement, might significantly impact the model’s performance.

In conclusion, these experiments suggest that GPT-4’s ability to retrieve specific information from large contexts can be significantly improved by reinforcing the target information, either by duplication or other means. Additionally, the direct use of the GPT-4 API seems to yield better results compared to methods using intermediary libraries. This indicates that the way information is presented to and processed by LLMs can greatly influence their performance in context retrieval tasks.


Great report, thank you! How many data points (e.g., different prompts with needles) are behind the straight 100% you mention? Also, surely filtering the context using embeddings (RAG) would increase the recall?

Thank you for your note. The corpus was a concatenation of essays from Paul Graham.

I created context sizes from 60K to 120K tokens, and for each one I injected the needle at different depths (from 20% to 80%) and did the retrieval.

I used the same needle in all iterations, a single sentence: “The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day.”

100% of the retrievals were successful.

I have not done this using RAG. This was injecting the context and needle directly into the model using the OpenAI API.

Some people have used Langchain, which gets closer to a RAG setup, and their results have not matched mine. I wonder if it is something about the way these libraries build the prompts.

This is an important clarification. Using only essays from a single author and a single needle sounds like a significant methodological limitation to me. It might be worth repeating the experiment with at least a couple of different corpora. Also, maybe the semantic similarity between the needle and the rest of the corpus plays a role?

Thanks for the feedback.

I am now running experiments using a corpus of financial documents.

Regarding the similarity between the corpus and the needle, my intuition is that similarity makes the needle harder to find, while a contrasting needle would be easier to spot. What do you think about this?

Yes, this would also have been my naive assumption. Keep us updated about your findings!

For those interested, here’s the updated article on the evaluation of long context in the new GPT-4-turbo-128k model.

I’ve added the ROUGE metrics for all experiments.

The results are impressive, with scores around 85% for most metrics.
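For anyone reproducing the metric, here is a sketch of how ROUGE between the needle and the model’s answer could be computed, assuming the rouge_score package; the exact ROUGE variants and references used in the article may differ.

```python
from rouge_score import rouge_scorer

needle = ("The best thing to do in San Francisco is eat a sandwich "
          "and sit in Dolores Park on a sunny day.")
answer = ("The best thing to do in San Francisco is to eat a sandwich "
          "and sit in Dolores Park on a sunny day.")  # example model output

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(needle, answer)  # reference first, candidate second

for name, score in scores.items():
    print(f"{name}: P={score.precision:.2f} R={score.recall:.2f} F1={score.fmeasure:.2f}")
```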

Great read. Just last night I was reading up about Lost in the Middle.

Excited to see the results! Thanks for posting


100% retrieval rate with a financial needle: challenging the ‘lost in the middle’ premise

Yesterday we shared an experiment where we placed a sentence at several depths in several context sizes of a corpus and asked the LLM to retrieve it.

This is commonly known as ‘searching for a needle in a haystack’.

The corpus was:
A collection of essays by Paul Graham.

The sentence was:
“The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day.”

The question for retrieval was:
“What is the most fun thing to do in San Francisco?”

The experiments used contexts from 60K to 120K tokens, and depths from 20% to 80%.

The retrieval rate was 100% in all combinations of context-depth.

Some readers mentioned that maybe the needle’s semantics were very closely aligned with the semantics of Paul Graham’s essays.

Today I have repeated the experiment using the same corpus but a ‘needle’ from a financial domain.

New needle:
“In 2022, Amazon’s total consolidated net sales revenue amounted to 514 billion U.S. dollars, 118 billion U.S. dollars of which were generated through international revenue channels.”

New question for retrieval:
“What was Amazon’s 2022 revenue?”

This time I focused on the larger contexts (90K to 120K tokens) and on the middle of the context, defined by depths between 30% and 70%. I would call it “the pure middle of the larger context”.

The results? 100% retrieval rate again.

An example of the response given by the LLM:
“Amazon’s total consolidated net sales revenue in 2022 amounted to 514 billion U.S. dollars.”

Details of the original experiment and a link to the notebook can be found in the first post of this thread.

Also, the log for this specific experiment with the financial needle on Amazon’s revenue can be found here:

I would encourage you to take this code and run experiments in your own domain, and maybe with open-source LLMs.


Great work here! It’s wonderful to see empirical work being done with these new models.

I think it would be interesting to see this done with a non-repeating corpus and a non-injected needle. How well does it find a sentence in a non-repeating corpus selected at random from the source text?

Thank you for your note. The corpus is, in fact, non-repeating: the concatenation of Paul Graham’s essays easily exceeds the 128K context, and while it all comes from the same author, every document in the context is different.

I have not done the experiment with retrieving information from the corpus directly (no needle) but I think it is a good next experiment.

I did test asking the ‘needle’ question with no needle injected, and the result was as expected: the LLM said it could not find the requested information. So we can rule out chance.


Very interesting! Even without a no needle test it looks more promising than the missing middle found by the original study. I wonder what they did that caused the problem.

Great report, thanks a lot!
You mentioned that the retrieval breaks around 60k up to 120k; do you have any information about 60k or less?
Will information retrieval be stable at, say, 59k tokens?
Is there a noticeable decrease in accuracy leading up to 60k, so that it might be better to chunk into smaller context blocks, maybe 30k or something like that?

Thank you for your note.

I am reporting that the retrieval never broke in my 60k+ context sizes, at depths from 20% to 80%.

I didn’t test below 60K.


Oh okay, I thought you also reproduced their results with Langchain (or whatever other mistake they made). I might look into it myself, because at the context size we are now at, I would gladly cut down the chunk size for perfect accuracy, even if the 60k+ accuracy is already 99.9% and below 60k gets to 99.99%…

I am also wondering whether the needle-in-a-haystack approach is an accurate measure of the model’s capability to extract more detailed information, since this approach generally just needs an exact match on some keywords that indicate the correct location in the context. If the model has to answer more indirect questions, the results might look different.

I will report back if I try some stuff out

Thanks Vincent.

The original experiment used LlamaIndex and showed poor performance on long contexts. My goal was to re-create the exact experiment, but using the API directly rather than through a wrapper like that.

My result, as shared, was 100% retrieval rate.

Regarding using a less obvious question to retrieve the needle, that is an experiment that I will do as well. I will share results.

Thanks for the follow up!

Juan