Performance of GPT-4o on the Needle in a Haystack Benchmark

Keep in mind that this test in particular does not mean the model can recall arbitrary documents you give it, or that it will retain all of that context. Almost all of these models start making mistakes after roughly 8,000 tokens.

We have tried millions of permutations: different files, knowledge bases containing 10,000 files across all kinds of file types and contexts. Having seen the same pattern so many times, 8,000 tokens looks to us like the reliable limit for these LLMs.
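For readers who want to reproduce this kind of probe, here is a minimal sketch of a needle-in-a-haystack check, assuming the official OpenAI Python client. The needle text, the filler sentence, and the chosen context sizes and depths are all illustrative, not the authors' actual test setup.

```python
# Minimal needle-in-a-haystack sketch (illustrative values throughout).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

NEEDLE = "The secret passphrase is 'blue-falcon-42'."          # hypothetical needle
FILLER = "The quick brown fox jumps over the lazy dog. "       # repeated padding

def build_haystack(total_words: int, depth: float) -> str:
    """Bury NEEDLE at a fractional depth inside ~total_words of filler."""
    words = (FILLER * (total_words // len(FILLER.split()) + 1)).split()[:total_words]
    insert_at = int(len(words) * depth)
    return " ".join(words[:insert_at] + [NEEDLE] + words[insert_at:])

def probe(total_words: int, depth: float) -> bool:
    """Ask the model to retrieve the needle; return True on exact recall."""
    haystack = build_haystack(total_words, depth)
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the document."},
            {"role": "user",
             "content": f"{haystack}\n\nWhat is the secret passphrase?"},
        ],
    )
    return "blue-falcon-42" in (resp.choices[0].message.content or "")

# Sweep context sizes and insertion depths; reliability tends to drop as
# the context grows (the ~8,000-token ceiling described above).
for total_words in (1_000, 6_000, 12_000):  # very roughly ~1.3 tokens per word
    for depth in (0.1, 0.5, 0.9):
        print(total_words, depth, probe(total_words, depth))
```

Sweeping both the context size and the insertion depth matters: a model can pass at one depth and fail at another, so a single placement tells you little about where recall actually degrades.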