Performance of GPT-4o on the Needle in a Haystack Benchmark

I’m curious whether anyone has tested the new GPT-4o model using the “Needle in a Haystack” benchmark. If not, perhaps the developers could share their insights on how it might perform compared to GPT-4 Turbo. Any feedback or data would be greatly appreciated!

posted a few here

For your query -

but also note

I find the latter quite curious. It echoes my openai-eval PR around multistep math problems, where the model has to provide the answer and no other text.
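To make the "answer and no other text" requirement concrete, here is a minimal sketch of how a strict grader differs from a lenient one. This is not the grader from the PR mentioned above; the function names `strict_match` and `lenient_match` are made up for illustration.

```python
def strict_match(model_output, expected):
    """Pass only if the output is exactly the expected answer (surrounding whitespace ignored)."""
    return model_output.strip() == expected


def lenient_match(model_output, expected):
    """Pass if the expected answer appears anywhere in the output."""
    return expected in model_output


# Hypothetical model outputs for a problem whose answer is "42".
outputs = ["42", "The answer is 42.", " 42\n"]
for out in outputs:
    print(repr(out), strict_match(out, "42"), lenient_match(out, "42"))
```

A model that wraps a correct answer in extra words passes the lenient check but fails the strict one, so the two grading styles can rank the same models differently.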


Here we go for the needle in haystack

Very curious result. Seems to conflict with the above result from

This is what I was looking for, thank you!

At the end of the day, when it comes to needle in a haystack, Claude 3 Opus still outperforms gpt4o, no?


You mean it really underperforms? Check out the benchmark. GPT-4o is incomparably better.

Opus beats it still according to those benchmarks. Omni beats sonnet though, which is not at all surprising.

Which Claude 3 Opus needle in a haystack benchmark are you referring to, where it underperforms GPT-4o? Care to share the source?

From what I’ve seen, opus shows better results, over a larger context window too

Here you need to open the image and turn the page to the second one.

In that post, it’s using Sonnet as the comparison, not Opus

This is the only benchmark I could find on Opus, which is in a different format unfortunately


Keep in mind that this specific test does not mean the model can actually recall random documents that you give it, or that it will recall all of that context. Almost all of these models start making mistakes after 8,000 or so tokens.

We have tried literally millions of permutations with different files and KBs containing 10,000 files of all kinds of file types and contexts, and we’ve seen it so many times that, for us, 8,000 tokens seems to be the reliable limit for these LLMs.
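For anyone who wants to check this on their own data, here is a rough sketch of how a needle in a haystack grid is put together: plant one fact at a given depth in filler text, ask for it back, and score each (length, depth) cell. This is not the harness behind any result in this thread. `toy_model` is a hypothetical stand-in for a real API call, sentence counts stand in for token counts, and the needle text is made up.

```python
FILLER = "The quick brown fox jumps over the lazy dog."
NEEDLE = "The secret passphrase is blue-giraffe-42."
QUESTION = "What is the secret passphrase?"
EXPECTED = "blue-giraffe-42"


def build_haystack(n_sentences, depth):
    """Return n_sentences of filler with NEEDLE inserted at a relative depth in [0, 1]."""
    sentences = [FILLER] * n_sentences
    position = round(depth * n_sentences)
    sentences.insert(position, NEEDLE)
    return " ".join(sentences)


def run_grid(ask_model, lengths, depths):
    """Score retrieval (1 = found, 0 = missed) for every (length, depth) cell."""
    results = {}
    for n in lengths:
        for d in depths:
            context = build_haystack(n, d)
            answer = ask_model(context, QUESTION)
            results[(n, d)] = int(EXPECTED in answer)
    return results


def toy_model(context, question, window_chars=2000):
    """Hypothetical stand-in for an LLM: it only 'sees' the last window_chars characters."""
    visible = context[-window_chars:]
    return EXPECTED if EXPECTED in visible else "I don't know."


grid = run_grid(toy_model, lengths=[10, 100, 1000], depths=[0.0, 0.5, 1.0])
for (n, d), score in sorted(grid.items()):
    print(f"sentences={n:5d} depth={d:.1f} retrieved={score}")
```

To test a real model, replace `toy_model` with a function that sends the context and question to the API of your choice. Note that a single planted fact in repetitive filler is a much easier task than recalling details from real mixed documents, which is the gap described above.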

The ONLY reason I cancelled ChatGPT-4 (Turbo) and moved over to Claude (Opus) was how easily it lost track of conversations. I had to keep reminding it of the initial text I had provided, and also of the small details.

I didn’t have extremely long conversations with it, because I quickly realized it struggled with really long ones (that’s what she…), but they were not short either; they were moderate. And when it started messing up, I would just clear all my conversations and start fresh. But I had to do that quite often.

So this benchmark is really important for me. Unfortunately, the only one I have found so far is the one shared above, and it compares it to Claude (Sonnet), which makes no sense, as it should be comparing it to Claude (Opus).

I guess I’ll have to wait until it’s released to the public for free and test it myself. I’ll simply go with whatever is best, Claude or OpenAI. I don’t care, I just want it to get the job done.


These results are credible and echo other benchmarks and testing I’ve done myself, eg:

Long context retrieval is important, otherwise long context becomes pointless. Shortcuts here absolutely would speed things up and make them cheaper, but quality has obviously degraded since Turbo.

That said, the > 32k results are quite fascinating. Maybe the idea is that 4o is better on very long context!! Which is wild, when you think about it. Still, compromises have been made and this is not a breakthrough.

My general sense, looking around the world at all the releases that have occurred since GPT-4 (big 3, OS models, agents, etc.), is that we are currently bumping up against a ceiling in terms of capability.

I was referring to these.