Performance of GPT-4o on the Needle in a Haystack Benchmark

brandonminiwheats · May 13, 2024, 9:50pm

I’m curious if anyone has tested the new GPT-4o model using the “Needle in a Haystack” benchmark. If not, perhaps the developers could share their insights on how it might perform in comparison to the GPT-4 Turbo. Any feedback or data would be greatly appreciated!

qrdl · May 14, 2024, 7:57am

posted a few here

For your query - https://twitter.com/LouisKnightWebb/status/1790265899255017893

but also note

https://twitter.com/JoshPurtell/status/1790102029773246861

I find the latter quite curious. It echoes my openai-eval PR around multistep math problems, where the model has to provide the answer and no other text.

qrdl · May 15, 2024, 8:00am

Here we go for the needle in haystack

https://twitter.com/markerdmann/status/1790600854451458356

Very curious result. Seems to conflict with the above result from https://twitter.com/JoshPurtell/status/1790102029773246861

Jirito · May 15, 2024, 4:35pm

This is what I was looking for, thank you!

At the end of the day, when it comes to needle in a haystack, Claude 3 Opus still outperforms gpt4o, no?

Accord · May 17, 2024, 4:06am

You mean it really underperforms? Check out the benchmark. GPT-4o is incomparably better.

hmill901 · May 17, 2024, 4:16am

Opus beats it still according to those benchmarks. Omni beats sonnet though, which is not at all surprising.

Jirito · May 17, 2024, 4:27am

Which Claude 3 Opus needle-in-a-haystack benchmark are you referring to, where it underperforms GPT-4o on the same test? Care to share the source?

From what I’ve seen, opus shows better results, over a larger context window too

Accord · May 17, 2024, 4:27am

Here. You need to open the image and go to the second page.

Jirito · May 17, 2024, 4:34am

In that post, it’s using Sonnet as the comparison, not Opus

This is the only benchmark I could find on Opus, which is in a different format unfortunately

EduGPT · May 18, 2024, 1:21am

Keep in mind that this specific test does not show that a model can actually recall arbitrary documents you give it, or that it will recall all of that context. Almost all of these models start making mistakes after 8,000 or so tokens.

We have tried literally millions of permutations with different files and KBs containing 10,000 files of all kinds of file types and contexts. We’ve seen it so many times that, for us, 8,000 tokens seems to be the reliable limit for these LLMs.
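For readers unfamiliar with the setup, a needle-in-a-haystack probe can be sketched roughly as below. This is a minimal illustration, not the harness behind any of the results linked in this thread: the function names, the filler sentences, the example needle, and the substring-based scoring are all assumptions, and the call to an actual model is left out.

```python
def build_haystack_prompt(filler_sentences, needle, depth, question):
    """Insert `needle` at a relative `depth` of the filler text.

    depth = 0.0 puts the needle at the very start, 1.0 at the very end.
    All names here are hypothetical and only illustrate the idea.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    # Index of the filler sentence the needle is placed in front of.
    position = int(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    context = " ".join(sentences)
    return f"{context}\n\nQuestion: {question}\nAnswer using only the text above."


def needle_recalled(model_answer, expected_fact):
    """Crude scoring (an assumption): case-insensitive substring match."""
    return expected_fact.lower() in model_answer.lower()


# Example: a small haystack with the needle placed halfway through.
filler = [f"Filler sentence number {i}." for i in range(10)]
needle = "The secret code is 7421."
prompt = build_haystack_prompt(filler, needle, 0.5, "What is the secret code?")
# `prompt` would then be sent to the model under test, and its reply
# passed to needle_recalled(reply, "7421").
```

Varying `depth` and the amount of filler is how such tests typically probe recall at different positions and context lengths. Note that this checks retrieval of a single planted fact, which is narrower than the multi-document recall described in the post above.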

linkarzu · May 18, 2024, 11:04am

The ONLY reason I cancelled chatgpt-4 (turbo) and moved over to Claude (opus) was how easily it lost track of conversations. I had to keep reminding it of the initial text I had provided, and also of the small details.

I didn’t have extremely long conversations with it, because I quickly realized it struggled with really long ones (that’s what she…), but they were not short either; they were moderate. And when it started messing up, I just used to clear all my conversations and start fresh. But I had to do that quite often.

So this benchmark is really important for me. Unfortunately, the only one I have found so far is the one shared above, http://nian.llmonpy.ai, and it compares it to Claude (sonnet), which makes no sense, as it should be comparing it to Claude (opus).

I guess I’ll have to wait until it’s released to the public for free and test it myself. I’ll simply go with what’s best, claude or openai. I don’t care, I just want it to get the job done.

qrdl · May 19, 2024, 5:58am

https://x.com/rohanpaul_ai/status/1791885754831929597

These results are credible and echo other benchmarks and testing I’ve done myself, eg: x.com

Long context retrieval is important, otherwise long context becomes pointless. Shortcuts here absolutely would speed things up / make them cheaper, but quality has obviously degraded since turbo.

That said, the > 32k results are quite fascinating. Maybe the idea is that 4o is better on very long context!! Which is wild, when you think about it. Still, compromises have been made and it is not a breakthrough.

My general sense, looking around the world at all the releases that have occurred since gpt4 (big 3, os models, agents, etc.), is that we are currently bumping up against a ceiling in terms of capability.

Accord · May 19, 2024, 9:45pm

I was referring to these.

brandonminiwheats · June 13, 2024, 12:02pm

Has anyone done any more testing on this?
