Performance of GPT-4o on the Needle in a Haystack Benchmark

The ONLY reason I cancelled ChatGPT-4 (Turbo) and moved over to Claude (Opus) was how easily it lost track of conversations. I had to keep reminding it of the initial text I had provided, and of the small details too.

I didn’t have extremely long conversations with it, because I quickly realized it struggled with really long ones (that’s what she…), but they weren’t short either — moderate, I’d say. And when it started messing up, I would just clear all my conversations and start fresh. But I had to do that quite often.

So this benchmark is really important to me. Unfortunately, the only one I’ve found so far is the one shared above, http://nian.llmonpy.ai, and it compares GPT-4o to Claude (Sonnet), which makes no sense — it should be comparing it to Claude (Opus).
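For anyone curious what a needle-in-a-haystack test actually measures, here's a minimal sketch of how one is typically constructed (the helper names are hypothetical; real benchmarks like the linked one vary the needle, the context length, and the insertion depth):

```python
def build_haystack_prompt(filler: str, needle: str, depth: float, target_chars: int) -> str:
    """Repeat filler text up to target_chars, then insert the needle
    at a relative depth between 0.0 (start) and 1.0 (end)."""
    haystack = (filler * (target_chars // len(filler) + 1))[:target_chars]
    pos = int(len(haystack) * depth)
    return haystack[:pos] + " " + needle + " " + haystack[pos:]

def score_retrieval(model_answer: str, expected: str) -> bool:
    """Pass/fail: did the model's reply reproduce the hidden fact?"""
    return expected.lower() in model_answer.lower()

filler = "The sky was grey and the streets were quiet. "
needle = "The secret passcode is 7042."
prompt = build_haystack_prompt(filler, needle, depth=0.5, target_chars=2000)
# The prompt plus a question like "What is the secret passcode?" would be
# sent to each model; score_retrieval then checks the reply.
```

Running this across many depths and context lengths, and for each model, is what produces those retrieval-accuracy heatmaps — which is exactly the "losing track of small details" problem described above.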

I guess I’ll have to wait until it’s released to the public for free and test it myself. Then I’ll simply go with whichever is best, Claude or OpenAI — I don’t care, I just want it to get the job done.
