Performance of GPT-4o on the Needle in a Haystack Benchmark

Keep in mind that this test in particular does not mean the model can recall arbitrary documents you give it, or that it will retain all of that context. Almost all of these models start making mistakes after roughly 8,000 tokens.

We have tried millions of permutations: different files, knowledge bases containing 10,000 files across all kinds of file types and contexts. Having seen the same pattern so many times, 8,000 tokens looks to us like the reliable limit for these LLMs.
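For readers who want to reproduce this kind of probe, here is a minimal sketch of a needle-in-a-haystack check, assuming the official OpenAI Python client. The needle text, the filler sentence, and the chosen context sizes and depths are all illustrative, not the authors' actual test setup.

```python
# Minimal needle-in-a-haystack sketch (illustrative values throughout).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

NEEDLE = "The secret passphrase is 'blue-falcon-42'."          # hypothetical needle
FILLER = "The quick brown fox jumps over the lazy dog. "       # repeated padding

def build_haystack(total_words: int, depth: float) -> str:
    """Bury NEEDLE at a fractional depth inside ~total_words of filler."""
    words = (FILLER * (total_words // len(FILLER.split()) + 1)).split()[:total_words]
    insert_at = int(len(words) * depth)
    return " ".join(words[:insert_at] + [NEEDLE] + words[insert_at:])

def probe(total_words: int, depth: float) -> bool:
    """Ask the model to retrieve the needle; return True on exact recall."""
    haystack = build_haystack(total_words, depth)
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the document."},
            {"role": "user",
             "content": f"{haystack}\n\nWhat is the secret passphrase?"},
        ],
    )
    return "blue-falcon-42" in (resp.choices[0].message.content or "")

# Sweep context sizes and insertion depths; reliability tends to drop as
# the context grows (the ~8,000-token ceiling described above).
for total_words in (1_000, 6_000, 12_000):  # very roughly ~1.3 tokens per word
    for depth in (0.1, 0.5, 0.9):
        print(total_words, depth, probe(total_words, depth))
```

Sweeping both the context size and the insertion depth matters: a model can pass at one depth and fail at another, so a single placement tells you little about where recall actually degrades.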