List of fresh gpt-4o benchmarks, please add

qrdl · May 14, 2024, 7:23am

Thought I’d share some benchmarks I’ve run into

better perf

x.com really great eval. note how opus was lobotomized here (likely on purpose for safety reasons)

same perf

x.com hints of asymptotic behavior

worse

x.com ← intriguing. might hint as to where the speed / cost improvements came from
https://x.com/rohanpaul_ai/status/1791885754831929597

It’s hard to tell which benchmarks have been optimized for specific models. E.g., a lot of work went into getting aider.chat to perform well on gpt-4-turbo.

qrdl · May 16, 2024, 3:55pm

The dangers of using the lmsys leaderboard: huge drop in 4o’s score since openai’s rather excitable initial tweet x.com
