List of fresh gpt-4o benchmarks, please add

Thought I’d share some benchmarks I’ve run into

better perf

x.com really great eval. note how opus was lobotomized here (likely on purpose for safety reasons)

same perf

x.com hints of asymptotic behavior

worse

x.com ← intriguing. might hint at where the speed / cost improvements came from
https://x.com/rohanpaul_ai/status/1791885754831929597

It’s hard to tell which benchmarks have been optimized for specific models. E.g., a lot of work went into getting aider.chat to perform well on gpt-4-turbo.


The dangers of using the lmsys leaderboard: huge drop in 4o’s score from openai’s rather excitable initial tweet x.com
