List of fresh gpt-4o benchmarks, please add

Thought I’d share some benchmarks I’ve run into

better perf: really great eval. Note how opus was lobotomized here (likely on purpose, for safety reasons)

same perf: hints of asymptotic behavior

worse perf ← intriguing. Might hint at where the speed / cost improvements came from

It’s hard to tell which benchmarks have been optimized for specific models. E.g., a lot of work went into getting to perform on gpt-4-turbo


The dangers of relying on the LMSYS leaderboard: a huge drop in 4o’s score from the one in OpenAI’s rather excitable initial tweet