Thought I’d share some benchmarks I’ve run into
better perf
x.com really great eval. note how opus was lobotomized here (likely on purpose for safety reasons)
same perf
x.com hints of asymptotic behavior
worse
x.com ← intriguing. might hint at where the speed / cost improvements came from
https://x.com/rohanpaul_ai/status/1791885754831929597
It’s hard to tell which benchmarks have been optimized for specific models. E.g., a lot of work went into getting aider.chat to perform well on gpt-4-turbo.