Worse results when using GPT-4o as an evaluator

It would be nice to see how much longer gpt-4 took compared to gpt-4o. If I recall correctly, gpt-4o is a quantization of gpt-4-turbo, but I could be wrong about this!

I’ve posted a comment on a topic comparing both models. The chart I provided was published by OpenAI on GitHub; if I recall correctly, it was in the simple-evals repository.