Benchmarks and leaderboards don't match my experience

Hey, just wondering if other developers working with these LLMs have had the same experience as me.

GPT-4o scores better on benchmarks and leaderboards, but it really underperforms when I plug it into my applications. It doesn’t follow instructions as well as GPT-4-turbo and Llama-3-70b do (multiple instructions at once are really bad), and it's especially worse at acting intelligently in the specific way the system prompt asks for.

Anyone else?

