Hi!
You might also want to evaluate GPT-4-Turbo (maybe 1106 or 0125), GPT-4 (0314), and of course GPT-4.5 (don’t forget the system prompt). You might be surprised (and full disclosure, I’d be interested in your results!)
My expectation is that these tasks require an exquisite understanding of nuance - something that bigger models are considerably better at. o3-mini is just a tiny, tiny model that’s been trained to almost algorithmically churn through reasoning steps - but reasoning ability alone isn’t really a sign of high intelligence in my book, especially if the reasoner isn’t capable of understanding the condensed subject matter in the first place.
In other words, o3-mini is much better at pulling in a lot of diverse (and simple) information from the broader context and filtering/aggregating it for relevance, but not so much at grasping individual, condensed information.
They’re just different tools for different jobs, and you’ve hit one of the major limitations of tiny reasoning models. There’s a lot of opportunity in running these models in tandem and using the right one for the right task.
Does this somewhat answer your question?