Lately, I’ve been comparing GPT-4o and o3-mini on the same task to evaluate their problem-solving capabilities. The task is to determine whether User 1 is a phishing scammer based on a conversation between User 1 and User 2.
Since the task requires inference from the conversation, I initially expected o3-mini, being a reasoning model, to achieve a higher F1-score than GPT-4o. In practice, though, GPT-4o outperformed o3-mini in most of my runs, which made me curious about the underlying reasons. If you have any ideas, I’d love to hear your thoughts.
For context, the prompt provides a list of suspicious behaviors to consider when deciding whether the conversation is a scam, and the model is expected to return True if certain conditions are met and False otherwise. I added “let’s think step by step” to GPT-4o’s prompt to encourage chain-of-thought (CoT) reasoning, but omitted it for o3-mini, since I’d heard that explicit CoT prompts can hurt reasoning models. Other than that, both models received exactly the same prompt. A stripped-down version of my harness is sketched below.
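Here’s roughly what the eval loop looks like (simplified sketch: the suspicious-behavior list here is a paraphrased placeholder, `conversations` stands in for my real dataset of `(text, label)` pairs, and the answer parsing is deliberately naive):

```python
from openai import OpenAI
from sklearn.metrics import f1_score

client = OpenAI()

# Placeholder rules; my real prompt lists the suspicious behaviors in detail.
RULES = (
    "You are given a conversation between User 1 and User 2.\n"
    "Consider suspicious behaviors such as requests for money, urgency "
    "pressure, and refusal to verify identity.\n"
    "Reply with exactly one word, True or False: is User 1 a phishing scammer?"
)

def classify(model: str, conversation: str, use_cot: bool) -> bool:
    prompt = RULES + "\n\n" + conversation
    if use_cot:
        # Added only for GPT-4o; omitted for o3-mini.
        prompt += "\n\nLet's think step by step, then end with True or False."
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    # Naive parse: assumes the verdict is the last word of the reply.
    return resp.choices[0].message.content.strip().split()[-1].lower() == "true"

# `conversations` is assumed to be a list of (text, label) pairs.
def evaluate(model: str, conversations, use_cot: bool) -> float:
    labels = [label for _, label in conversations]
    preds = [classify(model, text, use_cot) for text, _ in conversations]
    return f1_score(labels, preds)
```

So a run is just `evaluate("gpt-4o", data, use_cot=True)` versus `evaluate("o3-mini", data, use_cot=False)`.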
Could this be the reason? Was the prompt better suited to GPT-4o, causing the performance drop for o3-mini? If so, what kind of prompting would be more effective for o3-mini?
I’ve been exploring multiple angles but have hit some limitations, so I wanted to reach out. If you have any insights or experiences to share, I’d really appreciate it!
You might also want to evaluate GPT-4-Turbo (maybe 1106 or 0125), GPT-4 (0314), and obviously 4.5 (don’t forget the system prompt). You might be surprised (and full disclosure, I’d be interested in your results!).
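Your harness should only need the model string swapped, something like this (reusing `evaluate()` and `conversations` from your sketch upthread; model IDs are from memory and some are deprecated, so check which ones your account can still call):

```python
# Hypothetical sweep over the suggested checkpoints.
for model in ["gpt-4-1106-preview", "gpt-4-0125-preview", "gpt-4-0314",
              "gpt-4.5-preview", "gpt-4o", "o3-mini"]:
    use_cot = not model.startswith("o3")  # keep CoT prompting off for reasoning models
    print(model, evaluate(model, conversations, use_cot))
```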
My expectation is that these tasks require an exquisite understanding of nuance, something that bigger models are considerably better at. o3-mini is just a tiny, tiny model that’s been trained to churn through reasoning steps almost algorithmically. But reasoning ability alone isn’t really a sign of high intelligence in my book, especially if the reasoner isn’t capable of understanding the condensed subject matter in the first place.
In other words, o3-mini is much better at pulling in a lot of diverse (and simple) information from the broader context and filtering/aggregating it for relevance, but not so much at grasping individual, condensed pieces of information.
They’re just different tools for different jobs. And you’ve hit one of the major limitations of tiny reasoning models. There’s a lot of opportunity in letting these models run in tandem, and using the right one for the right task.
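To make the tandem idea concrete, here’s the shape of what I mean (purely a sketch with made-up prompts; the division of labor is the point): o3-mini sweeps the whole conversation for candidate signals, and the bigger model makes the nuanced final call.

```python
from openai import OpenAI

client = OpenAI()

def extract_signals(conversation: str) -> str:
    # o3-mini's strength: sweep broad context and aggregate what's relevant.
    resp = client.chat.completions.create(
        model="o3-mini",
        messages=[{"role": "user", "content":
            "List every potentially suspicious behavior by User 1 in this "
            "conversation, one per line. If none, say 'none'.\n\n" + conversation}],
    )
    return resp.choices[0].message.content

def verdict(conversation: str) -> bool:
    # GPT-4o's strength: the nuanced judgment call on condensed evidence.
    signals = extract_signals(conversation)
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content":
            "Based on these observations, is User 1 a phishing scammer? "
            "Reply with exactly True or False.\n\n" + signals}],
    )
    return resp.choices[0].message.content.strip().lower().startswith("true")
```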
I had vaguely assumed that since o3-mini is a reasoning model, its capabilities would be on par with GPT-4o’s. But I think I overlooked how much model size can impact the results!
I actually wanted to test GPT-4.5 since it was recently released, but I haven’t been able to due to cost constraints. I’ll follow your recommendation and test with GPT-4-Turbo or GPT-4 instead! I’m also curious whether they catch cases that GPT-4o misses. (If the results are good, I’ll share them!)