GPT 4.5 | A Thank You To The Team Behind It

When I don’t agree, I post. So, it’s only fair to do the same when I’m genuinely impressed.

The last two GPT 4.0 upgrades left me genuinely concerned about OpenAI’s long-term direction, enough that I found myself questioning my ongoing AI partnership strategy and evaluating alternatives. Despite GPT 4.0 remaining objectively (and frustratingly) ahead of its LLM peers, I sensed a troubling shift in focus and overall business strategy affecting the user experience, and I saw the updates as a regression for the models available to Plus subscribers.

However, after “rigorously” testing GPT 4.5 in highly complex head-to-head challenges against its GPT 4.0 predecessor, purposely exploiting its assumed weaknesses, with the results independently validated by Claude Sonnet 3.7, I can confidently say that my faith has been entirely restored.

To be frank, I had to change my shorts.

A sincere congratulations to the entire team behind GPT 4.5.

Thank you!

  • A (no longer) concerned AI geek

Awesome to hear you like it! What’ve you been using 4.5 for that you find it better at?

Still breaking it in, but the next phase is to see how much it improves coding on a personal project I’ve been working on over time with various GPT models. If there’s a significant difference there, then we need to pop some bottles :slight_smile:


It’s cool what you found about GPT 4.5, but I can’t help thinking you must’ve spent an arm and a leg, since you said you did very rigorous testing. I’d consider the $200 plan spending an arm and a leg, too. :laughing:

I have my own method for systematically testing LLMs against each other. The process involves pitting different models, from multiple vendors and versions, directly against one another in solving highly complex puzzles. Each model designs, executes, and rigorously scores these puzzles, purposely targeting its competitors’ known weak points, and the results of each round are shared to promote competitive drive (just a pattern match, I know). Once weaknesses in reasoning or design are identified, I loop back and retest those exact areas later to measure improvement, consistency, and growth over the game. There are other areas and ways I like to confirm upgrades, but this one is fun and nothing a monkey couldn’t handle.
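For anyone curious what a loop like that might look like, here’s a rough Python sketch, not my actual harness. The `ask(model, prompt)` helper is a hypothetical stand-in for whichever vendor API each model needs; wire it up to your own clients.

```python
# Rough sketch of a cross-model puzzle "tournament": each model designs a puzzle,
# its rivals solve it showing their work, the designer scores the attempts, and
# the scoreboard is shared back before the next round. Not an official harness.
from dataclasses import dataclass


@dataclass
class Result:
    designer: str
    solver: str
    score: float
    answer: str


def ask(model: str, prompt: str) -> str:
    """Placeholder: route the prompt to the right vendor SDK for `model`."""
    raise NotImplementedError("plug in your own API client code here")


def tournament(models: list[str], rounds: int = 1) -> list[Result]:
    results: list[Result] = []
    for _ in range(rounds):
        for designer in models:
            # Each model designs a puzzle aimed at its competitors' weak spots.
            puzzle = ask(designer, "Design a hard puzzle that targets known "
                                   "weaknesses of rival LLMs. Include a scoring rubric.")
            for solver in models:
                if solver == designer:
                    continue
                # Solvers must show all work so their reasoning can be compared.
                answer = ask(solver, f"Solve this puzzle, showing all work:\n{puzzle}")
                # The designer grades the attempt against its own rubric.
                grade = ask(designer, "Score this answer from 0 to 10 against your "
                                      f"rubric. Reply with the number only.\n{answer}")
                results.append(Result(designer, solver, float(grade), answer))
        # Share the running scoreboard with every model to promote "competitive drive".
        board = "\n".join(f"{r.designer} vs {r.solver}: {r.score}" for r in results)
        for model in models:
            ask(model, f"Current scoreboard:\n{board}\n"
                       "Comment briefly on where you lost points.")
    return results
```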

Do you share these results with anyone or post them anywhere we can see them? I’m curious to see the results.

Sometimes, it depends on what I find each time; what I’m comfortable sharing here isn’t something I keep stored long-term after reviewing it. To be honest, I find these AI games both interesting and fun, but they also serve as a way for me to learn. It’s cool to see how each model differs in reasoning (they must show all work), where they excel, and where I can push them to adapt in real time.