Some of the results of o3 sound impressive, but considering they're spending something like $3,200 USD per task on stuff like FrontierMath, I suspect you can achieve o3-level performance via most of the best models today, e.g. DeepSeek.
If I had to guess, stuff like Strawberry and o3 is just CoT, and there have been no real breakthroughs at OAI here compared to what the general community can do.
In fact, I suspect there are CoT solutions out there that are far superior, because teams can build all this without massive GPUs. Just point it at APIs.
Why isn't o1 beating out everything? Because custom-built CoT agents (many of which use their own LLMs!) are far superior.
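To illustrate the "just point it at APIs" idea: here's a toy sketch of the kind of DIY CoT loop people build on top of a chat endpoint. `call_llm` is a hypothetical stand-in for whatever API client you use; nothing here is an actual OpenAI interface.

```python
# Toy chain-of-thought loop: draft an answer, ask the model to critique
# it, and stop once the critique says it's fine. `call_llm` is a
# stand-in for any chat-completions API call.

def cot_agent(question, call_llm, max_rounds=3):
    transcript = [f"Question: {question}", "Think step by step."]
    answer = None
    for _ in range(max_rounds):
        answer = call_llm("\n".join(transcript))
        transcript.append(f"Draft answer: {answer}")
        verdict = call_llm(
            "\n".join(transcript) + "\nIs this answer correct? Reply OK or REVISE."
        )
        if verdict.strip().startswith("OK"):
            break
        transcript.append("Critique: " + verdict)
    return answer

# Usage with a fake model that demands one revision before accepting:
replies = iter(["4", "REVISE: show your work", "2 + 2 = 4", "OK"])
print(cot_agent("What is 2 + 2?", lambda prompt: next(replies)))
# prints "2 + 2 = 4"
```

The same loop works against any provider's API, which is the whole point: the orchestration lives outside the model.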
I dunno why OAI is trying to compete in this space, tbh. They should be facilitating external o3-type solutions, i.e. build tools that make their platform the best one for building CoT on.
If I had to guess, it's because they hit a ceiling and want to show progress, and CoT solutions would do that.
Another possibility is they want to show demos of what is possible and encourage people to build these things. $3200 per task would be great revenue for them.
Though if the latter is the case, they should be more open about their research in this area.
Just showing the actual CoT, similar to what everyone else does, would be a start. But maybe they didn't do that because they didn't want people to know how poor their CoT models actually are.
yeah, the initial pic you posted (apples-to-apples comparison) had MTurk achieve 75% at ~$3, and o3 ("tuned", "high") achieve 88% at ~$3,200
I don't think an MTurk worker solves FrontierMath at $3 a problem.
That said, I don't expect to understand the FrontierMath set, so it's kinda useless as a benchmark to me. Trillions of dollars went into that physics frontier that investigated theoretical space noodles for 60 years based on "trust me bro".
Since OpenAI has been known (to me) to game benchmarks (I'm implying no malice here; it could have been a design target), I'm not super impressed by the announcement that a super-secret black box has solved 25% of some other incomprehensible black box.
I don't need a model for solving esoteric philosophical problems. I just need a model that knows when it's out of its element.
Some of the results of o3 sound impressive, but considering they're spending something like $3,200 USD per task on stuff like FrontierMath, I suspect you can achieve o3-level performance via most of the best models today, e.g. DeepSeek.
the second-place entry on the ARC-AGI benchmark spent $10k using Claude.
there are 800+ tests in the 400-task JSON evaluation benchmark.
even running local neural networks takes a good 30-60 minutes to train and run against. If o1 and o1 pro take 1-2 minutes per difficult task, imagine o3… maybe 10 minutes per task?
since the ARC-AGI pub event went on for 24h, I wouldn't say 10h to run o3 is an unreasonable amount of time to complete the benchmark
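Taking those numbers at face value, the back-of-envelope runtime lands right around that 10h figure (assuming ~400 evaluation tasks and the 1-2 minute per-task latency mentioned above; both are my assumptions, not published figures):

```python
tasks = 400              # assumed size of the public ARC-AGI evaluation set
minutes_per_task = 1.5   # midpoint of the 1-2 min o1-style latency above
total_hours = tasks * minutes_per_task / 60
print(total_hours)       # prints 10.0
```

So a 10-hour run fits comfortably inside a 24-hour event even without o3 being dramatically faster per task.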
very important: this is NOT reliable information, could very well be fake news:
ok, so this part is interesting. previous reply was in regards to the image.
I think it's an interesting take on why they didn't share their CoT. Anthropic was outperforming them in software dev, and CoT wasn't really out there. People are down to spend $200 for o1 pro because that is cheaper than using their own gpt-4 or gpt-4o CoT.
why not compete in it? and why even share it?
o3 is probably CoT + their new fine-tuning tool, which I'd imagine takes ~3x as long to respond since it has to gather information first: maybe it uses a search engine to gather the fine-tune data, or maybe generating that data is part of the CoT itself. Once the model is fine-tuned to that specific problem, it generates the output.
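Pure speculation, but the loop guessed at above would look something like this as a sketch; every function here is a hypothetical placeholder, not a real API:

```python
# Speculative sketch of the guessed pipeline: gather context, use CoT to
# synthesize fine-tune data, tune a per-problem model, then answer.
def solve_with_finetune(problem, search, synthesize, finetune, answer):
    context = search(problem)            # e.g. search-engine results
    data = synthesize(problem, context)  # CoT-generated training examples
    model = finetune(data)               # fine-tune to this specific problem
    return answer(model, problem)        # extra stages = ~3x the latency

# Usage with trivial stand-ins, just to show the data flow:
result = solve_with_finetune(
    "some task",
    search=lambda p: "context",
    synthesize=lambda p, c: [(p, c)],
    finetune=lambda d: ("tuned-model", d),
    answer=lambda m, p: f"answer to {p}",
)
print(result)  # prints "answer to some task"
```

Whether o3 actually does any of this is unknown; the point is just that each extra stage would explain the longer response times.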
I think being open about their CoT would not help them; they are competing against Google, which is arguably catching up quickly and even hiring OpenAI devs (check out @logankilpatrick being petty about it on twitter)
I think there are still lots of interesting ways to improve performance that aren't strictly about more high-quality data and more compute… but that's just my hot take on it
Gotta get dem sweet benchmarks. Investors need to see graph lines go vroom like stock market or get angry.
They sold the concept of AGI.
Although, I'm still a gpt-4o fanboy. It does most of my automation tasks.
I really don't like the o series. I imagine they would stick their nose in the air at seeing this. Yet it's only really good at destroying and rebuilding. Iterations be damned.
I think he is mocking the raise due to the fact that Google figured out chain of thought in 5 months. Alphabet's net income (after taxes) was around $100B in 2024. His official reason for leaving was that it didn't feel like a startup anymore, but his tweets indicate there's more to it.
The truth is that it's not that hard to build an LLM, but it's expensive to get the compute and the data you need.
I really don't like the o series. I imagine they would stick their nose in the air at seeing this. Yet it's only really good at destroying and rebuilding. Iterations be damned.
I do like o1 pro. It might take forever, but:
- it usually outperforms 4o
- if you use it every day, it's cheaper than running your own CoT with gpt-4 or gpt-4o
- no open-source LLM (or closed-source one) with CoT even comes close
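The "cheaper than rolling your own" claim holds up on a quick back-of-envelope; the prices and usage figures below are my assumptions, not quoted numbers:

```python
# Assumed API prices (USD per 1M tokens) and daily usage of a DIY CoT
# loop; a multi-step loop burns far more tokens than a single call.
in_price, out_price = 2.50, 10.00        # assumed gpt-4o-class pricing
calls_per_day = 150                      # CoT fan-out across steps/retries
in_tokens, out_tokens = 20_000, 1_000    # assumed tokens per call
cost_per_call = (in_tokens * in_price + out_tokens * out_price) / 1_000_000
monthly = 30 * calls_per_day * cost_per_call
print(round(monthly))                    # prints 270, already past the $200 plan
```

Under these assumptions a heavy daily CoT user clears the flat $200/month price, and the gap widens with every extra loop iteration.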
What I miss the most is the real time vision model.
It would also be terribly useful to have a per-minute cost estimate in the docs for the realtime and WebRTC APIs.
I think having a model loop according to how many instances you use per minute or hour would also be helpful for those that use swarm-like architectures.
I really hope o3 comes out on chatgpt without creating a new tier plan, like a "legend plan" with unlimited o3, above pro but costing $2k…
I was surprised that Google poached him, but big companies will often do stuff like that for reasons other than individual capability. I wonder if they, uhm, cringe every time he tweets like I do.
I remember thinking he needed to tone it down and get back to work while he was at OpenAI.
Maybe he's seeing the writing on the wall at Google, his time is limited, and he's thinking maybe he can somehow parlay this into his own company. I know you think it is satire, but I wouldn't overestimate him. Failing upwards…
The truth is that it's not that hard to build an LLM, but it's expensive to get the compute and the data you need.
Ahhh… Have you deep-dived into DeepSeek yet? Interested in your thoughts once you have.
no open-source LLM (or closed-source one) with CoT even comes close
Hmm. Not sure I agree. Boutique CoT can easily outperform it, e.g. SWE-bench above.
Honestly, MoE and boutique CoT are the way forward I think, unless/until there is a breakthrough on the base models. The reason you might have missed this is that you really have to dig to find the pearls amongst the mud.
An interesting example: x.com (all the cool kids use Sonnet these days, check OpenRouter)
I really hope o3 comes out on chatgpt without creating a new tier plan, like a "legend plan" with unlimited o3, above pro but costing $2k…
I dunno, I'm skeptical. I think the future of LLMs is breakthroughs, and the probability of a breakthrough happening at OpenAI is roughly 100% divided by the number of companies working on LLMs, imho.
I just dunno if there is a moat. DeepSeek really has upended everything. Maybe it's a scam, though; not sure. Some people have said it has poor CoT, which could be true, but still, it's pretty wild.
Also, watching OpenAI drift down in all the benchmarks speaks volumes I think. Not a lot of first mover advantage here. OAI kinda proved that by rugging Google.
Interesting. I did a bunch of ARC stuff manually and I didn't do so well. I'd like to see the details on MTurk versus o3.
Reading the chart in my OP, I see 32% for MTurk versus 88% for o3? If so, cheaper, sure, but you get what you pay for.
And yes, I think OpenAI is desperately gaming benchmarks in the worst ways. They are a startup burning cash insanely and such startups tend to do pretty sketchy things.
What's interesting, though, is I think there are a lot of sacrificial startups in this industry. DeepSeek, again, is a very curious example of this. One of their lead engineers got poached by a "big" company. What's the moat? It's almost like nobody cares; they just want to build AGI however they can, or at the very least force Google to make use of all those AI engineers they hired.
Now that I'm looking at it again, it just shows how far behind o1 actually is (look at the Kaggle SOTA). It's quite possible that o3 is also just some gpt-4 variant (or a combo of variants) under the hood, with a massive budget for solution-space exploration (see the other winners' papers, ARC Prize 2024).
well yeah, if the model gets it wrong, the developers take a look at the failures to see why. then they tune the model
and go again. and again, and again
and in the process of tuning on whatever failures remain, the developers discover problematic questions simply by virtue of them not being solvable. That's typically the point at which the benchmark starts becoming useless and obsolete.
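The tune-and-retry cycle described above can be sketched as a loop; `run_model` and `tune_on` are hypothetical stand-ins. The residue after enough rounds is exactly the set of problematic (or impossible) questions:

```python
# Sketch of the tune-and-retry loop: each round, tune on the failures
# and re-run. Whatever never flips to "solved" is either genuinely hard
# or a broken question, and once the developers have inspected every
# failure, the benchmark has effectively leaked into the model.

def tune_until_stuck(tasks, run_model, tune_on, max_rounds=5):
    failures = set(tasks)
    for _ in range(max_rounds):
        failures = {t for t in failures if not run_model(t)}
        if not failures:
            break
        tune_on(failures)  # developers inspect and tune on the misses
    return failures        # residue: candidate "problematic" questions
```

Run against a toy "model" that learns everything except one broken question, only the broken question survives as residue, which is exactly when the benchmark stops measuring anything useful.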