Is o3 just huge API spend?

Some of the results of o3 sound impressive, but considering they're spending something like $3,200 per task on things like FrontierMath, I suspect you can achieve o3-level performance with most of today's best models, e.g. DeepSeek.

If I had to guess, stuff like Strawberry and o3 are just CoT, and there have been no real breakthroughs at OAI here compared to what the general community can do.

In fact, I suspect there are CoT solutions out there that are far superior, because teams can build all of this without massive GPUs. Just point it at APIs.
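
To be concrete, here's a minimal sketch of the kind of thing I mean: a self-critique CoT loop built on nothing but a chat API. The model name and the prompts are placeholders I made up; the point is just that none of this needs your own GPUs.

```python
# Minimal sketch of a "boutique CoT" loop over a plain chat API.
# The model name and prompts are placeholders, not anything OpenAI ships.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(messages: list[dict]) -> str:
    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    return resp.choices[0].message.content

def solve_with_cot(task: str, max_rounds: int = 3) -> str:
    # Force an explicit reasoning trace first.
    draft = ask([
        {"role": "system", "content": "Think step by step, then give a final answer."},
        {"role": "user", "content": task},
    ])
    # Then let the model critique and revise its own trace a few times.
    for _ in range(max_rounds):
        critique = ask([
            {"role": "system", "content": "Find flaws in this reasoning. Reply OK if there are none."},
            {"role": "user", "content": draft},
        ])
        if critique.strip().upper().startswith("OK"):
            break
        draft = ask([{
            "role": "user",
            "content": f"Task: {task}\nDraft: {draft}\nCritique: {critique}\nRevise the draft.",
        }])
    return draft
```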

A good example of this is https://www.swebench.com/

Why isn't o1 beating out everything? It's because custom-built CoT agents (many of which use their own LLMs!) are far superior.

I dunno why OAI is trying to compete in this space, tbh. They should be facilitating external o3-type solutions, i.e., building tools that make their platform the best one for building CoT on.

If I had to guess, it's because they hit a ceiling, they want to show progress, and CoT solutions do that.

Another possibility is that they want to show demos of what is possible and encourage people to build these things. $3,200 per task would be great revenue for them.

Though if the latter is the case, they should be more open about their research in this area.

Just showing the actual CoT, similar to what everyone else is doing, would be a start. But maybe they didn't do that because they didn't want people to know how poor their CoT models actually are.

6 Likes

My understanding was that they spent that amount per task on solving simple picture puzzles.

2 Likes

Heh, well, ARC is not that simple :slight_smile: But they also solved FrontierMath, which is extremely high value for sure. x.com

Admittedly, it could be that the 25% they solved were the easiest split.

Still a huge improvement over the previous SOTA.

2 Likes

Yeah, the initial pic you posted (an apples-to-apples comparison) had an MTurk worker achieve 75% @ ~$3, and o3 ("tuned", "high") achieve 88% @ ~$3,200.

I don't think an MTurk worker solves FrontierMath at $3 a problem.

That said, I don't expect to understand the FrontierMath set, so it's kinda useless as a benchmark to me. Trillions of dollars went into that physics frontier that investigated theoretical space noodles for 60 years based on "trust me bro".

Since OpenAI has been known (to me) to game benchmarks (I'm implying no malice here; it could have been a design target), I'm not super impressed by the announcement that a super-secret black box has solved 25% of some other incomprehensible black box.

I don't need a model for solving esoteric philosophical problems. I just need a model that knows when it's out of its element.


3 Likes

Some of the results of o3 sound impressive, but considering they're spending something like $3,200 per task on things like FrontierMath, I suspect you can achieve o3-level performance with most of today's best models, e.g. DeepSeek.

The second-place entry on the ARC-AGI benchmark spent $10k using Claude.

There are 800+ individual tests across the 400 JSON tasks in the evaluation benchmark.

Even running local neural networks takes a good 30-60 minutes to train and run against it. If o1 and o1 pro take 1-2 minutes per difficult task, imagine o3… maybe 10 minutes per task?

Since the ARC-AGI pub event went on for 24h, I wouldn't say 10h to run o3 would be an unreasonable amount of time to complete the benchmark.
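
Back-of-envelope on that, where every number below is a guess pulled from this thread rather than anything official:

```python
# Back-of-envelope check on the 10h figure; all inputs are guesses from this thread.
tasks = 400            # JSON tasks in the evaluation set
minutes_per_task = 10  # guessed o3 latency per difficult task
parallel = 8           # assumed number of concurrent requests

wall_clock_hours = tasks * minutes_per_task / parallel / 60
print(f"{wall_clock_hours:.1f}h")  # ~8.3h, so 10h is plausible with some overhead
```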

Very important: this is NOT reliable information; it could very well be fake news:

Considering how much fake news and how many leaks surround LLMs, you never know what's true, and it's better to just wait until it's announced.

3 Likes

OK, so this part is interesting. My previous reply was in regard to the image.

I think it's an interesting take on why they didn't share their CoT. Anthropic was outperforming them in software dev, and CoT wasn't really out there. People are down to spend $200 for o1 pro because that's cheaper than running their own gpt-4 or gpt-4o CoT.

Why not compete in it? And why even share it?

o3 is probably CoT + their new fine-tuning tool, which I'd imagine would take 3x the time to respond since it has to gather information about the problem first: maybe using a search engine to gather the fine-tune data, or maybe generating the data to fine-tune on as part of the CoT. Once the model is fine-tuned to that specific problem, it generates the output.
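
Something like the sketch below, though to be clear this is pure speculation on my part, and every helper in it is a hypothetical stub standing in for machinery we can't see, not a real OpenAI API.

```python
# Pure speculation about the pipeline described above. Every helper here is a
# hypothetical stub, not a real OpenAI API.

def cot_explore(problem: str) -> str:
    # hypothetical: a CoT pass that reasons about the problem, maybe with search
    return f"reasoning notes about: {problem}"

def synthesize_examples(notes: str) -> list[tuple[str, str]]:
    # hypothetical: turn the reasoning trace into (input, output) training pairs
    return [(notes, "worked solution")]

def finetune(base_model: str, data: list[tuple[str, str]]):
    # hypothetical: fine-tune the base model on the per-problem synthetic data
    class Specialist:
        def generate(self, prompt: str) -> str:
            return f"[{base_model} tuned on {len(data)} examples] answer to: {prompt}"
    return Specialist()

def solve(problem: str) -> str:
    notes = cot_explore(problem)               # 1. CoT gathers/derives information
    examples = synthesize_examples(notes)      # 2. that output becomes fine-tune data
    specialist = finetune("gpt-4o", examples)  # 3. per-problem fine-tune
    return specialist.generate(problem)        # 4. the specialist answers

print(solve("some hard benchmark task"))
```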

I think being open about their CoT would not help them. They are competing against Google, which is arguably catching up quickly and even hiring OpenAI devs (check out @logankilpatrick being petty about it on Twitter).

I think there are still lots of interesting ways to improve performance that aren't strictly related to more high-quality data and more compute… but that's just my hot take on it.

4 Likes


Hmm, I don't get the reference. Though, tbh, I don't get many of his tweets.

As for catching up, OpenAI just isn't ranking much these days. LMSYS, OpenRouter: they are definitely falling by the wayside.

1 Like

Gotta get dem sweet benchmarks. Investors need to see graph lines go vroom like the stock market or they get angry.

They sold the concept of AGI.

Although, I'm still a gpt-4o fanboy. It does most of my automation tasks.

I really don't like the o series. I imagine they would stick their noses in the air at seeing this. Yet it's only really good at destroying and rebuilding. Iterations be damned.

4 Likes

Slightly off point, but o1 was Strawberry. Is o3 still the Strawberry series? Clearly it's an o-zone layer. ^^

2 Likes

Hmm, I don't get the reference.


Reference: https://openai.com/index/scale-the-benefits-of-ai/

I think he is mocking the raise, given that Google figured out chain of thought in 5 months. Alphabet's net income (after taxes) was around $100B in 2024. His official reason for leaving was that it didn't feel like a startup anymore, but his tweets indicate there's more to it.

The truth is that it's not that hard to build an LLM, but it's expensive to get the compute and the data you need.

2 Likes

I really don't like the o series. I imagine they would stick their noses in the air at seeing this. Yet it's only really good at destroying and rebuilding. Iterations be damned.

I do like o1 pro. It might take forever, but:

  1. it usually outperforms 4o
  2. if you use it every day, it's cheaper than running your own CoT with gpt-4 or gpt-4o
  3. no open-source LLM (or closed-source one) with CoT even comes close
  • What I miss the most is the realtime vision model.

  • It would also be terribly useful if the docs gave a cost prediction per minute for Realtime and WebRTC (see the sketch after this list).

  • I think having a model loop according to how many instances you use per minute or hour would also be helpful for those that use swarm-like architectures.
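
On the cost-prediction point above, this is the kind of calculator I mean. Every number in it is a made-up placeholder, not an actual OpenAI rate; you'd substitute the real token rates and prices from the pricing page.

```python
# The kind of per-minute cost estimate I'd like the docs to give for Realtime/WebRTC.
# All numbers below are made-up placeholders, not actual OpenAI rates.
AUDIO_TOKENS_PER_MIN_IN = 800   # placeholder: input audio tokens per minute
AUDIO_TOKENS_PER_MIN_OUT = 400  # placeholder: output audio tokens per minute
PRICE_IN_PER_1K = 0.10          # placeholder: USD per 1k input tokens
PRICE_OUT_PER_1K = 0.20         # placeholder: USD per 1k output tokens

def realtime_cost_per_minute() -> float:
    return (AUDIO_TOKENS_PER_MIN_IN / 1000 * PRICE_IN_PER_1K
            + AUDIO_TOKENS_PER_MIN_OUT / 1000 * PRICE_OUT_PER_1K)

print(f"${realtime_cost_per_minute():.2f}/min")  # $0.16/min with these placeholders
```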

I really hope o3 comes out on ChatGPT without creating a new tier, like a "legend plan" with unlimited o3 that sits above Pro but costs $2k…

2 Likes

I was surprised that Google poached him, but big companies often do stuff like that for reasons other than individual capability. I wonder if they, uhm, cringe every time he tweets like I do.

I remember thinking he needed to tone it down and get back to work while he was at OpenAI.

Maybe he's seeing the writing on the wall at Google, knows his time is limited, and is thinking he can somehow parlay this into his own company. I know you think it is satire, but I wouldn't overestimate him. Failing upwards…

1 Like

The truth is that it's not that hard to build an LLM, but it's expensive to get the compute and the data you need.

Ahhh… Have you deep-dived into DeepSeek yet? Interested in your thoughts once you have.

no open-source LLM (or closed-source one) with CoT even comes close

Hmm. Not sure I agree. Boutique CoT can easily outperform it, e.g. SWE-bench above.

Honestly, I think MoE and boutique CoT are the way forward, unless/until there is a breakthrough on the base models. The reason you might have missed this is that you really have to dig to find the pearls amongst the mud.

An interesting example: x.com (all the cool kids use Sonnet these days, check OpenRouter)

I really hope o3 comes out on ChatGPT without creating a new tier, like a "legend plan" with unlimited o3 that sits above Pro but costs $2k…

I dunno, I'm skeptical. I think the future of LLMs is breakthroughs, and the probability of a breakthrough happening at OpenAI is roughly 100% divided by the number of companies working on LLMs, imho.

I just dunno if there is a moat. DeepSeek really has upended everything. Maybe it's a scam, though, not sure. Some people have said it has poor CoT, which could be true, but still, it's pretty wild.

Also, watching OpenAI drift down in all the benchmarks speaks volumes, I think. There's not a lot of first-mover advantage here; OAI kinda proved that by rugging Google.

1 Like

Interesting. I did a bunch of ARC stuff manually and I didn't do so well. I'd like to see the details on MTurk versus o3.

Reading the chart in my OP, I see 32% for MTurk versus 88% for o3? If so, cheaper, sure, but you get what you pay for.

And yes, I think OpenAI is desperately gaming benchmarks in the worst ways. They are a startup burning cash at an insane rate, and such startups tend to do pretty sketchy things.

What's interesting, though, is that I think there are a lot of sacrificial startups in this industry. DeepSeek, again, is a very, very curious example of this. One of their lead engineers got poached by a 'big' company. What's the moat? It's almost like nobody cares; they just want to build AGI however they can, or at the very least force Google to make use of all those AI engineers they hired.

1 Like

The original is easier to read: the average MTurker can get slightly above 75% here.

Now that I'm looking at it again, it just shows how far behind o1 actually is (look at the Kaggle SOTA). It's quite possible that o3 is also just some gpt-4 variant (or a combo of variants) under the hood, with a massive budget for solution-space exploration (see the other winners' papers at ARC Prize 2024).

soo, a big fat nothin :hamburger: ? :thinking: :frowning_face:

1 Like

Kaggle comps are overfitting monstrosities, so I wouldn't get too excited about their SOTA.

88% versus 75% is still a big deal. Eking out those last few percentage points gets exponentially harder on every benchmark.
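
To put numbers on "a big deal": going from 75% to 88% more than halves the remaining error rate.

```python
# Remaining-error framing of the chart numbers above.
mturk_err = 1 - 0.75  # 0.25 error rate for the MTurk baseline
o3_err = 1 - 0.88     # ~0.12 error rate for o3
print(mturk_err / o3_err)  # ~2.1x fewer errors
```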

Also, I've heard some people saying that the AI often finds "correct" answers in the eval sets that are actually wrong.

1 Like

Well yeah, if the model gets it wrong, the developers take a look at the failures to see why. Then they tune the model and go again. And again, and again.

And in the process of tuning for the failures that remain, the developers discover problematic questions simply by virtue of them not being solvable. That's typically the point at which the benchmark starts becoming useless and obsolete.
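
In pseudocode the grind looks something like this; run_eval and tune are hypothetical stand-ins for whatever harness a lab actually uses.

```python
# Sketch of the tune-on-failures loop described above. run_eval and tune are
# hypothetical stand-ins, with toy stubs so the sketch actually runs.
import random

def run_eval(model: float, case: int) -> bool:
    return random.random() < model  # stub: "model" is just a pass probability

def tune(model: float, failures: list[int]) -> float:
    return min(1.0, model + 0.05 * len(failures) / 100)  # stub: tuning nudges it up

def benchmark_grind(model: float, eval_set: list[int], rounds: int = 5) -> float:
    for _ in range(rounds):
        failures = [case for case in eval_set if not run_eval(model, case)]
        if not failures:
            break
        # Developers inspect the failures here; the ones that never become solvable
        # expose problematic questions, and that's when the benchmark goes stale.
        model = tune(model, failures)
    return model

print(benchmark_grind(0.75, list(range(400))))
```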

1 Like

Is AI chasing AGI and big numbers on benchmarks just a modern version of the tulip bulb craze?

I feel it's a bit of both.

2 Likes

That's a pretty good analogy, maybe :thinking:

The beauty is that if you know, you can choose to not let it affect you.
:popcorn:

1 Like