Some of the results of o3 sound impressive, but considering they're spending something like $3,200 USD per task on stuff like FrontierMath, I suspect you can achieve o3-level performance via most of the best models today, e.g. DeepSeek.
If I had to guess, stuff like Strawberry and o3 is just CoT, and there have been no real breakthroughs at OAI here compared to what the general community can do.
In fact, I suspect there are CoT solutions out there that are far superior, because teams can build all this without massive GPUs. Just point it at APIs.
Why isn't o1 beating out everything? Because custom-built CoT agents (many of which use their own LLMs!) are far superior.
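To illustrate the "just point it at APIs" idea: here's a toy sketch of the kind of DIY CoT loop people build on top of a chat endpoint. `call_llm` is a hypothetical stand-in for whatever API client you use; nothing here is an actual OpenAI interface.

```python
# Toy chain-of-thought loop: draft an answer, ask the model to critique
# it, and stop once the critique says it's fine. `call_llm` is a
# stand-in for any chat-completions API call.

def cot_agent(question, call_llm, max_rounds=3):
    transcript = [f"Question: {question}", "Think step by step."]
    answer = None
    for _ in range(max_rounds):
        answer = call_llm("\n".join(transcript))
        transcript.append(f"Draft answer: {answer}")
        verdict = call_llm(
            "\n".join(transcript) + "\nIs this answer correct? Reply OK or REVISE."
        )
        if verdict.strip().startswith("OK"):
            break
        transcript.append("Critique: " + verdict)
    return answer

# Usage with a fake model that demands one revision before accepting:
replies = iter(["4", "REVISE: show your work", "2 + 2 = 4", "OK"])
print(cot_agent("What is 2 + 2?", lambda prompt: next(replies)))
# prints "2 + 2 = 4"
```

The same loop works against any provider's API, which is the whole point: the orchestration lives outside the model.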
I dunno why OAI is trying to compete in this space, tbh. They should be facilitating external o3-type solutions, i.e. build tools that make their platform the best one for building CoT on.
If I had to guess, it's because they hit a ceiling and want to show progress, and CoT solutions would do that.
Another possibility is they want to show demos of what is possible and encourage people to build these things. $3200 per task would be great revenue for them.
Though if the latter is the case, they should be more open about their research in this area.
Just showing the actual CoT, similar to what everyone else does, would be a start. But maybe they didn't do that because they didn't want people to know how poor their CoT models actually are.
yeah, the initial pic you posted (apples-to-apples comparison) had MTurk achieve 75% at ~$3, and o3 ("tuned", "high") achieve 88% at ~$3,200
I don't think an MTurk worker solves FrontierMath at $3 a problem.
That said, I don't expect to understand the FrontierMath set, so it's kinda useless as a benchmark to me. Trillions of dollars went into that physics frontier that investigated theoretical space noodles for 60 years based on "trust me bro".
Since OpenAI has been known (to me) to game benchmarks (I'm implying no malice here; it could have been a design target), I'm not super impressed by the announcement that a super-secret black box has solved 25% of some other incomprehensible black box.
I don't need a model for solving esoteric philosophical problems. I just need a model that knows when it's out of its element.
Some of the results of o3 sound impressive, but considering they're spending something like $3,200 USD per task on stuff like FrontierMath, I suspect you can achieve o3-level performance via most of the best models today, e.g. DeepSeek.
the second-place entry on the ARC-AGI benchmark spent $10k using Claude.
there are 800+ tests in the 400-task JSON evaluation benchmark.
even running local neural networks takes a good 30-60 minutes to train and run against. If o1 and o1 pro take 1-2 minutes per difficult task, imagine o3… maybe 10 minutes per task?
since the ARC-AGI pub event went on for 24h, I wouldn't say 10h to run o3 is an unreasonable amount of time to complete the benchmark
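Taking those numbers at face value, the back-of-envelope runtime lands right around that 10h figure (assuming ~400 evaluation tasks and the 1-2 minute per-task latency mentioned above; both are my assumptions, not published figures):

```python
tasks = 400              # assumed size of the public ARC-AGI evaluation set
minutes_per_task = 1.5   # midpoint of the 1-2 min o1-style latency above
total_hours = tasks * minutes_per_task / 60
print(total_hours)       # prints 10.0
```

So a 10-hour run fits comfortably inside a 24-hour event even without o3 being dramatically faster per task.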
very important: this is NOT reliable information, could very well be fake news:
ok, so this part is interesting. previous reply was in regards to the image.
I think it's an interesting take on why they didn't share their CoT. Anthropic was outperforming them in software dev, and CoT wasn't really out there. People are down to spend $200 for o1 pro because that is cheaper than using their own gpt-4 or gpt-4o CoT.
why not compete in it? and why even share it?
o3 is probably CoT + their new fine-tuning tool, which I'd imagine takes ~3x as long to respond since it has to gather information first: maybe it uses a search engine to gather the fine-tune data, or maybe generating that data is part of the CoT itself. Once the model is fine-tuned to that specific problem, it generates the output.
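Pure speculation, but the loop guessed at above would look something like this as a sketch; every function here is a hypothetical placeholder, not a real API:

```python
# Speculative sketch of the guessed pipeline: gather context, use CoT to
# synthesize fine-tune data, tune a per-problem model, then answer.
def solve_with_finetune(problem, search, synthesize, finetune, answer):
    context = search(problem)            # e.g. search-engine results
    data = synthesize(problem, context)  # CoT-generated training examples
    model = finetune(data)               # fine-tune to this specific problem
    return answer(model, problem)        # extra stages = ~3x the latency

# Usage with trivial stand-ins, just to show the data flow:
result = solve_with_finetune(
    "some task",
    search=lambda p: "context",
    synthesize=lambda p, c: [(p, c)],
    finetune=lambda d: ("tuned-model", d),
    answer=lambda m, p: f"answer to {p}",
)
print(result)  # prints "answer to some task"
```

Whether o3 actually does any of this is unknown; the point is just that each extra stage would explain the longer response times.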
I think being open about their CoT would not help them; they are competing against Google, which is arguably catching up quickly and even hiring OpenAI devs (check out @logankilpatrick being petty about it on twitter)
I think there are still lots of interesting ways to improve performance that aren't strictly about more high-quality data and more compute… but that's just my hot take on it
Gotta get dem sweet benchmarks. Investors need to see graph lines go vroom like stock market or get angry.
They sold the concept of AGI.
Although, I'm still a gpt-4o fanboy. It does most of my automation tasks.
I really don't like the o series. I imagine they would stick their nose in the air at seeing this. Yet it's only really good at destroying and rebuilding. Iterations be damned.
I think he is mocking the raise due to the fact that Google figured out chain of thought in 5 months. Alphabet's net income (after taxes) was around $100B in 2024. His official reason for leaving was that it didn't feel like a startup anymore, but his tweets indicate there's more to it.
The truth is that it's not that hard to build an LLM, but it's expensive to get the compute and the data you need.
I really don't like the o series. I imagine they would stick their nose in the air at seeing this. Yet it's only really good at destroying and rebuilding. Iterations be damned.
I do like o1 pro. It might take forever, but:
- it usually outperforms 4o
- if you use it every day, it's cheaper than running your own CoT with gpt-4 or gpt-4o
- no open-source LLM (or closed-source one) with CoT even comes close
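The "cheaper than rolling your own" claim holds up on a quick back-of-envelope; the prices and usage figures below are my assumptions, not quoted numbers:

```python
# Assumed API prices (USD per 1M tokens) and daily usage of a DIY CoT
# loop; a multi-step loop burns far more tokens than a single call.
in_price, out_price = 2.50, 10.00        # assumed gpt-4o-class pricing
calls_per_day = 150                      # CoT fan-out across steps/retries
in_tokens, out_tokens = 20_000, 1_000    # assumed tokens per call
cost_per_call = (in_tokens * in_price + out_tokens * out_price) / 1_000_000
monthly = 30 * calls_per_day * cost_per_call
print(round(monthly))                    # prints 270, already past the $200 plan
```

Under these assumptions a heavy daily CoT user clears the flat $200/month price, and the gap widens with every extra loop iteration.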
What I miss the most is the real time vision model.
It would also be terribly useful to have a per-minute cost estimate in the docs for the realtime and WebRTC APIs.
I think having a model loop according to how many instances you use per minute or hour would also be helpful for those that use swarm-like architectures.
I really hope o3 comes out on chatgpt without creating a new tier plan, like a "legend plan" with unlimited o3, above pro but costing $2k…
I was surprised that Google poached him, but big companies will often do stuff like that for reasons other than individual capability. I wonder if they, uhm, cringe every time he tweets like I do.
I remember thinking he needed to tone it down and get back to work while he was at OpenAI.
Maybe he's seeing the writing on the wall at Google, his time is limited, and he's thinking maybe he can somehow parlay this into his own company. I know you think it is satire, but I wouldn't overestimate him. Failing upwards…
The truth is that it's not that hard to build an LLM, but it's expensive to get the compute and the data you need.
Ahhh… Have you deep-dived into DeepSeek yet? Interested in your thoughts once you have.
no open-source LLM (or closed-source one) with CoT even comes close
Hmm. Not sure I agree. Boutique CoT can easily outperform it, e.g. SWE-bench above.
Honestly, MoE and boutique CoT are the way forward I think, unless/until there is a breakthrough on the base models. The reason you might have missed this is that you really have to dig to find the pearls amongst the mud.
An interesting example: x.com (all the cool kids use Sonnet these days, check OpenRouter)
I really hope o3 comes out on chatgpt without creating a new tier plan, like a "legend plan" with unlimited o3, above pro but costing $2k…
I dunno, I'm skeptical. I think the future of LLMs is breakthroughs, and the probability of a breakthrough happening at OpenAI is roughly 100% divided by the number of companies working on LLMs, imho.
I just dunno if there is a moat. DeepSeek really has upended everything. Maybe it's a scam, though; not sure. Some people have said it has poor CoT, which could be true, but still, it's pretty wild.
Also, watching OpenAI drift down in all the benchmarks speaks volumes I think. Not a lot of first mover advantage here. OAI kinda proved that by rugging Google.
Interesting. I did a bunch of ARC stuff manually and I didn't do so well. I'd like to see the details on MTurk versus o3.
Reading the chart in my OP, I see 32% for MTurk versus 88% for o3? If so, cheaper, sure, but you get what you pay for.
And yes, I think OpenAI is desperately gaming benchmarks in the worst ways. They are a startup burning cash insanely and such startups tend to do pretty sketchy things.
What's interesting, though, is I think there are a lot of sacrificial startups in this industry. DeepSeek, again, is a very curious example of this. One of their lead engineers got poached by a "big" company. What's the moat? It's almost like nobody cares; they just want to build AGI however they can, or at the very least force Google to make use of all those AI engineers they hired.
Now that I'm looking at it again, it just shows how far behind o1 actually is (look at the Kaggle SOTA). It's quite possible that o3 is also just some gpt-4 variant (or a combo of variants) under the hood, with a massive budget for solution-space exploration (see the other winners' papers, ARC Prize 2024).
well yeah, if the model gets it wrong, the developers take a look at the failures to see why. then they tune the model
and go again. and again, and again
and in the process of tuning on whatever failures remain, the developers discover problematic questions simply by virtue of them not being solvable. That's typically the point at which the benchmark starts becoming useless and obsolete.
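The tune-and-retry cycle described above can be sketched as a loop; `run_model` and `tune_on` are hypothetical stand-ins. The residue after enough rounds is exactly the set of problematic (or impossible) questions:

```python
# Sketch of the tune-and-retry loop: each round, tune on the failures
# and re-run. Whatever never flips to "solved" is either genuinely hard
# or a broken question, and once the developers have inspected every
# failure, the benchmark has effectively leaked into the model.

def tune_until_stuck(tasks, run_model, tune_on, max_rounds=5):
    failures = set(tasks)
    for _ in range(max_rounds):
        failures = {t for t in failures if not run_model(t)}
        if not failures:
            break
        tune_on(failures)  # developers inspect and tune on the misses
    return failures        # residue: candidate "problematic" questions
```

Run against a toy "model" that learns everything except one broken question, only the broken question survives as residue, which is exactly when the benchmark stops measuring anything useful.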