Medium Post: Grounding LLMs - Part 1

Yes, but what we’re doing is very different. The way our algorithm approaches reasoning is radically different: we’re able to split reasoning into phases. I run the earlier phases, which see the bulk of the tokens, using gpt-4o-mini, and the later phases, which do the bulk of the actual reasoning, using gpt-4o. The result is that you can apply gpt-4o-level reasoning to up to 10 million tokens (we’ve successfully done 12 million), but 9.5 million of those tokens will be processed by gpt-4o-mini, so the average cost for that 10-million-token prompt is only around $4, including both input and output tokens. And there’s no significant quality loss: you would have gotten the same basic answer running all 10 million tokens through gpt-4o, but it would have cost you closer to $50.
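The phase split described above can be sketched roughly as a map/reduce: the cheap model distills the bulk of the context chunk by chunk, and only the distilled notes ever reach the expensive model. This is a minimal sketch, not the author's actual algorithm; `call_model` is a stub standing in for a real chat-completion call.

```python
# Sketch of the two-phase idea: cheap model sees the bulk of the tokens,
# expensive model reasons only over the distilled output. The phase
# structure is the point here, not the API details.

def call_model(model: str, prompt: str) -> str:
    # Stub: a real implementation would call the OpenAI API here.
    return f"[{model} output for {len(prompt)} chars]"

def phased_answer(
    chunks: list[str],
    question: str,
    cheap: str = "gpt-4o-mini",
    expensive: str = "gpt-4o",
) -> str:
    # Phase 1: the cheap model processes every chunk of the big context.
    notes = [
        call_model(cheap, f"Extract facts relevant to '{question}':\n{c}")
        for c in chunks
    ]
    # Phase 2: the expensive model reasons over the much smaller digest.
    digest = "\n".join(notes)
    return call_model(expensive, f"Using these notes, answer '{question}':\n{digest}")
```

With 9.5 Mtok of chunks and a digest well under 0.5 Mtok, almost all input tokens are billed at the cheap model's rate.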


I 100% get that what you’re doing is different. I was bringing this up as more of a complementary idea.

E.g., if there are certain conditions where you’re currently using 4o in your later reasoning stages but 4o-mini would suffice, those later, more expensive stages could be dynamically routed to the cheap or expensive model as necessary.
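The per-call routing decision might look something like this sketch. The difficulty heuristic here is a placeholder of my own invention; a system like RouteLLM trains an actual router model to make this decision instead.

```python
# Hypothetical per-call router: downgrade a late-stage prompt to the
# cheap model unless it looks like it needs heavyweight reasoning.
def pick_model(prompt: str, threshold: float = 0.5) -> str:
    # Placeholder heuristic: count "reasoning" cue words. A learned
    # router (e.g., RouteLLM-style) would replace this entirely.
    cues = ("why", "prove", "compare", "trade-off", "derive")
    score = sum(prompt.lower().count(c) for c in cues) / len(cues)
    return "gpt-4o" if score >= threshold else "gpt-4o-mini"

pick_model("Summarize this meeting transcript.")         # → 'gpt-4o-mini'
pick_model("Compare the designs and prove why A wins.")  # → 'gpt-4o'
```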

Depending on the scale at which you’re going to ultimately be operating and how frequently you might be able to offload the expensive reasoning stages to the cheaper model, I imagine you might be able to drive that $4 for 10 Mtok down to $3.75 or $3.50.
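For what it's worth, the blended-cost arithmetic checks out as a back-of-envelope calculation. The prices below are assumed mid-2024 list prices for input tokens, not figures from this thread.

```python
# Back-of-envelope check of the blended-cost claim. Assumed prices
# (per million input tokens) — these are assumptions, not quoted figures:
MINI_PER_MTOK = 0.15   # gpt-4o-mini input
GPT4O_PER_MTOK = 5.00  # gpt-4o input

def blended_cost(total_mtok: float, expensive_mtok: float) -> float:
    """Cost when only `expensive_mtok` of `total_mtok` hits gpt-4o."""
    cheap_mtok = total_mtok - expensive_mtok
    return cheap_mtok * MINI_PER_MTOK + expensive_mtok * GPT4O_PER_MTOK

print(blended_cost(10.0, 0.5))  # roughly $3.93 — in line with "around $4"
print(10.0 * GPT4O_PER_MTOK)    # 50.0 — everything through gpt-4o
```

Shifting even more of the late stages onto the cheap model moves the total toward the $3.50 range mentioned above.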

That said, RouteLLM is very niche and new, and certainly wouldn’t be without cost to replicate, extend, fine-tune, or run for your specific use case, so I’m not suggesting you try to implement it any time soon. I was just checking whether it was on your radar yet, because I foresee a time when a much more mature router model could optimally route a message and its context to any number of possible future models.

For instance, maybe some early stages could occasionally be handled by even cheaper commodity models like the Phi series, or would be substantially better handled by slightly more expensive models like an imagined future 3.5 Haiku, which could have some profoundly positive downstream effects.

Anyway, it sounds like it’s already on your radar.

:+1:


Hey, could you share the process of creating this blend of gpt-4o and gpt-4o-mini? I’m processing millions of tokens as well, and it’s costing me too much.