I 100% get that what you’re doing is different. I was bringing this up as more of a complementary idea.
E.g., if there are conditions where you’re currently using 4o in your later reasoning stages but 4o-mini would suffice, those more expensive stages could be dynamically routed to the cheap or expensive model as needed.
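To make that concrete, here’s a minimal sketch of what threshold-based routing could look like. The `estimate_difficulty` scorer here is a crude heuristic stand-in I made up, not RouteLLM’s actual API; a real router would be a learned model scoring the prompt and context:

```python
def estimate_difficulty(prompt: str) -> float:
    """Placeholder router: returns a difficulty score in [0, 1].
    A real router (RouteLLM-style) would be a learned model."""
    # Crude heuristic stand-in: longer, question-dense prompts score higher.
    length_signal = min(len(prompt) / 4000, 1.0)
    question_signal = min(prompt.count("?") / 5, 1.0)
    return 0.5 * length_signal + 0.5 * question_signal

def pick_model(prompt: str, threshold: float = 0.6) -> str:
    """Route easy stages to the cheap model, hard ones to the expensive one."""
    score = estimate_difficulty(prompt)
    return "gpt-4o" if score >= threshold else "gpt-4o-mini"

print(pick_model("Summarize this paragraph in one sentence."))  # -> gpt-4o-mini
```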
Depending on the scale you’ll ultimately be operating at, and how often you can offload the expensive reasoning stages to the cheaper model, I imagine you might be able to drive that $4 per 10 Mtok down to $3.75 or $3.50.
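Back-of-the-envelope, with every parameter below being an assumption I’m making up for illustration (a quarter of spend sitting in routable stages, the cheap model costing ~3% as much per token), offloading 25–50% of that traffic lands right around those figures:

```python
# Toy cost model: fraction `f` of the routable expensive-stage spend gets
# offloaded to a model that costs `ratio` times as much per token.
# All numbers below are illustrative assumptions, not real pricing.

baseline = 4.00         # current spend per 10 Mtok, from the figure above
routable_share = 0.25   # assumed fraction of spend in routable stages
ratio = 0.03            # assumed cheap-model cost relative to the expensive one

for f in (0.25, 0.5):   # fraction of routable traffic actually offloaded
    saving = baseline * routable_share * f * (1 - ratio)
    print(f"offload {f:.0%}: ${baseline - saving:.2f} per 10 Mtok")
# offload 25%: $3.76 per 10 Mtok
# offload 50%: $3.52 per 10 Mtok
```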
That said, RouteLLM is very niche and new, and it certainly wouldn’t be free to replicate, extend, fine-tune, or run for your specific use case, so I’m not suggesting you try to implement it any time soon. I was just checking whether it was on your radar yet, because I foresee a time when a much more mature router model could optimally route a message and its context to any number of future models.
For instance, some early stages might occasionally be handled by even cheaper commodity models like the Phi series, or be handled substantially better by slightly more expensive models like an imagined future 3.5 Haiku, which could have profoundly positive downstream effects.
Anyway, it sounds like it’s already on your radar.