Feedback for OpenAI Developers on Automatic Model Selection

Dear OpenAI Development Team,

I would like to share an important observation regarding the user experience when selecting AI models. Currently, the system requires users to manually choose a model before asking a question. This approach can be impractical and inefficient, especially for those who are unfamiliar with the specific differences between models.

Suggested Improvement

It would be more effective to implement an automatic model selection system based on the type of question asked. This could be achieved by:

Allowing the AI to analyze the question and automatically determine the most suitable model.

Providing an option where users can let the AI choose the best model without manual selection.

Offering recommendations when typing a question, such as:
“GPT-4o has been selected as it is the most suitable for your query.”

This enhancement would significantly improve the user experience, particularly for those who may not know which model best fits their needs. It would make the interaction smoother and more intuitive.

Thank you for your continuous efforts in advancing AI technology!

Best regards,

That would be interesting. But which model is “most suitable” is also very subjective to determine.
What comes to mind is a pipeline where you first ask gpt-4o to determine whether the prompt is about STEM, which benefits the most from reasoning models, and depending on the answer it decides which model to go for.
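
A minimal sketch of that classify-then-route pipeline, assuming the OpenAI Python SDK; the model names and the single STEM/GENERAL label are illustrative choices, not a recommendation:

```python
# Sketch: classify the prompt with a cheap model, then route it.
# Model names and the STEM-only heuristic are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def route_and_answer(prompt: str) -> str:
    # Step 1: ask a cheap model to classify the prompt.
    classification = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Answer with exactly one word, STEM or GENERAL, "
                        "describing the user's question."},
            {"role": "user", "content": prompt},
        ],
    ).choices[0].message.content.strip().upper()

    # Step 2: route STEM questions to a reasoning model, everything else
    # to the cheaper general-purpose model.
    model = "o1" if classification == "STEM" else "gpt-4o-mini"

    answer = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    return f"[{model} was selected for this query]\n{answer}"

print(route_and_answer("What are atoms made of?"))
```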

TL;DR: This idea is basically premature optimization. Literally everything should use “GPT-4o mini”, except for complex questions (which most people are not capable of asking). How would you feel if I said your question is not worthy of being looked at by a “smart LLM”? That is why it does not exist.

First of all, there is such a thing as RouteLLM, which does pretty much exactly that and more, also encompassing models from other providers. The reason to use it is cost optimization.
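
For reference, a minimal usage sketch along the lines of RouteLLM's README; the router name ("mf"), the threshold value, and the strong/weak model choices here are illustrative and would need calibrating against your own traffic:

```python
# Sketch of RouteLLM's drop-in controller, loosely following the project's
# README. The threshold 0.11593 is the README's example value, not a
# universally correct setting; calibrate it for your own query mix.
import os
from routellm.controller import Controller

os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder

client = Controller(
    routers=["mf"],            # matrix-factorization router
    strong_model="gpt-4o",
    weak_model="gpt-4o-mini",
)

response = client.chat.completions.create(
    # "router-mf-0.11593" asks RouteLLM to use the MF router with the
    # given cost threshold to decide between strong and weak model.
    model="router-mf-0.11593",
    messages=[{"role": "user", "content": "What are atoms made of?"}],
)
print(response.choices[0].message.content)
```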

The answer to ‘What are atoms made of?’ will vary significantly depending on educational context, as scientific concepts are taught through progressively refined models. Early education often introduces simplified frameworks (e.g., protons/neutrons/electrons as indivisible particles) to align with students’ cognitive development, reserving deeper complexities like quantum chromodynamics, quark-gluon interactions, and probabilistic electron orbitals for advanced study. This pedagogical scaffolding reflects not deception, but the necessity of building foundational intuition before confronting counterintuitive truths inherent to quantum physics and particle science. Critically, even modern ‘complete’ explanations remain provisional, as our understanding evolves with ongoing research into subatomic structure.

If a more educated person asked the same question, it could look like: “Under the Standard Model of particle physics, how do quantum chromodynamics (QCD) and quantum electrodynamics (QED) collectively describe the substructure of atoms, including emergent phenomena like confinement, asymptotic freedom, and the role of virtual particles in mediating interactions between quarks, gluons, and electrons?” - note that they likely also know about String Theory, Loop Quantum Gravity, Grand Unified Theories (GUTs), Supersymmetry, Emergent Gravity (like Verlinde’s), Holographic Principle, Preon Models, Quantum Foundations (like Pilot-Wave), and Digital Physics or Panpsychism, but they are more explicit in what they want.

Let's say you try all the models; the quality of the answers will differ. Your goal is to get the best answer, but best by what metric: speed, accuracy, references to source material, cost, interpretability, scalability, falsifiability, novelty, creativity, or something else? Note also that each model has its own parameters. Even if you say it is for academic research, and from that imply that accuracy, references, and falsifiability are most important and that anything below the standard of peer-reviewed particle physics, which prefers QCD predictions validated by lattice simulations, can be ignored (as it is more like pseudoscience for fantasy writers), most of the LLM training data is not up to that standard. So it will hallucinate believable results.

So let's say you have a question that needs a “complex” and accurate response, and you know the training data likely does not cover it. Well, deep research will likely read up on several papers and do a lot of the manual labor for you, and this can create “synthetic wisdom” using RAG-style feedback learning.
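
A rough sketch of that RAG-style loop, where `search_papers` is a hypothetical stand-in for whatever retrieval backend (vector store, arXiv search, internal corpus) actually supplies the source material:

```python
# Sketch of the RAG-style idea above: retrieve source material first, then
# let the model synthesize an answer grounded in it. `search_papers` is a
# hypothetical placeholder, not a real library function.
from openai import OpenAI

client = OpenAI()

def search_papers(query: str, k: int = 5) -> list[str]:
    """Hypothetical retrieval step; replace with a real search backend."""
    raise NotImplementedError

def deep_answer(question: str) -> str:
    snippets = search_papers(question)
    context = "\n\n".join(f"[{i+1}] {s}" for i, s in enumerate(snippets))
    return client.chat.completions.create(
        model="o1",  # assumption: a reasoning model for the synthesis step
        messages=[{
            "role": "user",
            "content": (
                "Answer the question using only the excerpts below and "
                "cite them by number. Say so if they are insufficient.\n\n"
                f"Excerpts:\n{context}\n\nQuestion: {question}"
            ),
        }],
    ).choices[0].message.content
```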

Hard pill to swallow: Modern LLMs are engineered to handle most mainstream queries effortlessly - not because users lack intelligence, but because the majority of human inquiries operate within predictable domains (factual recall, basic reasoning, templated workflows). The true frontier lies in specialized interrogation: questions demanding domain expertise, multi-step synthesis, or adversarial testing of a model’s latent knowledge.

My advice: for most practical applications, start with streamlined ‘mini’ models - they optimize for cost-efficiency and speed while retaining sufficient performance for routine tasks. Reserve heavyweight models only for edge cases requiring deep reasoning or niche expertise.
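
As a sketch of that “mini first” approach, one could let the cheap model hand the prompt off only when it flags the question itself; the self-reported ESCALATE marker below is a heuristic assumption, not a reliable confidence signal, so swap in your own escalation criterion:

```python
# Sketch: answer with the mini model by default, escalate to a heavyweight
# model only when the mini model flags the question as out of its depth.
from openai import OpenAI

client = OpenAI()

def answer(prompt: str) -> str:
    draft = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Answer the question. If it needs deep multi-step "
                        "reasoning or niche expertise, reply only with the "
                        "single word ESCALATE."},
            {"role": "user", "content": prompt},
        ],
    ).choices[0].message.content

    if draft.strip() == "ESCALATE":
        # Edge case: hand the prompt to the heavyweight model.
        draft = client.chat.completions.create(
            model="o1",  # assumption: whichever heavyweight model you prefer
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
    return draft
```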

Engineer Complaint: “Mini models fail my coding questions!”
Actual Workflow: ⌘C → ⌘V → “Fix this” → Complain about hallucinations.

Models can generate optimized SQL/indexed solutions if properly constrained (e.g., "Write a PostgreSQL query for X, optimize for read-heavy OLAP, use BRIN indexes"). Code-specific models will (usually) outperform generalist LLMs on coding tasks. There are benchmarking sites that actually measure how good a model is at answering questions within a specific domain.
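
To make the contrast concrete, here is a sketch of sending that kind of constrained prompt instead of a bare “Fix this”; the table and column names are placeholders, and the point is the explicit constraints (engine, workload, index type), not the schema itself:

```python
# Sketch: a constrained coding prompt versus the lazy "Fix this" pattern.
# Schema details below are made-up placeholders for illustration.
from openai import OpenAI

client = OpenAI()

lazy_prompt = "Fix this"  # pasted code, no constraints -> invites hallucination

constrained_prompt = (
    "Write a PostgreSQL query that aggregates daily revenue from "
    "orders(order_ts timestamptz, amount numeric). Optimize for a "
    "read-heavy OLAP workload and assume a BRIN index on order_ts; "
    "include the CREATE INDEX statement."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": constrained_prompt}],
)
print(response.choices[0].message.content)
```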

And there are various others.