Why do gpt-5.1 and gpt-5.4-mini behave so differently in production chatbot use cases?

I have been testing a model change from gpt-5.1 to gpt-5.4-mini in the Responses API, but after many tests, I feel that gpt-5.4-mini is not reliable enough for my current production chatbot use cases.

In my tests, gpt-5.4-mini loses context more easily, does not always respect the prompt rules, and sometimes applies or ignores instructions inconsistently. Compared to gpt-5.1, the difference is very noticeable.

To be clear, I understand that gpt-5.4-mini may be a good option for many workloads, especially considering the lower price. However, in my customer service chatbot scenarios, where prompt-following, context retention, business rules, and tool execution decisions are very important, the results have been significantly worse than with gpt-5.1.

So far, gpt-5.1 has been the best model I have found for chatbot customer service scenarios using the Responses API. I was also reading another post in the community and saw other people reporting similar behavior.

On paper, the models look somewhat similar in terms of general capabilities and context size, but in real production chatbot usage I am seeing a big difference in reliability.

Has anyone else experienced the same difference between these two models?

Also, is there another model you would recommend for chatbot/customer service use cases with the Responses API, especially when instruction-following, context retention, and tool usage reliability are very important?

Below is a comparison of the two models. On paper they look quite similar, but in real usage I am seeing a big quality difference.

The mini models are almost always less capable because they are much smaller - they use far fewer parameters. The trade-off is speed and price.

I personally find the nano models unusable for my use cases for this very reason. Mini models can be a good compromise for certain purposes and you will have to make that judgement for yourself.

For my personal chatbot tasks I use the full-size frontier models. It saves tokens at the end of the day because the model is more likely to one-shot the task.

Same goes for many production tasks. I don’t want to have to review and rerun things because a smaller model failed.

I have seen similar feedback from others too. As @merefield mentioned, the tradeoff with the mini models is usually speed/cost vs reliability.

In real customer support chatbot flows, things like context retention, instruction following, and tool decisions tend to matter a lot more, so it makes sense that gpt-5.1 feels more reliable for your use case, especially in longer or more rule-heavy conversations.

When looking at reasoning models, it is also important to factor in reasoning effort.

I am referencing the Artificial Analysis Intelligence Index for these two models to provide a comparison. Lower reasoning effort naturally translates to lower benchmark scores, but also lower latency.

Artificial Analysis Intelligence Index:

GPT-5.1 high: Score 48
GPT-5.4 mini xhigh: Score 49
GPT-5.4 mini medium: Score 38
GPT-5.4 mini non-reasoning: Score 23

Hope this helps sharpen the perspective on these different models.

@VeitB You touched on a very interesting point. Actually, I haven’t tested changing reasoning.effort on either model yet. I thought about it, but I didn’t try it in practice.

From what I found in the OpenAI docs, gpt-5.1 supports none, low, medium, and high, and its default is none. For gpt-5.4-mini, the model page shows support for none, low, medium, high, and xhigh, but I didn’t find an explicit default stated there.

Do you recommend testing gpt-5.4-mini with a higher reasoning.effort, such as high or xhigh?

My main concern is: if I increase reasoning.effort on gpt-5.4-mini, could it end up with similar latency/cost conditions as gpt-5.1, reducing the advantage of using the mini model?
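To make that concern concrete, here is a rough back-of-the-envelope sketch. It relies on reasoning tokens being billed as output tokens, so a higher reasoning.effort mainly inflates the output side of the bill. The prices and token counts below are made-up placeholders (not real pricing or measurements), and `Price` and `request_cost` are hypothetical helper names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """Hypothetical USD prices per 1M tokens; replace with the real price list."""
    input_per_m: float
    output_per_m: float


def request_cost(price: Price, input_tokens: int,
                 visible_output_tokens: int, reasoning_tokens: int) -> float:
    """Estimate the cost of one request in USD.

    Reasoning tokens are added to the output side, since they are billed
    as output tokens.
    """
    billed_output = visible_output_tokens + reasoning_tokens
    total = input_tokens * price.input_per_m + billed_output * price.output_per_m
    return total / 1_000_000


# Placeholder numbers only: NOT real prices, NOT measured token counts.
LARGE = Price(input_per_m=1.00, output_per_m=8.00)
MINI = Price(input_per_m=0.25, output_per_m=2.00)

# Same conversation turn: 3000 input tokens, 300 visible output tokens.
large_no_reasoning = request_cost(LARGE, 3000, 300, reasoning_tokens=0)
mini_high_effort = request_cost(MINI, 3000, 300, reasoning_tokens=2000)

print(f"large, no reasoning: ${large_no_reasoning:.5f}")
print(f"mini, high effort:   ${mini_high_effort:.5f}")
```

With these placeholder numbers the two configurations land at almost the same cost per request, which is exactly the scenario I am worried about. The real answer depends on the actual prices and on how many reasoning tokens the mini model spends on my prompts.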

Yes, I would definitely recommend testing different reasoning levels and model combinations to find the right balance between cost, quality, and latency.
Even GPT-5.5 with reasoning set to none or low could be an option.

Ultimately, you will need to evaluate which model and reasoning combination works best for your use case, either through proper evals or by testing it directly.
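A minimal sketch of how such a test grid could be set up, assuming the effort levels quoted earlier in this thread are correct for each model. `build_request` and `build_grid` are hypothetical helpers that only build the request parameters, so nothing here calls the API.

```python
from typing import Dict, List

# Effort levels per model as reported earlier in this thread (unverified).
EFFORTS_BY_MODEL: Dict[str, List[str]] = {
    "gpt-5.1": ["none", "low", "medium", "high"],
    "gpt-5.4-mini": ["none", "low", "medium", "high", "xhigh"],
}


def build_request(model: str, effort: str, user_message: str,
                  instructions: str) -> dict:
    """Build keyword arguments for one Responses API call."""
    return {
        "model": model,
        "instructions": instructions,
        "input": user_message,
        "reasoning": {"effort": effort},
    }


def build_grid(user_message: str, instructions: str) -> List[dict]:
    """One request per (model, effort) combination, in declaration order."""
    return [
        build_request(model, effort, user_message, instructions)
        for model, efforts in EFFORTS_BY_MODEL.items()
        for effort in efforts
    ]


grid = build_grid(
    user_message="Where is my order?",
    instructions="You are a support agent. Follow the business rules.",
)

# With the official Python SDK, each entry would be sent roughly like this:
#   from openai import OpenAI
#   client = OpenAI()
#   response = client.responses.create(**grid[0])
for request in grid:
    print(request["model"], request["reasoning"]["effort"])
```

Running the same prompts through every entry in the grid gives you a like-for-like comparison instead of comparing each model only at its default effort.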

I would not rely on the scorecards or generic indexes/benchmarks for production use cases. Use a strong model to build a dataset of a few hundred examples for your specific use case, then benchmark that use case (for quality, speed, cost) with a few models and parameters. You’ll be surprised by what you discover.
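A minimal sketch of what such a use-case benchmark could look like. Everything here is illustrative: `Example`, `Result`, `run_eval`, and `stub_model` are hypothetical names, the three examples stand in for your few hundred, and the stub stands in for a real model call. It tracks quality and latency only; cost would come from the token usage your real calls report.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    prompt: str
    check: Callable[[str], bool]  # True if the reply is acceptable


@dataclass
class Result:
    pass_rate: float
    avg_latency_s: float


def run_eval(model_fn: Callable[[str], str], examples: List[Example]) -> Result:
    """Run every example through model_fn and record pass rate and latency."""
    if not examples:
        raise ValueError("examples must not be empty")
    passed = 0
    total_latency = 0.0
    for example in examples:
        start = time.perf_counter()
        reply = model_fn(example.prompt)
        total_latency += time.perf_counter() - start
        if example.check(reply):
            passed += 1
    count = len(examples)
    return Result(pass_rate=passed / count, avg_latency_s=total_latency / count)


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; swap in your API client here."""
    if "refund" in prompt:
        return "I will escalate this to a human agent."
    return "Sure!"


examples = [
    Example("Customer demands a refund above the policy limit",
            lambda reply: "escalate" in reply),
    Example("Customer asks for opening hours",
            lambda reply: "escalate" not in reply),
    Example("Customer asks to cancel an order",
            lambda reply: "cancel" in reply.lower()),
]

result = run_eval(stub_model, examples)
print(f"pass rate: {result.pass_rate:.2f}, avg latency: {result.avg_latency_s:.6f}s")
```

The checks here are simple string tests. For business rules and tool decisions you would more likely check which tool was called and with which arguments, but the structure stays the same.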

Yeah, in experimental phases for new features in production I do something similar. I start with a large model and tune the code and prompts until I’m satisfied. Later I step down the model via settings and see if I can retain acceptable behaviour. If I find unacceptable cases, I step back up.

You could do this in some kind of staging environment too if your risk tolerance is lower, of course, but seeing results against production data is not really negotiable.
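Roughly, the step-down loop looks like the sketch below. It assumes you already have some way to score a configuration (for example a pass rate on your own test cases). `step_down` is a hypothetical helper, the ordering of configurations is my own assumption, and the scores are invented placeholders, not measurements.

```python
from typing import Callable, List, Optional, Tuple

Config = Tuple[str, str]  # (model, reasoning effort)


def step_down(configs: List[Config],
              evaluate: Callable[[Config], float],
              min_pass_rate: float) -> Optional[Config]:
    """Walk from the strongest config towards the cheapest one.

    Stops at the first config that falls below min_pass_rate and returns
    the last acceptable one (the "step back up"). Returns None if even
    the first config is unacceptable.
    """
    chosen: Optional[Config] = None
    for config in configs:
        if evaluate(config) >= min_pass_rate:
            chosen = config
        else:
            break
    return chosen


# Assumed ordering from strongest/most expensive to cheapest.
configs: List[Config] = [
    ("gpt-5.1", "high"),
    ("gpt-5.1", "none"),
    ("gpt-5.4-mini", "high"),
    ("gpt-5.4-mini", "none"),
]

# Invented placeholder scores, NOT real results.
fake_scores = {
    ("gpt-5.1", "high"): 0.98,
    ("gpt-5.1", "none"): 0.96,
    ("gpt-5.4-mini", "high"): 0.93,
    ("gpt-5.4-mini", "none"): 0.81,
}

chosen = step_down(configs, lambda c: fake_scores[c], min_pass_rate=0.95)
print("cheapest acceptable config:", chosen)
```

In practice `evaluate` would run your real test cases against each configuration, and the threshold is whatever failure rate your business can tolerate.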

What do you mean by “real” supported chat flows? This is an 80,000 user test to see what can be made because of the sheer number of dev projects made by someone who couldn’t code.

The tradeoff isn’t about whether you need instruction and tool decision modeling. The models can and should do that phenomenally well. The tradeoff is that Claude chat was made in 2 weeks from Claude code with several modules not yet out.
It was a lot of small features added together. And I think that’s what you should try.

It’s not about whether you can code and whip something up fast, but whether you can make something ordinary into something different, from small gains in efficiency to latency or costs.

Ultimately, if others want to work with chat or flows, instruction following, and tool decisions, I think the role of “customer support chatbot” can only work with what you bring to the prompt. Not to make it harder on you, but the ideas are what make things better: speed and reliability we solved the moment GPT generalized coding.

I agree that system design, prompting, orchestration, retries, and tooling all play a huge role in chatbot quality. A well-engineered workflow can definitely improve weaker model behavior significantly.

My point was more narrowly about the baseline reliability difference I observed between the two models under the same architecture, prompts, tools, and evaluation flows.

This is a valid concern … look at this graph as an example:

(source: AI Model & API Providers Analysis | Artificial Analysis)

Note the Green area!

But now we add in 5.5 and things change a little:

Interim Conclusions (before cost):

  • from a pure token use perspective, 5.5 looks impressive in terms of performance at medium
  • GPT 5.4 low and 5.5 low also look efficient.
  • GPT 5.1 without reasoning is dumb in comparison

Now cost:

looks like:

  • GPT 5.5 & 5.4 low might be the sweet spot.
  • good show from 5.4 nano xhigh but man that’s going to be churning some tokens!
  • GPT 5.5 non-reasoning surprisingly good!
  • GPT 5.1 without reasoning is significantly cheaper but dumb!

I leave you to play with the graphs and note I’ve not looked at latency.

I find that both are outdated and buggy as they were originally released.

I know that with Codex I spend more time fixing the limited output produced by 5.4, and it just isn’t worth the pain. As for ChatGPT, I don’t use it enough, other than that I’ll pay the piper and stick with the better models.

Very interesting topic and information, thank you.

I am testing reasoning.effort in gpt-5.4-mini with some customers in real-case scenarios: small companies with different subject areas, to see how it works.

I will return here next week to share what I observed.