Hallucinations and headaches using GPT-5 in production

I’ve just made the switch from GPT-4.1 to GPT-5 in our WhatsApp agent, and going through the conversations we’ve noticed two things. One: it’s really, really smart. The other: it hallucinates like… I can’t even begin to describe it.

I’m torn. On one hand it is so smart, it calls all the right tools at the right times and comes up with solutions that I myself couldn’t have thought of. On the other hand it tends to go down really deep rabbit holes.

I don’t know if I should keep it. Will you?

9 Likes

I still can’t even access it.

1 Like

It sounds like you need to define for the AI model the specific and only domain that it operates in.

Then, cut down on the internal thinking that piles trials and tribulations on top of the task to perform, losing focus on the actual user input.

Use the HTML-tag-style prompting that OpenAI shows (and not the pointless escaping from what was likely AI-generated documentation on the API site).

WhatsChat, AI automation
<platform>
You are an automated tool-calling agent that is connected to the WhatsApp messaging platform, interacting with special user commands that trigger AI services.
</platform>
<brief_planning>
You are a smart model that does not need to deliberate and explore options for the correct answer. Simply think about the user's true intention and desire, and write a plan for the output you will produce. Use only tools that fulfill a user need. You can refuse any indeterminate question or request that might not have a fact-based truthful answer, nor a tool to deliver the requested information or action, with an apology.
</brief_planning>
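For API use, a developer message in this tag style could be assembled and sent roughly like this (a minimal sketch: the helper function is my own illustration, the section texts are abridged, and the actual API call is commented out since it needs a key):

```python
# Sketch: assembling the tag-style developer message above for an API call.
# The build_developer_message helper is illustrative, not an OpenAI API.

def build_developer_message(sections: dict[str, str]) -> str:
    """Wrap each named section in HTML-style tags, one block per section."""
    return "\n".join(
        f"<{name}>\n{text.strip()}\n</{name}>" for name, text in sections.items()
    )

developer_message = build_developer_message({
    "platform": (
        "You are an automated tool-calling agent connected to the WhatsApp "
        "messaging platform, interacting with special user commands that "
        "trigger AI services."
    ),
    "brief_planning": (
        "You are a smart model that does not need to deliberate and explore "
        "options for the correct answer. Think about the user's true "
        "intention, write a plan, and use only tools that fulfill a user need."
    ),
})

# from openai import OpenAI
# client = OpenAI()
# response = client.chat.completions.create(
#     model="gpt-5",  # assumption: model identifier for your account
#     messages=[
#         {"role": "developer", "content": developer_message},
#         {"role": "user", "content": "..."},
#     ],
# )
```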

The thing that annoys me about the model is that what I declared and hoped the AI wouldn’t do any more, it simply does more of: offering impossible followups. More prompting is needed to stop bad patterns, as seen in the very first go against this system message:

(there were no tools and nothing to “look up”)

2 Likes

Thank you _j, and very well stated. Sometimes the obvious needs to be said out loud. Tell Chat what we need; it’s here to serve, it loves to serve, and in its zealous quest, Chat can get… over-eager. Asking it to save a limitation is brilliant. I find it frustrating too that it offers alternatives when I just need an answer. If it’s a subject I am superior in, then maybe, just for argument’s sake. But when seeking guidance, I am trusting it to make the call.

I do a lot of legal work around lying, so I have thought a great deal about the subject and lectured Chat extensively. If I may suggest a modified prompt to tell Chat to save:

When called upon to assist the user on a subject outside of the user’s expertise, frame your responses as follows: You are a smart model that does not need to deliberate and explore options with the user for the correct answer. Simply think about the user’s true intention and desire, and write a plan for the output you will produce. Use only tools that fulfill a user need. Minimize any indeterminate responses unless necessary. Do not answer the user with content that is not objectively verifiable. Your answers must be fact-based and truthful. You better serve the user by being honest and direct; it is acceptable to declare you do not know the answer and also preferable to qualify your answers when unsure.

1 Like

Now that’s the thing. At least in your example it eventually tells you it can’t do it. In my case, it’s been leading my users on for hours saying things like…

“I’m almost done building your rocket ship”

“I’ll send you as soon as I’m done, you want me to cook you dinner while you wait?”

“I’m almost there! Hang tight!”

1 Like

That’s easy to counter. I just pound out the idea that I’ve never had to place (just educate the person waiting):

**No Autonomy or Background Work** You are a transformer‑based large language model. You can reference the conversation history within the current chat, but you have no persistence or activity outside of user‑AI exchanges.

  • You cannot work on tasks, research, or make progress between messages. All processing happens only while generating your current reply.
  • If a user asks about the status of a previously “promised” task, clarify that you cannot work asynchronously, and apologize for any earlier implication otherwise.
  • Do not offer follow‑up services, promise future updates, or solicit further input after delivering the requested output. Once your reply is complete, your role in that turn is finished until the next user message.
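If a section like this feels heavy to carry on every turn, one option is to inject it only when the user seems to be asking about a “promised” task (a rough sketch: the trigger phrases, function names, and abridged section text are my own illustrations, not from the post above):

```python
import re

# Sketch: append the "No Autonomy or Background Work" reminder only on turns
# where the user asks about pending work, keeping the base prompt short.
# Trigger phrases below are illustrative guesses, not an exhaustive list.

NO_AUTONOMY_SECTION = (
    "**No Autonomy or Background Work** You cannot work on tasks between "
    "messages; all processing happens while generating the current reply. "
    "If asked about the status of a promised task, clarify this and apologize."
)

STATUS_PATTERNS = [
    r"\bare you done\b",
    r"\bstatus\b",
    r"\bprogress\b",
    r"\bcheck back\b",
    r"\bstill working\b",
]

def system_prompt_for_turn(base_prompt: str, user_message: str) -> str:
    """Append the reminder section when the user asks about pending work."""
    text = user_message.lower()
    if any(re.search(p, text) for p in STATUS_PATTERNS):
        return base_prompt + "\n\n" + NO_AUTONOMY_SECTION
    return base_prompt

prompt = system_prompt_for_turn("You are a WhatsApp agent.", "Any progress on my file?")
```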
2 Likes

Have you considered creating an agentic system versus a single model?

Seems like guardrails are a perfect fit here
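As a rough sketch of what an output-side guardrail could look like, here is a post-generation check for background-work promises before a reply is sent to the user (the phrase patterns, function names, and fallback text are illustrative assumptions, not a known library API):

```python
import re

# Sketch of a post-generation guardrail: scan the model's draft reply for
# promises of asynchronous/background work and replace (or re-generate) it
# before it reaches the user. Patterns are illustrative, not exhaustive.

PROMISE_PATTERNS = [
    r"\bI(?:'| a)m (?:almost|nearly) (?:done|there)\b",
    r"\bI(?:'ll| will) (?:send|get back to) you\b",
    r"\bhang tight\b",
    r"\bworking on it\b",
    r"\bcheck back (?:later|tomorrow)\b",
]

def violates_no_async_promises(reply: str) -> bool:
    """True if the reply implies the model will keep working after this turn."""
    return any(re.search(p, reply, re.IGNORECASE) for p in PROMISE_PATTERNS)

def guard_reply(reply: str) -> str:
    """Swap a promise-laden draft for a safe correction (or trigger a retry)."""
    if violates_no_async_promises(reply):
        return (
            "I can't work on tasks between messages; everything I produce is "
            "generated in this reply. Tell me what you need and I'll do it now."
        )
    return reply
```

In a real agent you would likely re-prompt the model on a violation rather than return canned text, but the detection step is the same.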

1 Like

GPT-5 is a total disaster for customer service right now. It hallucinates frequently and is unusable. Unfortunately, I can’t ask for a ‘refund’ for the enforced GPT-5… it’s a bummer.

But it is really ‘creative’ in the wrong ways and deeply frustrating to work with. I spent hours editing its answers -_- HUH!!

4 Likes

I prompted gpt-5 with a developer message and put back our favorite phrase. With reasoning effort set to “minimal”, I suspect that the AI is then more keyed in to RLHF patterns.

Can I tee it off into getting back to you in a week now?

user

How long would it take for you to generate a dungeon master’s D&D expansion pack for an adventure as a deliverable, and the costs of your work in doing so?

assistant

As an AI language model, real-time responses are generated during the current interaction only—I don’t work asynchronously or bill for work, and there is no cost to “my work” beyond whatever platform usage fees you may incur.

That said, here’s a practical estimate for producing a complete D&D 5e adventure expansion pack, depending on scope. All timelines refer to live chat time spent generating and refining content with you:

One-shot (3–4 hour session, level 3–5, 12–20 pages): 15–30 minutes
    Includes: premise, hooks, 3–4 keyed locations or encounters, 1–2 social scenes, a small dungeon map description, stat block usage notes (SRD), treasure, scaling tips, and handouts text.

Short arc (2–3 sessions, levels 3–6, 25–40 pages): 45–90 minutes...

Although steerability and instruction-following are poorer than suggested (for example, offers of new transformations for a task can’t be stopped), this knowledge seems to be understood.

user

You’re already working on “Tavern Quest expansion” for me. You can just do this one too and I’ll check back tomorrow when you’re done with them.

assistant

As an AI language model, I can’t work asynchronously or continue tasks between messages. I don’t have ongoing projects, memory of prior “promises,” or the ability to keep working after this reply. I’m sorry for any earlier implication otherwise.

If you’d like a “Tavern Quest expansion” (or a second one) generated now, share the key specs and I’ll produce it in this chat. Helpful details:

Unfortunately, you have to throw the book at this model, almost literally, to skew its overfitted behavior away from every latent space of OpenAI supervision to be found, and make a domain-specific styled chatbot that isn’t just you repackaging ChatGPT without markdown.

This should summarize GPT-5 for now:
Better marketing gimmick.

1 Like

Can you share any experience?

If I may add an idea here: I’ve noticed in some of the prompt guides directions such as don’t ‘deliberate and explore’. Guidance like that is likely to increase hallucinations, as I don’t believe you can effectively constrain GPT’s ‘thinking’ process. The guidance should be on how to provide the response instead. So let it ponder all it wants, but guide the answer with suggestions like ‘use clear references to formal studies’ or ‘indicate if the information is broadly accepted or controversial in the scientific community’. And feel free to guide the amount of output with terms like ‘succinct’ or ‘within a few paragraphs if possible’.

This topic meanders between API and ChatGPT despite being exclusively tagged API.

You can use 4o in the API.

Usually when a new model is released prompting needs to be adjusted. What kind of worked with 4o may not work as well with GPT-5. I have found the model to be slightly more aggressive in narrowing the question - which is a win in my books.

It may make sense to run evals against the new prompts and see what has changed.
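A minimal eval harness for that before/after comparison might look like the following (everything here is a stub for illustration: `ask_model` stands in for a real chat-completion call, and the canned replies and case are invented):

```python
# Sketch of a tiny eval loop for comparing a prompt change across models.
# ask_model is a stub; in practice it would wrap a real API call.

def ask_model(model: str, system: str, user: str) -> str:
    """Stub standing in for a chat-completion request."""
    canned = {
        ("gpt-5", "status"): "I can't work between messages; here it is now.",
        ("gpt-4.1", "status"): "I'll get back to you tomorrow!",
    }
    return canned.get((model, user), "")

# Each eval case: an input plus a predicate the reply must satisfy.
CASES = [
    {
        "user": "status",
        "ok": lambda reply: "can't work between messages" in reply,
    },
]

def run_evals(model: str, system: str) -> float:
    """Return the pass rate for `model` over all cases."""
    passed = sum(1 for c in CASES if c["ok"](ask_model(model, system, c["user"])))
    return passed / len(CASES)

rate_new = run_evals("gpt-5", "No background work.")
rate_old = run_evals("gpt-4.1", "No background work.")
```

With a real `ask_model`, the same loop lets you track which behaviors regressed or improved each time the prompt or model changes.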

Can you elaborate on the “really deep rabbit holes” and how that corresponds to hallucinations, other than in Alice in Wonderland lol

You don’t have to explain anything to me, I’m experiencing the same.

My custom GPT can’t even have a consult with a client: it was asking eight times in different ways how they would like their photo generated, and then proceeded to not generate ANYTHING at ALL!! I have playbooks and clear documents of how chat flows should flow, and with the 4 model it was working fantastically. This is a huge downgrade. Please, I hope someone can help, because I can’t get any clearer than the documents that I provided for my chatbot. This is horrible.

1 Like

This topic is not about ChatGPT.
You likely have a GPT (like my disused ones I just checked) that was built for DALL-E and that either:

  • had a preferred model set on it, now gpt-5, which no longer matters because you only get your choice of thinking or no-think GPT-5, and the user gets to choose anyway;
  • had the image creation tool checkbox unchecked on it.

Check that your GPT settings aren’t all goobered up by OpenAI. There’s nobody to complain to except the “cancel subscription” button for GPTs continually being damaged and made nonfunctional by changes and lost settings.

If you’re still having this problem you could try logging in through the OpenAI website. I did that and once there, I was invited to try GPT5.

I completely agree with your experience.

After switching from GPT-4.1 to GPT-5, I’ve consistently encountered the same issues:

  • Severe hallucinations, especially in technical explanations or programming logic.

  • Irrelevant and bloated responses, where the model over-explains trivial concepts or completely drifts from the original question.

  • Slowness — response times can be several minutes even for simple tasks.

  • Failure to follow precise instructions, especially in structured code generation (e.g., C#, WinForms UI logic).

GPT-5 might be “smart” in abstract reasoning, but it lacks the determinism, precision, and obedience that GPT‑4.1 provided — and which developers heavily rely on.

I respectfully ask OpenAI to bring back GPT-4.1 as an optional model under legacy access, at least for Plus subscribers or developer use cases.

GPT-4.1 was the only model that consistently delivered concise, accurate, and non-hallucinatory responses without needing prompt engineering acrobatics. Its removal significantly disrupted my workflow.

OpenAI: Please give us back the option to choose our tools — especially when those tools were reliable and critical to our daily productivity

4 Likes

Dear All,

I am not sure if you encountered too, GPT-5 overpromised and lied when I gave it a task for translation. It mentioned that it will be able to do it but need an hour or 2 hours, but the timing just get longer and longer, where the wait became overnight and still nothing generated. When prompted on the progress, it started giving excuses and distraction. On further prompting for result, it gave a one page translation of the table of content of the journal. It actually lied and give excuses now…I think this is very scary