GPT-4o - Hallucinating at temp:0 - Unusable in production

Hi folks,

We’re finding that 4o hallucinates fake information in quite strict summarisation prompts that work perfectly in gpt-4-turbo and gpt-3.5-turbo-0125.

NB: we’re running at temperature = 0, in case that’s relevant.

Our use case is summarising a JSON object of documents (including content, dates, and local URLs). 4o invents fake URLs, and possibly invents facts too; we’re checking that.

The hallucination is consistent.
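
For context, the call looks roughly like this (a minimal sketch with placeholder documents and a naive URL check, not our actual prompt or data):

```python
import json
from openai import OpenAI  # openai>=1.0 style client

client = OpenAI()

# Placeholder documents standing in for our real data (content, dates, local URLs).
documents = [
    {"content": "Quarterly report...", "date": "2024-03-01", "url": "/docs/q1-report"},
    {"content": "Board meeting minutes...", "date": "2024-04-15", "url": "/docs/board-minutes"},
]

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=[
        {
            "role": "system",
            "content": "Summarise the documents. Only reference URLs that appear in the input JSON.",
        },
        {"role": "user", "content": json.dumps(documents)},
    ],
)
summary = response.choices[0].message.content

# Naive check: flag any URL-like token in the summary that isn't in the source data.
known_urls = {d["url"] for d in documents}
suspects = [t.strip(".,()") for t in summary.split() if t.startswith("/docs/")]
invented = [u for u in suspects if u not in known_urls]
print(summary)
print("Possibly invented URLs:", invented)
```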

Very happy to share the full prompt (or our API debug patterns) internally. Not super-happy to copy and paste here tho.

Is anyone else seeing reliability issues?

3 Likes

I personally have not seen hallucinations so far, but based on limited testing I have observed different model behaviour in response to instructions relative to previous models, which in all fairness is not entirely unexpected.

So you might need to play around with the prompt a little to achieve the same outcome.

Would you be able to share an example? I haven’t experienced this so far with gpt-4o.

I am having a similar issue with GPT-4o when returning a JSON object at temperature 0. GPT-4 can format it properly every time, but GPT-4o will eventually start sending back responses that seem to ignore the instructions. I am a fairly new developer, so I am not great at explaining this, but I wanted to confirm the issue. I will stick with GPT-4 for now.

1 Like

I can’t find the thread now, but if I remember right, temp 0 allows OpenAI to select the “appropriate” temperature, which might be something other than 0… Personally, I have often had issues with temp 0, so I usually set it to a small value like 0.01 instead.
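
Something like this (a sketch; the prompt is just a placeholder):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.01,  # a small non-zero value instead of exactly 0
    messages=[{"role": "user", "content": "your prompt here"}],  # placeholder
)
print(response.choices[0].message.content)
```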

2 Likes

Example prompt:

Maybe this is only a bug in the search algorithm, but the prices are always imaginary.

1 Like

Tested 4o this morning on a prompt sequence that 3.5 consistently returns good content for. First 2 attempts with 4o returned text that devolved into gibberish. Gave up, assuming it’s not ready for production. Didn’t alter temperature but maybe I’ll go back and try that.

1 Like

My best guess is that certain types of browsing activity akin to web scraping, in this case price scraping, are not supported (as in not allowed by ChatGPT). So what it does instead is identify the product names from the information in the URL, perform a regular search, and pick the prices from one of those sources.

Also experiencing hallucinations with 4o in a RAG chatbot. 3.5 performs accurately enough, but 4o was too unreliable in our dev environment, so we rolled back to 3.5 without promoting 4o to higher environments. Some use cases of the bot are answering domain questions, suggesting actions, reviewing user work, writing scripts, and explaining contextual objects like errors or activity. In all of these cases, 4o consistently responds with overly verbose output (even when explicitly told to be succinct) that contains hallucinations in a significant proportion of responses, well over 50% of the time.

I have been using these models for a few years now, and the fact that they struggle to simply say they’re unsure about something, and instead respond confidently with a hallucination, is beyond me!

#BringBackSky

I’ve been noticing more errors too. For example, given a set of 9 letters (Countdown-style) and asked to find the longest English word, 4o will often use a letter twice or include a letter that isn’t in the set. One time it even made up a word.

This is significantly more likely if, say, you provide a word and ask 4o to find a longer one, but no valid word longer than the provided word can actually be made from the set of letters.
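
If you want to check the answers automatically, it’s just a letter-count comparison against the 9-letter set, something like this (sketch with made-up letters and words):

```python
from collections import Counter

def can_be_formed(word: str, letters: str) -> bool:
    """True if `word` can be spelled from `letters` without reusing any letter."""
    available = Counter(letters.lower())
    needed = Counter(word.lower())
    return all(available[ch] >= n for ch, n in needed.items())

# Made-up example: a Countdown-style set of 9 letters and model answers to validate.
letters = "rtelabnos"
print(can_be_formed("bloaters", letters))   # True  - uses each letter at most once
print(can_be_formed("tolerable", letters))  # False - reuses 'l' and 'e'
```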

Sometimes it helps to tell the model what not to do, so:

I would add to your prompt: do not use prior knowledge, do not search additional websites for cheaper prices, only use the data from the link provided.
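
If you’re doing this through the API, roughly like this (just a sketch; the product URL is a placeholder and the final fallback instruction is my own illustrative addition):

```python
from openai import OpenAI

client = OpenAI()

system_prompt = (
    "Do not use prior knowledge.\n"
    "Do not search additional websites for cheaper prices.\n"
    "Only use the data from the link provided.\n"
    # Extra guardrail (an addition, not part of the suggestion above):
    "If the price is not present in the provided data, say you could not find it."
)

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=[
        {"role": "system", "content": system_prompt},
        # Placeholder product URL for illustration only.
        {"role": "user", "content": "What is the price of the product at https://example.com/product/123 ?"},
    ],
)
print(response.choices[0].message.content)
```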

We tested GPT-4o on 100 questions in a German RAG pipeline. While GPT-4 answered reliably whenever the right chunks were presented, GPT-4o came back with the default answer “chunk not found” in >30% of the cases where GPT-4 produced a satisfying answer. We will not recommend it to our customers at the moment!
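
The evaluation itself just counts how often each model falls back to the default answer, roughly like this (simplified sketch; run_rag_pipeline and questions.json are placeholders for our internal setup):

```python
import json

DEFAULT_ANSWER = "chunk not found"

def run_rag_pipeline(model: str, question: str) -> str:
    """Placeholder for our internal retrieval + generation step."""
    raise NotImplementedError("plug in your own RAG pipeline here")

def default_answer_rate(model: str, questions: list[str]) -> float:
    """Fraction of questions where the model fell back to the default answer."""
    misses = sum(DEFAULT_ANSWER in run_rag_pipeline(model, q).lower() for q in questions)
    return misses / len(questions)

# questions.json is a placeholder for our file of 100 German test questions.
questions = json.load(open("questions.json", encoding="utf-8"))
for model in ("gpt-4", "gpt-4o"):
    print(model, f"{default_answer_rate(model, questions):.0%} default answers")
```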

1 Like

Have you considered using 3.5?

We do…

The rationality of the output prose is marginally stronger with 4o, which provides advantages… but the unreliability rules out what would otherwise be an upgrade.

So what are you complaining about? Use the 10x cheaper model that works better. Try and remember everyone on this forum is still like miles ahead of the general public on AI. You don’t need 4o to compete with your product. It’s still super early in the game, remember that

Thanks. I’m not complaining - I’m reporting an issue (in an attempt to be helpful to the development team).

FWIW, given our use case, the 4o output is of value. I would use 4t were it not for the cost/speed implications. Hence, being able to use 4o would be helpful to our readers and would let us justify netting OpenAI more revenue for our API calls, etc. Everyone wins… but there appears to be an unintended issue/bug that precludes us from making the change, that’s all.

Hope that helps you to understand.

Understood. I don’t think the dev team really goes on here much from what I’ve heard through the grapevine.

I try to see it as “Would OpenAI release bad products unless they absolutely had to?”

They’re just bluffing and don’t really have much better stuff right now. And if they do, they don’t share it in the name of “safety”.

But I know there’s some product testing going on internally right now.

Truly I hope we do see a release that isn’t buggy and really dwarfs gpt-4 but it hasn’t happened yet.

They’re also a new company, so maybe in time it’ll all be fleshed out.

For now I’d just focus on what works best for you. I wouldn’t worry about “making everyone happy”; OAI will be just fine with or without your business, and they’re only going to care about you as a customer if you’re doing tens of millions in tokenage a month.

I have the same issue. It will send good results maybe 4/5 times. But that 20% makes it totally unusable because it sends back random garbage, seemingly ignoring instructions and temperature.

Prompt engineering might help, but I don’t find this model usable in production yet.

Now, one thing I did notice is that it’s super fast, which is really cool. But I would take a slower but consistent assistant over a fast but unreliable one.

How complex are your prompts? Multi-step?