GPT-4o - Hallucinating at temp:0 - Unusable in production

Hi folks,

We’re finding that 4o hallucinates fake information in quite strict summarisation prompts that work perfectly in gpt-4-turbo and gpt-3.5-turbo-0125.

NB: we’re running at temperature = 0, in case that’s relevant.

Our use case is summarising a JSON object of documents (including content, dates, and local URLs). 4o invents fake URLs, and possibly invents facts too; we’re checking that.

The hallucination is consistent.
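
For context, the call looks roughly like this (a minimal sketch with placeholder documents and a naive URL check, not our actual prompt or data):

```python
import json
from openai import OpenAI  # openai>=1.0 style client

client = OpenAI()

# Placeholder documents standing in for our real data (content, dates, local URLs).
documents = [
    {"content": "Quarterly report...", "date": "2024-03-01", "url": "/docs/q1-report"},
    {"content": "Board meeting minutes...", "date": "2024-04-15", "url": "/docs/board-minutes"},
]

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=[
        {
            "role": "system",
            "content": "Summarise the documents. Only reference URLs that appear in the input JSON.",
        },
        {"role": "user", "content": json.dumps(documents)},
    ],
)
summary = response.choices[0].message.content

# Naive check: flag any URL-like token in the summary that isn't in the source data.
known_urls = {d["url"] for d in documents}
suspects = [t.strip(".,()") for t in summary.split() if t.startswith("/docs/")]
invented = [u for u in suspects if u not in known_urls]
print(summary)
print("Possibly invented URLs:", invented)
```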

Very happy to share the full prompt (or our API debug patterns) internally. Not super-happy to copy and paste here tho.

Is anyone else seeing reliability issues?

3 Likes

I personally have not seen hallucinations so far, but based on limited testing I have observed different model behaviour in response to instructions relative to previous models, which in all fairness is not entirely unexpected.

So you might need to play around with the prompt a little to achieve the same outcome.

Would you be able to share an example? I haven’t experienced this so far with gpt-4o.

I am having a similar issue with GPT-4o when returning a JSON object at temperature 0. GPT-4 can format it properly every time, but GPT-4o will eventually start sending back responses that seem to ignore the instructions. I am a fairly new developer, so I am not great at explaining this, but I wanted to confirm the issue. I will stick with GPT-4 for now.

1 Like

I can’t find the thread now, but if I remember right, temp 0 allows OpenAI to select the “appropriate” temperature, which might be something other than 0… Personally, I have often had issues with temp 0, so I usually set it to a small value like 0.01 instead.
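
Something like this (a sketch; the prompt is just a placeholder):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.01,  # a small non-zero value instead of exactly 0
    messages=[{"role": "user", "content": "your prompt here"}],  # placeholder
)
print(response.choices[0].message.content)
```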

2 Likes

Example prompt:

Maybe this is only a bug in the search algorithm, but the prices are always imaginary.

1 Like

Tested 4o this morning on a prompt sequence that 3.5 consistently returns good content for. First 2 attempts with 4o returned text that devolved into gibberish. Gave up, assuming it’s not ready for production. Didn’t alter temperature but maybe I’ll go back and try that.

1 Like

My best guess is that certain types of browsing activity akin to web scraping, in this case price scraping, are not supported (as in not allowed by ChatGPT). So what it does instead is identify the product names from the information in the URL, perform a regular search, and pick the prices from one of those sources.

Also experiencing hallucinations with 4o in a RAG chatbot. 3.5 performs accurately enough, but 4o was too unreliable in our dev environment, so we rolled back to 3.5 without promoting 4o to higher environments. Some use cases of the bot are answering domain questions, suggesting actions, reviewing user work, writing scripts, and explaining contextual objects like errors or activity. In all of these cases, 4o consistently responds with overly verbose output (even when explicitly told to be succinct) that contains hallucinations in a significant proportion of responses, well over 50% of the time.

I have been using these models for a few years now, and the fact that they struggle to simply say they’re unsure about something, and instead respond confidently with a hallucination, is beyond me!

#BringBackSky

I’ve been noticing more errors too. For example, given a set of 9 letters (Countdown-style) and asked to find the longest English word, 4o will often use a letter twice or include a letter that isn’t in the set. One time it even made up a word.

This is significantly more likely if, say, you provide a word and ask 4o to find a longer one, but no valid word longer than the provided word can actually be made from the set of letters.
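
If you want to check the answers automatically, it’s just a letter-count comparison against the 9-letter set, something like this (sketch with made-up letters and words):

```python
from collections import Counter

def can_be_formed(word: str, letters: str) -> bool:
    """True if `word` can be spelled from `letters` without reusing any letter."""
    available = Counter(letters.lower())
    needed = Counter(word.lower())
    return all(available[ch] >= n for ch, n in needed.items())

# Made-up example: a Countdown-style set of 9 letters and model answers to validate.
letters = "rtelabnos"
print(can_be_formed("bloaters", letters))   # True  - uses each letter at most once
print(can_be_formed("tolerable", letters))  # False - reuses 'l' and 'e'
```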

Sometimes it helps to tell the model what not to do, so:

I would add to your prompt: do not use prior knowledge, do not search additional websites for cheaper prices, only use the data from the link provided.
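
If you’re doing this through the API, roughly like this (just a sketch; the product URL is a placeholder and the final fallback instruction is my own illustrative addition):

```python
from openai import OpenAI

client = OpenAI()

system_prompt = (
    "Do not use prior knowledge.\n"
    "Do not search additional websites for cheaper prices.\n"
    "Only use the data from the link provided.\n"
    # Extra guardrail (an addition, not part of the suggestion above):
    "If the price is not present in the provided data, say you could not find it."
)

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=[
        {"role": "system", "content": system_prompt},
        # Placeholder product URL for illustration only.
        {"role": "user", "content": "What is the price of the product at https://example.com/product/123 ?"},
    ],
)
print(response.choices[0].message.content)
```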

We tested GPT-4o on 100 questions in a German RAG pipeline. While GPT-4 answered reliably whenever the right chunks were presented, GPT-4o came back with the default answer “chunk not found” in >30% of the cases where GPT-4 produced a satisfying answer. We will not recommend it to our customers at the moment!
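
The evaluation itself just counts how often each model falls back to the default answer, roughly like this (simplified sketch; run_rag_pipeline and questions.json are placeholders for our internal setup):

```python
import json

DEFAULT_ANSWER = "chunk not found"

def run_rag_pipeline(model: str, question: str) -> str:
    """Placeholder for our internal retrieval + generation step."""
    raise NotImplementedError("plug in your own RAG pipeline here")

def default_answer_rate(model: str, questions: list[str]) -> float:
    """Fraction of questions where the model fell back to the default answer."""
    misses = sum(DEFAULT_ANSWER in run_rag_pipeline(model, q).lower() for q in questions)
    return misses / len(questions)

# questions.json is a placeholder for our file of 100 German test questions.
questions = json.load(open("questions.json", encoding="utf-8"))
for model in ("gpt-4", "gpt-4o"):
    print(model, f"{default_answer_rate(model, questions):.0%} default answers")
```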

1 Like

Have you considered using 3.5?

We do…

The rationality of the output prose is marginally stronger with 4o, which provides advantages… but the unreliability rules out what would otherwise be an upgrade.

So what are you complaining about? Use the 10x cheaper model that works better. Try and remember everyone on this forum is still like miles ahead of the general public on AI. You don’t need 4o to compete with your product. It’s still super early in the game, remember that

Thanks. I’m not complaining - I’m reporting an issue (in an attempt to be helpful to the development team).

FWIW, given our use case, the 4o output is of value. I would use 4t were it not for the cost/speed implications. Hence, being able to use 4o would be helpful to our readers and would let us justify netting OpenAI more revenue for our API calls, etc. Everyone wins… but there appears to be an unintended issue/bug that precludes us from making the change, that’s all.

Hope that helps you to understand.

Understood. I don’t think the dev team really goes on here much from what I’ve heard through the grapevine.

I try to see it as “Would OpenAI release bad products unless they absolutely had to?”

They’re just bluffing and don’t really have much better stuff right now. And if they do, they don’t share it in the name of “safety”.

But I know there’s some product testing going on internally right now.

Truly I hope we do see a release that isn’t buggy and really dwarfs gpt-4 but it hasn’t happened yet.

They’re also a new company, so maybe in time it’ll all be fleshed out.

For now I’d just focus on what works best for you. I wouldn’t worry about “making everyone happy”; OAI will be just fine with or without your business, and they’re only going to care about you as a customer if you’re doing tens of millions in tokenage a month.

I have the same issue. It will send good results maybe 4/5 times. But that 20% makes it totally unusable because it sends back random garbage, seemingly ignoring instructions and temperature.

Prompt engineering might help, but I don’t find this model usable in production yet.

Now, one thing I did notice is that it’s super fast, which is really cool. But I would take a slower but consistent assistant over a fast but unreliable one.

How complex are your prompts? Multi-step?