GPT-4o - Hallucinating at temp:0 - Unusable in production

Hi folks,

We’re finding that 4o hallucinates fake information in quite strict summarisation prompts that work perfectly in gpt-4-turbo and gpt-3.5-turbo-0125.

NB. running at temperature = 0 in case that’s relevant.

Our context is summarising a json object of documents (including content, dates, and local urls). 4o invents fake urls – and maybe invents facts too: we’re checking that.
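For anyone who wants to check for invented urls mechanically, here is a minimal sketch of one way to do it, not our actual pipeline. It assumes the urls are absolute http(s) links (relative local paths would need a different pattern), and the names `collect_urls` and `invented_urls` are made up for illustration.

```python
import json
import re

# Assumption: URLs are absolute http(s) links; adjust the pattern for relative paths.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>)\]]+")


def _find_urls(text):
    """Return the set of URL-looking substrings in text, minus trailing punctuation."""
    return {match.rstrip(".,;") for match in URL_PATTERN.findall(text)}


def collect_urls(value):
    """Recursively gather every URL found in the string values of parsed JSON."""
    found = set()
    if isinstance(value, dict):
        for item in value.values():
            found |= collect_urls(item)
    elif isinstance(value, list):
        for item in value:
            found |= collect_urls(item)
    elif isinstance(value, str):
        found |= _find_urls(value)
    return found


def invented_urls(source_json, summary):
    """Return URLs that appear in the summary but nowhere in the source JSON."""
    source_urls = collect_urls(json.loads(source_json))
    return sorted(_find_urls(summary) - source_urls)
```

Running `invented_urls(prompt_json, model_output)` after each call and flagging any non-empty result gives a cheap regression check when comparing models. It only catches invented urls, not invented facts.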

The hallucination is consistent.

Very happy to share the full prompt (or our API debug patterns) internally. Not super-happy to copy and paste it here, though.

Is anyone else seeing reliability issues?

I personally have not seen hallucinations so far. Based on limited testing, though, I have observed that the model responds to instructions differently from previous models, which in all fairness is not entirely unexpected.

So perhaps you might need to play around with the prompt a little bit to achieve the same outcome.

Would you be able to share an example? I haven’t experienced this with gpt-4o so far.

I am having a similar issue with GPT 4o when returning a JSON object at temperature 0. GPT 4 formats it properly each time. GPT 4o will eventually start sending back responses that seem to ignore instructions. I am a fairly new developer, so I am not that good at explaining, but I wanted to confirm the issue. I will stick to 4 for now.
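A minimal sketch of how responses like this could be caught before they reach the rest of the application, assuming the model is asked for a single JSON object with known top-level keys. The function name `parse_model_json` and the key names in the usage note are hypothetical.

```python
import json


def parse_model_json(raw, required_keys):
    """Parse a model response as a JSON object, or raise ValueError saying why not."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("response is valid JSON but not an object")
    # Report every missing key at once so failures are easy to log and compare.
    missing = [key for key in required_keys if key not in data]
    if missing:
        raise ValueError(f"response is missing keys: {missing}")
    return data
```

Calling `parse_model_json(response_text, ["title", "summary"])` and counting how often it raises across repeated runs is one way to put a number on "eventually starts ignoring instructions" for each model.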

I can’t find the thread now, but if I remember right, temp 0 allows OpenAI to select the “appropriate” temperature, which might be something other than 0… Personally, I have often had issues with temp 0, so I usually set it to 0.01 or lower.


Example prompt:

Maybe this is only a bug in the search algorithm, but the prices are always imaginary.

Tested 4o this morning on a prompt sequence that 3.5 consistently returns good content for. The first 2 attempts with 4o returned text that devolved into gibberish. Gave up, assuming it’s not ready for production. Didn’t alter the temperature, but maybe I’ll go back and try that.

My best guess is that certain types of browsing activity that are akin to web scraping, in this case price scraping, are not supported (as in not allowed by ChatGPT). So what it does instead is identify the product names from the information in the URL, perform a regular search, and pick the prices from one of those sources.