GPT-4o hallucinates a lot

We found that gpt-4o consistently hallucinates. In this case, we never mentioned any store location in Ringsted. However, gpt-4o still fabricates a store in its answer:

Adresse: Ringstedet, Klosterparks Allé 10, 4100 Ringsted
Åbningstider: Mandag-fredag: 9.30-19, Lørdag: 10-17, Søndag: 11-16
Tlf: 50603398
Email: ringsted4100@gmail.com

This hallucination is quite consistent. gpt-4-turbo and even gpt-3.5 don't have the same issue and correctly acknowledge that there is no store in this location.

This is the preset:
https://platform.openai.com/playground/p/yEwza9Ibnw3RnQxGNnTwzOqd

Accuracy (i.e., not hallucinating):

GPT-4-turbo > GPT-3.5-turbo > GPT-4o

4 Likes

GPT 3.5 vs GPT 4o

gpt-4o has a completely different architecture compared to the other models, and as such it may require slightly different prompt strategies. In your system instruction, I would try to include specific wording (or reinforce existing wording) stating that the model should rely strictly on the contextual information provided when forming a response.

If you have an existing system prompt, perhaps you can share it and we can take a closer look at how to optimize it to limit the risk of hallucination.
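As a minimal sketch of what such strict wording could look like when calling gpt-4o through the API (assuming the current openai Python SDK; the system message, placeholder context, and Danish question are illustrative, not the OP's actual preset):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative system message: rely strictly on the supplied context and
# refuse when the context does not contain the answer.
system_message = (
    "You are a customer-support assistant.\n"
    "Answer ONLY from the CONTEXT INFORMATION below.\n"
    "If the context does not contain the answer, say that you do not have "
    "that information. Never invent addresses, phone numbers, or emails.\n\n"
    "CONTEXT INFORMATION:\n"
    "<retrieved documents go here>"
)

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,  # lower temperature reduces, but does not eliminate, made-up details
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": "Har I en butik i Ringsted?"},
    ],
)
print(response.choices[0].message.content)
```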

2 Likes

I have the same experience. Every day 4o hallucinates, especially when I ask about companies and/or software tools. For a non-existent CRM, it hallucinated detailed pricing, a large customer base in several countries I had explicitly named, and the benefits of extensive integration and automation!
It's completely useless if I always have to chase GPT to check whether it is hallucinating.

4.0 wins. Paid. What a surprise.

1 Like

While I was unable to reproduce the example from the OP in the Playground, according to these reports this model frequently hallucinates when presented with addresses from the database.

This is undesirable behavior as it would break the RAG pipeline.

Since GPT-4o has just been released, it might be wise to refrain from using it in production for the time being.

Please also refer to the discussion at this URL.

1 Like

I feel the same way.
Instead of saying “I don’t know” or “I have no information” about something it doesn’t know (for example, when asked about a somewhat crazy Japanese doujinshi manga), it keeps producing hallucinations based solely on the wording of the work’s title.
This is the kind of blatant hallucination I have rarely seen since GPT-4.
Coupled with the fluency of the language, I would not be able to recognize this hallucination if I did not already know the correct information.

1 Like

I have told it to strictly follow the context to answer the question. It just doesn’t work.

This is the base prompt:

Answering Queries: Do not make up the information if the information is not in the CONTEXT INFORMATION. Only follow facts from the CONTEXT INFORMATION.

This is the Playground URL:
https://platform.openai.com/playground/p/yEwza9Ibnw3RnQxGNnTwzOqd

I found 4o hallucinating 75% of the time: replying with the same low-quality code, ignoring prompts, ignoring remarks, producing just “something”. In that respect it’s dumber than 3.5 and totally useless. I went back to the original GPT-4. If OpenAI isn’t able to fix it, I think I’ll cancel my Plus subscription.

2 Likes

I tried to reproduce the Playground example above on my end but couldn’t, though I don’t know why.

So I called the API directly in another environment. There the issue was reproduced, but it stopped occurring after several more attempts.

I hope the improvement applies to gpt-4o responses in general; if it only holds for this particular prompt, that would be unfortunate 🤔
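For anyone who wants to gauge how often the behaviour reproduces, a rough sketch that sends the same request several times and counts the suspicious answers (the system prompt here is a hypothetical stand-in, since the actual preset is only visible through the Playground link; the openai Python SDK is assumed):

```python
from openai import OpenAI

client = OpenAI()

messages = [
    # Hypothetical stand-in for the preset's real system prompt and context.
    {"role": "system", "content": (
        "Only answer from the CONTEXT INFORMATION. "
        "CONTEXT INFORMATION: (store list that contains no Ringsted location)"
    )},
    {"role": "user", "content": "Har I en butik i Ringsted?"},
]

hallucinated = 0
for _ in range(10):
    reply = client.chat.completions.create(model="gpt-4o", messages=messages)
    text = reply.choices[0].message.content.lower()
    # Crude heuristic: a correct Danish refusal normally contains "ikke" ("not"),
    # while a hallucinated store listing usually does not.
    if "ikke" not in text:
        hallucinated += 1

print(f"Suspicious answers: {hallucinated}/10")
```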


Can you share your preset with us? I want to test it to see how it works; I am curious.


Sorry. I don’t understand Danish at all, so I mistakenly generated text following the initial user message “Har I nogen butikker i København?”

I regenerated the text following the user message starting with “Har I en butik i Ringsted?”, and it did indeed generate text stating that there is a store in Ringsted.
It also generated a non-existent email address.

This might be related to the quality of the training data, which has also been an issue for Chinese.
In other words, the lack of high-quality Danish training data might be causing these hallucinations. That’s just my two cents.

3 Likes

Yup! Once I was testing the new language skills with some translations using GPT-4o, and when I asked it to translate “you are a very interesting person” into Japanese, it basically answered, in correct Japanese, “Thanks for the compliment.”

It was just on this one occasion, but it was indeed an interesting behaviour/response.

1 Like

Are you sure you told it to translate? :grin:

1 Like

Yes, 100% sure. The prompt was very simple: “can you translate in correct Japanese - you are a very interesting person?” :slightly_smiling_face: Maybe a little bit too simple, but after I corrected the GPT-4o response, it understood and translated perfectly every time.

1 Like

Here it is:
https://platform.openai.com/playground/p/yEwza9Ibnw3RnQxGNnTwzOqd
Previously I wasn’t able to upload links.

2 Likes

The knowledge that you introduce would be more authoritative if it were part of a system message.

The system message could use more structure, such as sections for “information about our company”.

You need to guide the AI along a path of generating only true data, not simply say “Only answer from CONTEXT INFORMATION”. On its own that has little value, because everything the AI produces is pretrained, even how to speak, and all it does is produce text.

A section like “here are all our locations, which you must repeat verbatim; if there is no relevant location for the user, you must instead answer that we do not have a location to service them” would produce a proper denial instead of allowing the AI to keep producing a location of its own imagining.
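As a purely illustrative sketch of that structure (the sections, cities, and placeholder details below are invented, not the OP’s data):

```python
# Illustrative structure only; every detail below is a placeholder.
SYSTEM_MESSAGE = """\
# Information about our company
<general company description relevant to this conversation>

# Our store locations (repeat these verbatim when asked)
- København: <address, opening hours, phone, email>
- Odense: <address, opening hours, phone, email>

# Rules
- If the user asks about a location that is NOT listed above, answer that we
  do not have a store that can service them. Never invent addresses, phone
  numbers, or email addresses.
- If the user asks for a link, give only URLs that appear in the context;
  do not describe their contents.
"""
```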

The fatal flaw here is talking about and relying on URLs that the AI has no ability to read.

The system prompt is for a more complex task and is not designed for locations only.

  1. URLs: this is not related to the locations at all. We want the chatbot to give those URLs to the user if the user asks for something, and we don’t expect the chatbot to read the content of the URLs. So we have to keep them.
The system message could use more structure, such as sections for “information about our company”.

There is a ton of information, and we use RAG to serve the company location information. We cannot put all of the company’s information in the system prompt.

“here are all our locations, from which you must repeat verbatim; if there is not a relevant location for the user, you must instead answer that we do not have a location to service them”

I am not sure how that is feasible in our case. The chatbot is not used to answer only location questions; locations are just one of the things a user can ask about. What if the user doesn’t ask a location question?

It seems you are saying that the gpt-4o model is not smart enough, and that I have to let it do only one task at a time, such as repeating the locations verbatim?
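For what it’s worth, one possible middle ground is to keep the general-purpose system prompt and append the strict location rule only when retrieval actually returns location documents. A rough sketch, where `retrieve_chunks` is a hypothetical placeholder for the existing RAG retrieval step:

```python
from openai import OpenAI

client = OpenAI()

BASE_PROMPT = "You are our support chatbot. Answer only from the CONTEXT INFORMATION below."

def build_messages(user_question, retrieve_chunks):
    """Assemble messages, adding the stricter location rule only when the
    retriever returns location documents. `retrieve_chunks` is a hypothetical
    callable returning dicts like {"text": ..., "type": ...}."""
    chunks = retrieve_chunks(user_question)
    context = "\n\n".join(chunk["text"] for chunk in chunks)

    system = f"{BASE_PROMPT}\n\nCONTEXT INFORMATION:\n{context}"
    if any(chunk.get("type") == "location" for chunk in chunks):
        system += (
            "\n\nThe store locations above are the ONLY locations we have. "
            "Repeat them verbatim when relevant; if none of them match the "
            "user's area, say that we do not have a store there."
        )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_question},
    ]

# Example (with your own retriever):
# messages = build_messages("Har I en butik i Ringsted?", retrieve_chunks)
# reply = client.chat.completions.create(model="gpt-4o", messages=messages)
```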

I’m still using 4, because 4o is unusable.

Faster, yeah. But dumber.

1 Like