Gibberish output with gpt-4o-mini

A user of my (work) app pointed out that they were getting gibberish in a specific conversation with the mini model (through the API).

I’ve been able to reproduce it through the API and through the playground. I opened a ticket but it doesn’t seem like they’re going to do anything about it.

The only way I've found to guarantee it doesn't do this is to set temperature = 1. Anything lower than that will at some point produce the same gibberish.

Sucks though because a temp of 1 returns some pretty wild responses…sometimes flat, sometimes super verbose.

Anyway, not sure if anyone has ideas on how to combat this without setting the temp to 1.

How to reproduce:
model = gpt-4o-mini
temp = 0.1 (just set it low to make sure it produces it)
max_tokens = 1000 // we’ve seen it even when allowing 10k

System: You are an AI assistant that helps people find information. // used a lot of different system prompts even no system prompt

User: What is the typical yield of steel powder in gas atomization

Assistant: …valid response…

User: Can you provide some citations for the mass throughput and the yield values?

Assistant: …See image below…

(if you don’t get the gibberish on the follow up, just ask it the follow up question again and it usually will give it to you).
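For anyone who wants to try this from code, here is roughly what the calls look like. This is just a minimal sketch assuming the official `openai` Python SDK; the model, temperature, max_tokens, and messages are the ones from the steps above.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

messages = [
    {"role": "system", "content": "You are an AI assistant that helps people find information."},
    {"role": "user", "content": "What is the typical yield of steel powder in gas atomization"},
]

# First turn: normally comes back as a valid response.
first = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0.1,
    max_tokens=1000,
    messages=messages,
)
messages.append({"role": "assistant", "content": first.choices[0].message.content})

# Follow-up turn: this is where the gibberish tends to show up.
messages.append({
    "role": "user",
    "content": "Can you provide some citations for the mass throughput and the yield values?",
})
second = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0.1,
    max_tokens=1000,
    messages=messages,
)
print(second.choices[0].message.content)
```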

My personal theory is that this happens because the model is making those references up and gets stuck in some weird loop. I've been able to reproduce this many times with a lot of different conversation parameters.

Sorry if this is a known issue, I tried searching and even asking the model, but no luck.


Seems like an indicator that the model’s too weak to do what you’re asking of it.

Some things you can theoretically still try to play around with are the frequency_penalty and presence_penalty parameters:

https://platform.openai.com/docs/api-reference/chat/create
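For example, something along these lines (just a sketch; the penalty values are illustrative, not recommendations):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0.1,
    max_tokens=1000,
    # Both penalties discourage the model from repeating tokens it has already
    # emitted, which can help break out of degenerate repetition loops.
    frequency_penalty=0.5,  # illustrative value, tune for your use case
    presence_penalty=0.3,   # illustrative value, tune for your use case
    messages=[
        {"role": "system", "content": "You are an AI assistant that helps people find information."},
        {"role": "user", "content": "Can you provide some citations for the mass throughput and the yield values?"},
    ],
)
print(response.choices[0].message.content)
```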

This has been a longstanding “issue” - or rather phenomenon - with LLMs. This stuff has almost disappeared as a concern as models grew stronger and bigger. But with OpenAI trending towards weaker, smaller and cheaper models, this stuff is becoming more common again.

I personally wouldn't play too much with the penalties, to be honest. I'd see if I can adjust my prompt/CoT so the model delivers smaller, bounded semantic chunks, as well as giving the model an "out" so it can reject requests it can't handle.
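For example, something like this as the system prompt (just a sketch of the kind of wording I mean, not a tested prompt):

```python
# Hypothetical system prompt that asks for bounded answers and gives the model an "out".
system_prompt = (
    "You are an AI assistant that helps people find information. "
    "Answer in short, clearly separated sections. "
    "If you cannot verify a citation or a numeric value, say so explicitly "
    "instead of inventing one."
)
```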

But the easiest way to deal with this is to just use a stronger model, probably.


Hi @beall3!

I tried to reproduce your problem with mini but didn't see those issues. The issue I do see is mostly that the citations it provides seem to be completely made up (at least I can't find them).

Until OpenAI releases the general search-with-citations functionality, I don't think their models (big or small) can be used in this way, unless you are actually doing a RAG solution yourself, or using GPTs or the Assistants API with manually curated data from Elsevier.

In the meantime you can try Perplexity - I ran your query and the results look OK to me, though I'm no expert! (see here)

Thanks, you make me feel better :slight_smile:

From a cost standpoint we try to push users to use the mini model for non-intensive tasks, so we're going to have to guard against stuff like this. You're right though, with 4o this works fine.

I think the frequency_penalty might be what we need.
