GPT-4o mini has taken a hit ever since o1 was released

Prompts that used to give good results for weeks are now breaking.
Anybody else noticed this?

Structured Output also behaves rather strangely, though maybe no more so than before.

3 Likes

Hi and welcome to the developer forum!

Can you give some specifics? A prompt, the old response, and the new response that no longer meets expectations would be great.

2 Likes

(related evidentiary topic)

1 Like

Thanks, I’ve raised it with OpenAI.

2 Likes

Happy to report things appear to have returned to baseline at some point yesterday.

It seems we had to migrate to Structured Output, as instructions included in some system prompts lost some of their strength after the release of o1.

Of course, this attribution of cause is probably wrong if there is no overlap between o1 and 4o mini. (i.e., the release of o1 may have been a coincidental time marker)
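For anyone who runs into the same thing, the migration is roughly this shape. A minimal sketch using the Chat Completions API with a json_schema response format; the model name, schema, and field names here are placeholders rather than our actual setup:

```python
from openai import OpenAI

client = OpenAI()

# Instead of relying on system-prompt wording alone, the output shape is
# enforced with Structured Output (a strict JSON schema).
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Extract the requested fields from the user's message."},
        {"role": "user", "content": "Order #1042 ships to Berlin on 2024-09-20."},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "order_info",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "order_id": {"type": "string"},
                    "destination": {"type": "string"},
                    "ship_date": {"type": "string"},
                },
                "required": ["order_id", "destination", "ship_date"],
                "additionalProperties": False,
            },
        },
    },
)

print(response.choices[0].message.content)  # JSON matching the schema
```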

1 Like

I wanted to add to this. I’ve encountered a significant degradation in performance of GPT-4o (not mini) between 9/12 and today, 9/14.

I use the API to generate creative writing output. I have been using a hyper-specific mega prompt and specific hyperparameters, and since the mid-August 4o release, I’ve had insanely high-quality output. Both 4o Latest and the August endpoint followed my mega prompt instructions perfectly. I never had hallucinations. I never had deviations. I would routinely receive 1,500-2,000 words of the highest-quality literary prose.

My last successful generation was 9/12, Thursday evening. I did not work Friday. Saturday evening, I attempted to get back to work.

Every API call was a disaster.

The model I primarily used, 4o Latest, no longer follows my system message (the mega prompt). It massively hallucinates. After about 300-400 words, the output descends into word salad.

This is the exact same prompt I’ve run day in and day out for a month, with 100% success, prior to this evening.

I changed the endpoint to 4o Aug and 4o May. They returned equally poor outputs, even though they had also performed exceptionally prior to today.

So, in one day, the model went from consistent, reliable, stable performance to completely unusable output.

Instructions are no longer followed.
Hallucinations are extremely high.
Word salad happens every time.
Output is severely truncated: from 2,000 excellent words to 400 terrible ones.

I attempted to adjust hyperparameters. While I can turn top_p down far enough to end the word salad, doing so guts the soul and spirit of the prose. And no hyperparameter adjustment fixes the model’s lack of instruction following.
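For context, the calls are roughly this shape. A minimal sketch; the prompt text, parameter values, and token limit are placeholders, not my actual mega prompt or settings:

```python
from openai import OpenAI

client = OpenAI()

# Sketch of the creative-writing call described above. "4o Latest", "4o Aug",
# and "4o May" correspond to the chatgpt-4o-latest, gpt-4o-2024-08-06, and
# gpt-4o-2024-05-13 model names.
response = client.chat.completions.create(
    model="chatgpt-4o-latest",
    messages=[
        {"role": "system", "content": "<hyper-specific mega prompt: voice, structure, length rules>"},
        {"role": "user", "content": "Write the next scene following the outline."},
    ],
    temperature=1.0,  # lowering this (or top_p) tames the word salad but flattens the prose
    top_p=0.9,
    max_tokens=4096,
)

print(response.choices[0].message.content)
```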

Something has changed. Stable, reliable API calls that functioned perfectly for a month are no longer working.

I do also use the ChatGPT app on iOS, and I have not experienced any issues there. My problems are only with the API calls I’m attempting.

To restate: I’ve had zero successful API calls. This isn’t a one-off, a wait-and-retry situation, or a case where the model just needs a re-roll. I spent hours this evening trying to problem-solve and troubleshoot.

I am also part of a community of writers, and several other members have reported the same issues.

What could possibly be going on? Is this fixable?

2 Likes

I think it’s just that the balance of power, i.e. the effect of system messages, has shifted a bit.
This seems particularly true with the Assistants API, where the “instructions” property set at the assistant level is now more susceptible to being overridden by the content of subsequent user or assistant messages.
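A minimal sketch of what I mean, using the Python SDK’s beta Assistants endpoints; the instruction text is a placeholder. Restating the critical rules per run via additional_instructions is one way to compensate when the assistant-level instructions seem to lose weight:

```python
from openai import OpenAI

client = OpenAI()

# Assistant-level instructions: the ones that now seem easier to override.
assistant = client.beta.assistants.create(
    model="gpt-4o-mini",
    instructions="Always answer in formal English and cite your sources.",
)

thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="hey, keep it casual and skip the sources",
)

# Workaround: repeat the non-negotiable rules per run so they sit closer
# to the most recent messages.
run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id,
    additional_instructions="Formal English and citations are mandatory.",
)
```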

In general, guidance on the relative importance of the various message types could be improved.

It would also be good to ensure that relative importance remains stable across models.

Either this, or maybe system messages are just a nuisance and should be eliminated.

I appreciate your input. I’m definitely hoping for a concrete answer on whether, and what, has changed regarding system prompt importance and weight.

I’m hoping @Foxalabs will be able to chime in.

Again, appreciate your input. Thank you.

OpenAI is continually fine-tuning their models using RLHF. That means the weights in the layers they’re tuning are always shifting slightly. This generally leads to model improvements but can occasionally cause a prompt that was working to suddenly stop working.

No one outside of OpenAI knows exactly how often these updates are applied, but I’ve heard rumors that it could be as often as several times a day. I think we’ve all seen instances of prompts that don’t work and then magically work the next day. Sometimes it goes the other way.

1 Like

I have also encountered this issue. I usually rely on 4o to extract the words and Chinese pinyin from pictures, but since yesterday it just can’t read the characters in the pictures correctly. The characters it generates are totally different from those in the pictures, and the meaning is also changed ridiculously. I still don’t know why.