Changes to 4o-mini in last 24hrs that would cause performance degradation?

Hi,

I’m using an assistant with 4o-mini to evaluate images, and I test edge cases to make sure it’s calibrated properly (e.g., ask it to evaluate a picture of an airplane but pass it an image of an apple). For the last ~1 month I’ve been using the same prompt and getting consistent results where the model recognizes that the image is not relevant (continuing the example above: recognizing it isn’t an image of an airplane but an apple). In the last 24 hours the model has returned completely hallucinated results, as if it doesn’t evaluate the image at all (e.g., “the airplane image is xyz…” when in fact it’s an apple). Switching to the 4o model returns the proper response. Has anything happened or changed that would cause a sudden shift in performance?
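
For context, the test is roughly equivalent to this minimal sketch (the instructions, prompt text, and image URL here are placeholders, not my actual assets):

```python
from openai import OpenAI

client = OpenAI()

def run_calibration_test(model: str, image_url: str) -> str:
    # Same assistant configuration for both models under test.
    assistant = client.beta.assistants.create(
        model=model,
        instructions="Evaluate the image the user provides and describe the airplane in it.",
    )
    thread = client.beta.threads.create()
    client.beta.threads.messages.create(
        thread_id=thread.id,
        role="user",
        content=[
            {"type": "text", "text": "Evaluate this picture of an airplane."},
            # Deliberately irrelevant image (an apple) to check calibration.
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    )
    run = client.beta.threads.runs.create_and_poll(
        thread_id=thread.id, assistant_id=assistant.id
    )
    messages = client.beta.threads.messages.list(thread_id=thread.id, run_id=run.id)
    return messages.data[0].content[0].text.value

# gpt-4o-mini now hallucinates an airplane; gpt-4o correctly notes it's an apple.
for model in ("gpt-4o-mini", "gpt-4o"):
    print(model, "->", run_calibration_test(model, "https://example.com/apple.jpg"))
```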

Thanks!


I don’t have a fingerprint of the model from more than 21 hours ago, but it is still the same as it was then, system_fingerprint='fp_483d39d857', and it was the same for all requests (unlike other models, which can return several fingerprints).

That can be something to look for if you are logging full API responses, as the fingerprint is supposed to be an indicator of determinism.
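
Assistants runs don’t surface that field as far as I know, but if you can reproduce the request through Chat Completions, logging it is a one-liner (sketch, with a throwaway prompt):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
)

# system_fingerprint identifies the backend configuration that served the
# request; a change between runs can indicate a deployment change.
print(response.model, response.system_fingerprint)
```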

Images are subject to outside processing before they are tokenized, so that may be an API-side factor that could also change or break image handling with a keystroke.
If you are using a hosted image URL, you can check your web server logs to see whether the API is retrieving it properly or not.
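
One way to take the URL-fetch step out of the equation entirely is to send the image inline as a base64 data URL; here’s a sketch via Chat Completions (the file path is a placeholder):

```python
import base64

from openai import OpenAI

client = OpenAI()

# Read the image locally and embed it as a data URL so no server-side
# fetch of a hosted URL is involved.
with open("apple.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```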

It works well identifying Bored Ape NFTs and the image features of each in the Assistants playground – with an impossible 5000+ input tokens on two images, just to make sure you pay the same price as sending images to gpt-4o anyway.


I tried to fool it with a second thread message about another image boredape33.jpg…



The example I gave above was a simplified version, but I’ve confirmed the prompt + image are exactly the same and I’m still seeing the degraded performance. I’ve even run identical tests via the playground and the API with a direct image upload (vs. an image URL), today versus a week prior.

One odd workflow I’ve found that works: if I upload my prompt + image, it spits out a totally erroneous response, but if I afterwards upload that same image again to that same thread, it will properly assess the image. The huge drawback is that it roughly doubles the tokens used, and given how the original workflow just stopped working out of the blue, I’m worried about the fragility of this one as well.
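
For anyone else hitting this, the workaround looks roughly like this (a sketch; the file name, instructions, and small `ask` helper are just for illustration):

```python
from openai import OpenAI

client = OpenAI()

# Upload the image once so it can be referenced by file_id in the thread.
image_file = client.files.create(file=open("apple.jpg", "rb"), purpose="vision")

assistant = client.beta.assistants.create(
    model="gpt-4o-mini",
    instructions="Evaluate the image the user provides.",
)
thread = client.beta.threads.create()

def ask(content):
    client.beta.threads.messages.create(thread_id=thread.id, role="user", content=content)
    run = client.beta.threads.runs.create_and_poll(
        thread_id=thread.id, assistant_id=assistant.id
    )
    msgs = client.beta.threads.messages.list(thread_id=thread.id, run_id=run.id)
    return msgs.data[0].content[0].text.value

# First pass: prompt + image -> currently comes back erroneous.
print(ask([
    {"type": "text", "text": "Evaluate this picture of an airplane."},
    {"type": "image_file", "image_file": {"file_id": image_file.id}},
]))

# Second pass: the same image alone on the same thread -> assessed correctly,
# but this roughly doubles the tokens consumed.
print(ask([
    {"type": "image_file", "image_file": {"file_id": image_file.id}},
]))
```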