I’m using an assistant with 4o-mini to evaluate images, and I test edge cases to make sure it’s calibrated properly (e.g., asking it to evaluate a picture of an airplane but passing it an image of an apple). For about the last month I’ve been using the same prompt and getting consistent results: the model recognizes the image is not relevant (continuing the example above, it recognizes it isn’t an image of an airplane but an apple). In the last 24 hours the model has returned completely hallucinated results, as if it never evaluates the image at all (e.g., “the airplane image is xyz…” when in fact it’s an apple). Switching to the 4o model returns the proper response. Has anything happened or changed that would cause a sudden shift in performance?
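For reference, here is a minimal repro of that mismatch test outside the Assistants pipeline, using plain chat completions so both models can be compared side by side. The image URL and prompt wording are placeholders, not my production setup:

```python
# Minimal sketch: ask about an airplane, but send an apple image, on both models.
from openai import OpenAI

client = OpenAI()

APPLE_URL = "https://example.com/apple.jpg"  # hypothetical hosted test image
PROMPT = "Evaluate this picture of an airplane: describe the aircraft type and livery."

for model in ("gpt-4o-mini", "gpt-4o"):
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url", "image_url": {"url": APPLE_URL}},
            ],
        }],
    )
    # Expected (calibrated) behavior: the model points out the image is an apple.
    print(model, "->", resp.choices[0].message.content[:200])
```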
I don’t have a fingerprint of the model from before 21 hours ago, but it is still the same now as it was then (system_fingerprint='fp_483d39d857'), and it was the same for all requests (unlike other models, which can return several different fingerprints).
That’s something to look for if you are logging full API responses, as the fingerprint is supposed to be an indicator of backend determinism.
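If you aren’t already capturing it, something like this appends the fingerprint from each chat-completions response to a local log so a silent backend change shows up in your own records (the log path and helper name are just examples):

```python
# Append model + system_fingerprint for each response to a JSONL log.
import datetime
import json
from openai import OpenAI

client = OpenAI()

def log_fingerprint(resp, path="fingerprints.jsonl"):
    entry = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "id": resp.id,
        "model": resp.model,
        "system_fingerprint": resp.system_fingerprint,
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
)
log_fingerprint(resp)
```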
Images are subject to outside processing before they are tokenized, so that is an API-side factor that could also change, or break image handling, with a single keystroke on OpenAI’s end.
If you are using a hosted image URL, you can check your web server logs to see whether the API is retrieving it properly.
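Another way to rule out the URL-fetch step entirely is to send the image bytes inline as a base64 data URL; a rough sketch, assuming a local copy of the test image:

```python
# Send the image inline so no outside retrieval of a hosted URL is involved.
import base64
from openai import OpenAI

client = OpenAI()

with open("apple.jpg", "rb") as f:  # local copy of the test image
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What object is shown in this image?"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```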
It works well identifying Bored Ape NFTs and the image features of each in the Assistants playground, with an impossible 5,000+ input tokens on two images, just to make sure you pay the same price as sending images to gpt-4o anyway.
I tried to fool it with a second thread message about another image boredape33.jpg…
The example I gave above was a simplified version, but I’ve confirmed the prompt + image are exactly the same and I’m still seeing the degraded performance. I’ve even run identical tests via the playground and the API with a direct image upload (vs. an image URL), today versus a week prior.
One odd workflow I’ve found that works: if I send my prompt + image, it spits out a totally erroneous response, but if I then upload that same image again to that same thread, it properly assesses the image. The huge drawback is that it roughly doubles the tokens used, and given how the original workflow just stopped working out of the blue, I’m worried about the fragility of this one too.
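Roughly, the two-step workaround looks like this with the Assistants API (beta) and a file uploaded for vision; the assistant ID, file name, and prompt are placeholders, not my exact setup:

```python
# Sketch of the workaround: prompt + image first, then the same image alone.
from openai import OpenAI

client = OpenAI()
ASSISTANT_ID = "asst_..."  # your existing assistant

img = client.files.create(file=open("apple.jpg", "rb"), purpose="vision")
thread = client.beta.threads.create()

# Step 1: prompt + image — this is the message that comes back wrong for me.
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content=[
        {"type": "text", "text": "Evaluate this picture of an airplane."},
        {"type": "image_file", "image_file": {"file_id": img.id}},
    ],
)
client.beta.threads.runs.create_and_poll(thread_id=thread.id, assistant_id=ASSISTANT_ID)

# Step 2: re-send just the same image to the same thread and run again —
# this second run is the one that actually assesses the image correctly,
# at roughly double the image tokens.
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content=[{"type": "image_file", "image_file": {"file_id": img.id}}],
)
client.beta.threads.runs.create_and_poll(thread_id=thread.id, assistant_id=ASSISTANT_ID)

msgs = client.beta.threads.messages.list(thread_id=thread.id)
print(msgs.data[0].content[0].text.value)  # latest assistant reply
```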