Strange/bad behavior of the OpenAI API with vision models

We would like to use the OpenAI API with a vision model to analyze photographs of insurance events.
We initially tested the concept using ChatGPT and concluded that the results were sufficient to proceed to an implementation using the API.

Once we started using the API (with the same prompts!!!), our conclusions are:
1) the gpt-4o model does not process photographs at all
2) comparing the results of the gpt-4o-mini model between ChatGPT and the API, the API results are significantly worse

Point 1:
We have tested the API call more than 5 times and the API response is fully consistent. We receive the following response:
“I’m sorry, but I cannot provide an analysis of photographs or visual content as you requested. My capabilities are limited to text-based analysis and generation…” !!!
We have tried gpt-4o with detail = “high” and the result is the same. We are really confused by this behavior.

Point 2:
Using the same prompt and the same images, we compared the results of the gpt-4o-mini model in ChatGPT and via the API. We noticed a significant quality difference between ChatGPT (reasonably good) and the API (my manager qualified it as “unusable”).
For instance, ChatGPT is able to distinguish pictures of a “gazebo” from a “house”, while the API, with the same prompt, consistently recognizes these pictures as “house”.
Unfortunately, I cannot publish the client photographs in this public forum, and the prompt on its own probably does not make sense.

Can you advise what to do now? We have invested quite significant effort in this solution and the final results are not very good…

Good day Jiri.

This sounds like a prompting or structuring issue. The gpt-4o model can process images, and is trained to know that it can process images. ChatGPT has its own instructions and in some cases is a distinct model. Something is badly wrong if the API results aren’t better than ChatGPT’s.

What happens sometimes is that the prompt hits an edge case where artifacts of previous training surface, causing the model to respond in bizarre ways (like saying it’s incapable of reading images).

The prompt is very important. If you can post it here I bet we can solve this problem.

It’s worth mentioning that it may be better to use a local model. SigLip2 was just released; it could not only tag a gazebo out of the box, but also be fine-tuned to pay more attention to the areas you are looking for.
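If you go that route, here is a minimal zero-shot sketch with the Hugging Face transformers pipeline (the checkpoint name and file name are assumptions; substitute whatever SigLIP 2 checkpoint you actually use):

```python
from PIL import Image
from transformers import pipeline

# Zero-shot image classification with a SigLIP-family checkpoint.
# The checkpoint name is an assumption; a recent transformers version is needed for SigLIP 2.
classifier = pipeline(
    "zero-shot-image-classification",
    model="google/siglip2-base-patch16-224",
)

image = Image.open("claim_photo.jpg")  # placeholder file name
labels = ["gazebo", "house", "garage", "pergola", "swimming pool"]

# One score per candidate label, e.g. to separate "gazebo" from "house".
for result in classifier(image, candidate_labels=labels):
    print(result["label"], round(result["score"], 3))
```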

The prompt is the following:
Based on the provided photographs, analyze:
- cause_of_damage_event
- object_of_natural_event
- specification_of_damaged_part_of_object

  and finally, provide a summary description of the damage in the Czech language.

  cause_of_damage_event:
  - zivelni udalost-vitr
  - zivelni udalost-voda
  - zivelni udalost-kroupy
  - zivelni udalost-prepeti

  object_of_natural_event:
  - dům
  - altán, zahradní dům
  - garáž
  - pergola
  - bazén
  - počítač, notebook
  - neurčené zařízení

  specification_of_damaged_part_of_object:
  - střecha
  - okap
  - stěna
  - strop
  - podlaha
  - dveře
  - balkón
  - solární panel
  - dlažba
  - elektronika
  - vnější kryt

  summary_description: - provide a textual description of the damage here

  Follow these instructions:
  - Analyze exceptional circumstances in the photographs.
  - Compare these circumstances with the above-listed options.
  - A combination of natural causes (wind, water) is possible. In this case, list all identified causes in the YAML file.
  - A combination of damaged object parts (roof, paving) is possible. In this case, list all identified damaged parts in the YAML file.
  - Some information (cause_of_damage_event, specification_of_damaged_part_of_object, etc.) may not be determinable; in this case, leave the result empty.
  - An open device does not automatically indicate overvoltage – the casing may be deliberately opened by the user to present the damage. If there are no direct signs of overvoltage (e.g., burnt components, conductor deformation, or signs of electrical arcing), do not include overvoltage as a cause of damage. If the cause cannot be clearly determined, leave the cause_of_damage_event field empty.

  Finally, generate a YAML text with the following format:
  - pricina_skodni udalosti [cause_of_damage_event]
  - objekt_zivelni_udalosti [object_of_natural_event]
  - specifikace_poskozene_casti_objektu [specification_of_damaged_part_of_object]
  - souhrnny popis skody [summary_description_of_damage]

  The expected output is YAML text ONLY (as it will be processed automatically).

When I tried this prompt the model was able to read the image(s). My current thought is that the code is not properly delivering the payload.
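To rule that out, here is a minimal sketch (file name and prompt text are placeholders) of how an image payload is typically attached in a Chat Completions call:

```python
import base64

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode the local photo as a base64 data URL (a public https URL also works).
with open("claim_photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Based on the provided photographs, analyze ..."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{b64}",
                        # "detail" is optional and belongs inside image_url, next to "url"
                        "detail": "high",
                    },
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```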

Try going to the OpenAI Playground and sending the image and instructions to see if it works better there:

https://platform.openai.com/playground/chat?models=gpt-4o

I think they police the API more to discourage automated recognition that might breach privacy. That could harm its performance. That’s been my impression comparing ChatGPT with the API anyway. They definitely behave differently.

You will need a system prompt (aka developer role message) specifically aligning the AI with what vision task it CAN do.

OpenAI injects a whole bunch of prompt text about what the AI cannot do with vision as the very first system message text when an image is added. It comes before anything you can write, which produces denials and degrades all API models. Your system prompt text is now “btw, additional info”. You have to counter this image-refusal predisposition (and OpenAI outright lying to the AI about its capabilities) with useful confidence-instilling language about the task.

Do not use gpt-4o-mini for API images, unless you like paying TWICE AS MUCH per image vs gpt-4o.

Thank you for the valuable inputs. I tried many changes and watched the results… Finally I was able (hopefully - testing it on 10 cases) to make gpt-4o work…
The changes made:

  • Initially I was using the “detail” attribute, setting it to “high” - I stopped putting it into the url section.
  • I had an empty system message - I removed it completely.
  • When specifying the gpt-4o model in the request, I found in the response that the API was actually using gpt-4o-2024-08-06 ???. I changed the model version to gpt-4o-2024-11-20 and it started to work… (see the sketch after this list)
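For reference, the change in the request looks roughly like this sketch (the test message is just a placeholder for the real prompt and image parts):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    # Pin the snapshot instead of the bare "gpt-4o" alias,
    # which in my case resolved to gpt-4o-2024-08-06.
    model="gpt-4o-2024-11-20",
    messages=[
        {"role": "user", "content": "test message (real prompt and image parts go here)"}
    ],
)

# The response reports which snapshot actually served the request.
print(response.model)
```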

I will probably add the system message back now, using your recommendation.

Anyway, I do not understand the last sentence about the pricing. On the OpenAI pricing page they list 4o at 2.5 USD/1M tokens and 4o-mini at 0.15 USD/1M tokens.

Here is a “system” role message that will leave the AI asking for missing images instead of denying it even has the ability:

“You are a GPT-4 vision AI model for image analysis. You can view images. The user can include images with their messages as input. Expect to receive and analyze attached images as your primary function.”

Then a user can simply ask “describe image contents”, and will even receive messages like “I didn’t get any images” instead of “I do not have the ability”.
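In the request it sits like this (a sketch; the user text is just the simple ask above, and the image parts would be attached as in the earlier payload example):

```python
from openai import OpenAI

client = OpenAI()

system_text = (
    "You are a GPT-4 vision AI model for image analysis. You can view images. "
    "The user can include images with their messages as input. "
    "Expect to receive and analyze attached images as your primary function."
)

response = client.chat.completions.create(
    model="gpt-4o-2024-11-20",
    messages=[
        {"role": "system", "content": system_text},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "describe image contents"},
                # image_url part(s) go here, as in the earlier payload sketch
            ],
        },
    ],
)
print(response.choices[0].message.content)
```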


OpenAI had multiplied the cost of images sent to gpt-4o-mini. The scalar is 33x that of gpt-4o, 16.6x that of gpt-4o-2024-05-13. They actually multiplied the token count billed, so there was no “bargain vision”.
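To make “twice as much” concrete, rough arithmetic under those numbers (85 base tokens for a detail:low image is an estimate, and the 33x scalar is applied to the billed token count):

```python
# Rough per-image cost comparison under the 33x token multiplication described above.
# Prices are USD per input token; 85 base tokens per detail:low image is an estimate.
GPT4O_PRICE = 2.50 / 1_000_000
GPT4O_MINI_PRICE = 0.15 / 1_000_000

image_tokens_4o = 85
image_tokens_mini = 85 * 33  # billed token count multiplied for gpt-4o-mini

print(f"gpt-4o:      ${image_tokens_4o * GPT4O_PRICE:.6f} per image")         # ~$0.0002
print(f"gpt-4o-mini: ${image_tokens_mini * GPT4O_MINI_PRICE:.6f} per image")  # ~$0.0004, about 2x
```

So even though the per-token rate is cheaper, the per-image price ends up roughly double.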

It seems today, though, that this behavior has changed, at least in terms of the reported input token consumption that comes with the usage object in the API call. Here I am sending with “detail”: “low”:

input token cost - gpt-4o-2024-11-20

| images | 0  | 1   | 2   | 3   |
|--------|----|-----|-----|-----|
| gpt-4o | 55 | 362 | 428 | 494 |
| delta  | -  | 307 | 81  | 66  |

input token cost - gpt-4o-mini

| images      | 0  | 1   | 2   | 3   |
|-------------|----|-----|-----|-----|
| gpt-4o-mini | 55 | 362 | 428 | 494 |
| delta       | -  | 307 | 81  | 66  |
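The figures above were read from the usage object returned with each call; a rough sketch of that kind of measurement (it won’t reproduce the exact numbers, since the prompt and the random test images differ):

```python
import base64
import io

import numpy as np
from openai import OpenAI
from PIL import Image

client = OpenAI()

def random_image_part():
    """Build one detail:low image part from locally generated random pixels."""
    pixels = np.random.randint(0, 256, (512, 512, 3), dtype=np.uint8)
    buf = io.BytesIO()
    Image.fromarray(pixels).save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode("utf-8")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:image/png;base64,{b64}", "detail": "low"},
    }

for n_images in range(4):  # 0, 1, 2, 3 attached images
    content = [{"type": "text", "text": "Reply with OK."}]
    content += [random_image_part() for _ in range(n_images)]
    response = client.chat.completions.create(
        model="gpt-4o-2024-11-20",
        messages=[{"role": "user", "content": content}],
    )
    print(n_images, "images ->", response.usage.prompt_tokens, "input tokens")
```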

One would need to check the billed cost of a call in isolation, using the platform “usage” page (and with “free 10m tokens with training” turned off), to see if they have finally reduced the cost of gpt-4o-mini to something that makes sense instead of enforcing some cost policy.

(Curiously, image #3 adds fewer tokens than image #2, although each is a unique image of random noise.)

What has also changed, however, is that OpenAI is billing YOU for text injection THEY are sending to the AI model: a bunch of text about not identifying people. A detail:low image should be 75-85 tokens (more compressed on newer models), yet observe the huge input-cost jump above from including just one image.
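Concretely, from the table above: the first image adds 362 - 55 = 307 input tokens. If the image itself accounts for roughly 85 of those, something on the order of 220 tokens of injected text is apparently being billed along with it (a rough estimate from the reported usage, not a confirmed figure). The second and third images add only 81 and 66 tokens, close to the expected per-image cost.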

(270 tokens of “output” came from the model repeating back the system message.)