Inconsistencies in Image Analysis with GPT-4o-mini Using Low Detail

Hello OpenAI community,

I’m currently working on a project using the GPT-4o-mini API to analyze images. I’ve noticed inconsistencies in the model’s ability to analyze certain images, and I’m hoping to get some clarification and advice from the community.

Context

  • Model used: GPT-4o-mini
  • Task: Image analysis from URLs
  • Method: Using the Chat Completions API with the image_url parameter
  • Detail level: Low (as specified in the API call)

Observed Problem

Some images are successfully analyzed, while others generate a response indicating that the model is not able to analyze images.

Here is an example: “I can’t view or analyze images directly, but if you provide details or text from the image, I can help you understand or summarize that information.”

What’s intriguing is that:

  1. The same image that fails to be analyzed on one attempt can be analyzed successfully on another.
  2. The same code works perfectly with a more capable model (GPT-4o), without any analysis issues, also using low detail.

Example API Call

from openai import OpenAI

client = OpenAI()  # API key is read from the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What are the info on this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": image_url,  # publicly accessible image URL
                        "detail": "low",
                    },
                },
            ],
        }
    ],
    max_tokens=500,
)

Questions

  1. Are there specific limitations for GPT-4o-mini in terms of image size, format, or complexity when using low detail that I should be aware of?
  2. Are there any best practices for optimizing image analysis with this particular model and detail level?
  3. Are there any known issues or limitations with GPT-4o-mini regarding image analysis at low detail that could explain this inconsistent behavior?

Any information or advice would be greatly appreciated. If additional details are needed, I’d be happy to provide them.

Thank you in advance for your help!

The “o” model has multimodal capabilities that have not been released.

It is also trained to deny those abilities (for example, it won’t output image or speech tokens), and that refusal behavior spills over.

So it tends to refuse, and its understanding is limited. You need a system prompt that tells the AI it has built-in computer vision, that its image examination skill is enabled, and so on, to defeat the refusals and denials (see the sketch further down). Being small, it also preserves less factual training data.

At detail: low, an image is encoded into just a handful of tokens, with no additional tiling after it, so that may simply not be enough to grab the mini model’s attention. The image is also resized to a maximum of 512 pixels at that detail level, so there may be little meaning left to extract from some images.
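If it helps as a diagnostic, here is a rough sketch that sends the same image once at detail “low” and once at “high” and compares the answers. It assumes the current openai Python SDK (>= 1.0); the helper name and example URL are just placeholders:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_about_image(image_url: str, detail: str) -> str:
    # Same structure as the call in the question, with the detail level as a parameter
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe everything you can read or see in this image."},
                    {"type": "image_url", "image_url": {"url": image_url, "detail": detail}},
                ],
            }
        ],
        max_tokens=500,
    )
    return response.choices[0].message.content

for detail in ("low", "high"):
    print(f"--- detail={detail} ---")
    print(ask_about_image("https://example.com/sample.png", detail))  # placeholder URL

If “high” consistently succeeds on an image where “low” fails, the downscaling is probably the culprit rather than the refusal training.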

When you ask, try not “What are the info on this image?” but rather something like “From the attached image, using your own computer vision skill, extract all the information available: the text or a description of the contents”, or whatever you expect.

Then, as the system prompt of a specialist, how about: “You are Look-o, an AI with image analysis capabilities built-in and enabled.”
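Putting both together, a minimal sketch (again assuming the current openai Python SDK, with image_url defined as in your example):

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            # System prompt asserting that the vision capability is present and enabled
            "role": "system",
            "content": "You are Look-o, an AI with image analysis capabilities built-in and enabled.",
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "From the attached image, using your own computer vision skill, "
                        "extract all the information available: the text or a description of the contents."
                    ),
                },
                {"type": "image_url", "image_url": {"url": image_url, "detail": "low"}},
            ],
        },
    ],
    max_tokens=500,
)
print(response.choices[0].message.content)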

If gpt-4o-mini were satisfactory all the time, there would be no reason to upgrade to a more expensive, higher-quality model (like gpt-4o).
