Vision is creating completely made-up answers

I am making requests to the chat completions endpoint with the gpt-4-vision-preview model, asking for a description of an image. Whether I send base64 or a URL, the response I get is made up from whole cloth; it bears no resemblance to the image. It generates wildly different hallucinations about what is in the image from one request to the next, but if the URL contains words, the responses seem to be invented from those words.
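For reference, here is roughly the shape I'm sending in each case (a minimal sketch of just the image content parts; the file path and example URL are placeholders):

import base64

# URL form: the API fetches the image itself
url_part = {
    "type": "image_url",
    "image_url": {"url": "https://example.com/image.png"},  # placeholder URL
}

# base64 form: embed the image bytes as a data URI
with open("screenshot.png", "rb") as f:  # placeholder path
    b64 = base64.b64encode(f.read()).decode("utf-8")
b64_part = {
    "type": "image_url",
    "image_url": {"url": f"data:image/png;base64,{b64}"},
}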

For example, https://support.content.office.net/en-us/media/4c10ecfd-3008-4b00-9f98-d41b6f899c2d.png is just an image of a blue screen of death from a Microsoft support page. Vision apparently picks up the words “Microsoft” and “Office” from the URL and confidently tells me this:

The image you’ve provided is a screenshot of the Microsoft Excel program. It shows a spreadsheet with a focus on cells A1 to D6. The cells contain the following data:

  • Row 1 has headers: “Date”, “Item”, “Cost”, and “Sales”.
  • Rows 2 to 6 have various dates in the “Date” column, different items listed such as “Widgets” and “Gadgets” in the “Item” column, numerical values in the “Cost” column, and numerical values in the “Sales” column.
The “Cost” and “Sales” columns are formatted as currency. The cell D6, which is the bottom cell in the “Sales” column, has a blue border indicating it might be selected or active. There are no visible formulas, and the ribbon at the top shows the “Home” tab, suggesting that no specific functions or formatting options are currently in use.

I have tried prompting it to report only what is in the image, and so on. I have also varied the temperature with no discernible effect. It just keeps producing similarly detailed nonsense, sometimes about different Microsoft Office applications.


What happens when you stipulate a low temperature? Does the answer improve?
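Something like this, for example (a sketch; messages stands in for your existing request body):

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=messages,  # your existing messages list (assumed)
    temperature=0,      # 0.0-0.2 is a typical "low" setting
    max_tokens=1000,
)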


Welcome @kbkev78

In my test with your image, it replies perfectly with:

The image displays a computer screen showing the Blue Screen of Death (BSOD), which is a common error screen displayed on a Windows operating system after a system crash. The screen has a sad face emoticon at the top and a message stating, ‘Your device ran into a problem and needs to restart. We’re just collecting some error info, and then we’ll restart for you.’ Below the message is a progress indicator showing ‘0% complete’ and a QR code that likely links to more information about the error.

Here’s the code I used:

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            # System message: ask for the description as JSON
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "Respond with a description of the image in json format with a key 'description'",
                },
            ],
        },
        {
            # User message: attach the image as a URL content part
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://support.content.office.net/en-us/media/4c10ecfd-3008-4b00-9f98-d41b6f899c2d.png"
                    },
                },
            ],
        },
    ],
    max_tokens=1000,
)

print(response)
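If you only want the text rather than the whole response object, it's on the first choice:

print(response.choices[0].message.content)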

Thank you for the prompt replies. This was my bad: my request was malformed. I was serialising the “content” JSON before adding it to the message. Now that I’m making the request properly, it’s behaving as expected.

Perhaps a better response from the API would have been to throw an error when my request came in like this:

{
  "model": "gpt-4-vision-preview",
  "messages": [
    {
      "role": "system",
      "content": "You are an AI assistant powered by GPT-4 with computer vision.\r\nAI knowledge cutoff: April 2023\r\n\r\nBuilt-in vision capabilities:\r\n- extract text from image\r\n- describe images\r\n- analyze image contents"
    },
    {
      "role": "user",
      "content": "[{\"type\":\"text\",\"text\":\"Describe this image. The filename does not indicate the image content.\"},{\"type\":\"image_url\",\"image_url\":{\"url\":\"https://support.content.office.net/en-us/media/4c10ecfd-3008-4b00-9f98-d41b6f899c2d.png\"}}]"
    }
  ],
  "max_tokens": 1000,
  "temperature": 0.1
}

The content property here is a JSON string rather than an array of content parts, so it lacks the structure required for an image request; I guess the API just treated the serialised string as plain prompt text.
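In other words, I was effectively building the first of these messages instead of the second (a reconstruction of my bug; variable names are just for illustration):

import json

content = [
    {"type": "text", "text": "Describe this image. The filename does not indicate the image content."},
    {
        "type": "image_url",
        "image_url": {"url": "https://support.content.office.net/en-us/media/4c10ecfd-3008-4b00-9f98-d41b6f899c2d.png"},
    },
]

# Wrong: serialising the parts collapses them into one plain-text string,
# so the API never sees an image part, only text that mentions a URL.
bad_message = {"role": "user", "content": json.dumps(content)}

# Right: pass the list of content parts directly.
good_message = {"role": "user", "content": content}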


Sorry to jump into your thread; I can create a separate one if you wish. I am struggling with calling my assistant with the URL. My assistant is supposed to analyse the image at the URL and return data in a specific format about what it “sees”. Using the above code just calls ChatGPT, so how can I specify that it should use my assistant’s ID instead? TIA.

The assistants endpoint cannot be used with the gpt-4-vision-preview model, nor does it have internet access.

ChatGPT is the name of OpenAI’s web chatbot, not an API product.

So the only way to give a chatbot of your own that is built with the Assistants framework vision capability via gpt-4-vision-preview is to write a rather awkward tool function that makes a separate Chat Completions API call.
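A rough sketch of that workaround (the describe_image function and tool definition are hypothetical; run polling and error handling are omitted):

from openai import OpenAI

client = OpenAI()

def describe_image(url: str) -> str:
    """Make a separate Chat Completions call to the vision model."""
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image."},
                    {"type": "image_url", "image_url": {"url": url}},
                ],
            }
        ],
        max_tokens=500,
    )
    return response.choices[0].message.content

# Declare the function as a tool when creating the assistant. When a run
# stops with status "requires_action", call describe_image() yourself and
# send the result back via submit_tool_outputs.
describe_image_tool = {
    "type": "function",
    "function": {
        "name": "describe_image",
        "description": "Describe the image found at a URL",
        "parameters": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"],
        },
    },
}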
