How good should the vision / picture understanding be?

Just trying to figure out what I can expect quality wise. I had Dall-e generate a 1024 x1024 picture of a park by the water with 2 sailboats. When I send back the image and ask how many sailboats there are it keeps saying 3 (there are clearly only 2). There are 3 old people in the picture, but it only detects two. There is a girl behind two people who are playing chess… it says she is watching them but she is clearly not. It says a child is flying a kite… no child is flying a kite… though there is a kite in the shape of a balloon and gpt4 identified that as both the kite and a hot air balloon. It says 2 children are playing with a dog… nope. It says a woman is practicing yoga… nothing like that in the picture. Overall it is not getting many things correct.

It also looks like I have to upload the image with each request. I had hoped to make followup requests of the same image without paying to process it for each question.

When you ask what?

ChatGPT that just generated a picture for you? It cannot ingest images except by user upload. It will make up stuff about the image from the prompt and from what it sent.

An API language model that you send the image that was created to? If you are using one of the two AI models that can accept an image, attaching the image to a message either as base64 or as a URL where you have uploaded it yourself, deciding if you want high or low detail, then the AI might have a chance of analyzing the picture correctly.

As far as counting things, that’s one of the shortcomings of the GPT_4_vision, although if presenting similar things, it can count up to around five before it goes wonky.


Image:

brochure

gpt-4-turbo API, system programmed and prompted to perform a vision task:

There are five humans in the image. Here is a short description of each, starting from the left:

  1. Caucasian Female: Young, with long brown hair, wearing a purple shirt and a multicolored scarf.
  2. African American Male: Young, with short black hair, wearing a maroon backpack and a purple shirt.
  3. Hispanic Female: Young, with long dark brown hair, wearing a teal blouse.
  4. African American Male: Young, with short curly black hair, wearing a green and purple plaid shirt.
  5. Caucasian Female: Young, with long blonde hair, wearing a pink jacket and a pink scarf.
    [elapsed: 8.18 seconds]
1 Like

I sent back the image I had generated with Dall-e to ask questions about it using gpt-4-turbo. It kept bouncing back and forth detecting two old people on the left or the two old people on the right but it would not identify all three as such. I would have thought counting was much better, machine learning can identify and count objects better than what I was seeing. Maybe I need to send more simple pictures to it.

I forgot to include the picture earlier… now attached.