The scope of the "V" in GPT-4V - what is missing?

I am a bit confused about the scope and extent of the vision component of ChatGPT. Yesterday I tried to get bounding boxes drawn around objects in an image. This was not successful, and I noticed that no object-detection libraries (YOLO, for example) were called upon. And yet object detection is basic computer vision. Simple operators such as dilation and the 2D FFT seemed to work, though. Is there a reason detection in particular is missing?
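For reference, the operators that did work are simple enough to run with NumPy/SciPy in the code-interpreter sandbox, no detection model required. A minimal sketch (the tiny impulse image is made up for illustration):

```python
import numpy as np
from scipy import ndimage

# Tiny binary test image: a single bright pixel in the middle
img = np.zeros((5, 5))
img[2, 2] = 1.0

# Morphological dilation with a 3x3 structuring element:
# the single pixel grows into a 3x3 block of ones
dilated = ndimage.grey_dilation(img, size=(3, 3))

# 2D FFT: the spectrum of a unit impulse has flat magnitude
spectrum = np.fft.fft2(img)

print(dilated.sum())                        # 9.0
print(np.allclose(np.abs(spectrum), 1.0))   # True
```

Operations like these are pure array math, which is why they succeed where grounded detection (which needs a trained model with localization outputs) does not.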

The AI doesn’t do grounding: it can’t tie what it describes back to pixel coordinates in the image.

No drawing boxes around all the suspicious-looking men.

It has an excellent ability to describe what it sees. Not too many other vision AIs could identify a water slide on a cruise ship pool.

Microsoft Azure uses GPT-4-Vision in concert with other vision products of their own to make a combined product.
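Microsoft's actual pipeline isn't public, but the general pattern of such a combined product is easy to sketch: a conventional detector supplies grounded boxes, and the vision-language model describes each region. Here is a hedged sketch of just the glue step, with made-up box coordinates and the detector/VLM calls themselves omitted:

```python
import numpy as np

def crop_detections(image, boxes):
    """Crop each detector-supplied box (x1, y1, x2, y2) out of the image.

    In a combined pipeline, each crop (plus its coordinates) would then
    be sent to a vision-language model for a grounded description.
    """
    return [image[y1:y2, x1:x2] for (x1, y1, x2, y2) in boxes]

# Synthetic example: an 8x8 "image" and two made-up detector boxes
image = np.arange(64).reshape(8, 8)
boxes = [(0, 0, 4, 4), (4, 4, 8, 8)]
crops = crop_detections(image, boxes)
print([c.shape for c in crops])  # [(4, 4), (4, 4)]
```

The grounding comes entirely from the detector side; GPT-4-Vision only contributes the descriptive text.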


Thanks for this info. Is there a good place where we can read about “grounding”? I thought that the “V” meant the AI would excel at answering vision questions (with object detection being a good example of such a vision task).
