The scope of V in GPTV - what is missing?

paul.fishwick · January 8, 2024, 7:23pm

I am a bit confused about the scope and extent of the vision component of ChatGPT. Yesterday I tried to get bounding boxes drawn around objects in an image. This was not successful, and I noticed that no object-detection libraries (YOLO, for example) were called. And yet object detection is basic computer vision. Simple operators such as dilate and 2D FFT did seem to work, though. Is there a reason detection in particular is not supported?
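For concreteness, the two operators that did work are small array operations. Here is a minimal sketch in plain NumPy, assuming a binary image and a 3x3 square structuring element. The helper name `dilate3x3` is made up for illustration and is not something ChatGPT is known to call.

```python
import numpy as np

def dilate3x3(img):
    """Binary dilation with a 3x3 square structuring element (hypothetical helper)."""
    img = np.asarray(img, dtype=bool)
    # Pad with background so the 3x3 window is defined at the borders.
    padded = np.pad(img, 1, mode="constant", constant_values=False)
    out = np.zeros_like(img)
    h, w = img.shape
    # OR together the nine shifted copies of the image.
    for dy in range(3):
        for dx in range(3):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

# Toy input: a single foreground pixel in the middle of a 5x5 image.
img = np.zeros((5, 5), dtype=bool)
img[2, 2] = True

dilated = dilate3x3(img)  # the pixel grows into a 3x3 block

# 2D FFT with the zero-frequency term shifted to the centre.
spectrum = np.fft.fftshift(np.fft.fft2(img.astype(float)))
magnitude = np.abs(spectrum)
```

Both steps are fixed arithmetic on pixel values. Neither has to decide what an object is or where it sits, which is the part bounding boxes require.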

_j · January 8, 2024, 7:30pm

The AI can't produce grounding: it has no way to tie what it describes to specific locations in the image.

So no drawing boxes around all the suspicious-looking men.

It has an excellent ability to describe what it sees. Not too many other vision AIs could identify a water slide on a cruise ship pool.

Microsoft Azure uses GPT-4-Vision in concert with other vision products of their own to make a combined product.

paul.fishwick · January 9, 2024, 10:01pm

Thanks for this info. Is there a good place where we can read about “grounding”? I thought that “V” meant that the AI would excel at answering “V” questions (with object detection being a good example of Image->Image in “V”).

_j · January 9, 2024, 10:06pm
