Getting GPT Vision To Return Coordinates

Vision has been doing a really good job of describing the images I provide, but I'd also like it to give me back the x/y coordinates of specific things in the image, so that I can use JavaScript to create a selection around the data the vision model found.

To test this I have an image that contains a list of items, with a small icon on the left side of each item. I've tried the following prompt; it gives me back coordinates, but they aren't what I'm looking for.

“Each item on the list on the left contains a small icon on the left side. Give me the list of x,y coordinates of the top left and bottom right corner of each of these icons and return it as a json array”

6 Likes

That is not something that GPT-4 Vision can do on its own.

3 Likes

I tried something similar, then learned that the Vision API does not actually receive the original image. It gets a twice-resized image, which throws the coordinates off horribly.

For more information on the resizing behavior, see the ‘Calculating costs’ section at https://platform.openai.com/docs/guides/vision.
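
If you still want to experiment, you will at least need to map any coordinates the model returns on the resized image back onto your original. A minimal sketch, assuming the high-detail resizing described in that section (fit within a 2048×2048 square, then scale the shortest side down to 768 px); the helper is my own approximation, not an official formula:

```python
from PIL import Image

def resized_dimensions(width, height):
    """Approximate the high-detail resize described in the vision guide:
    fit within a 2048x2048 square, then scale down so the shortest side is 768 px."""
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    scale = min(1.0, 768 / min(width, height))
    return round(width * scale), round(height * scale)

def to_original(x, y, original_size, resized_size):
    """Map a point from the resized image the model saw back onto the original."""
    (ow, oh), (rw, rh) = original_size, resized_size
    return x * ow / rw, y * oh / rh

img = Image.open("items.png")            # hypothetical test image
resized = resized_dimensions(*img.size)  # img.size is (width, height)
print(to_original(100, 250, img.size, resized))
```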

3 Likes

You might try the Segment Anything model instead: https://segment-anything.com/

There’s a great course on it here: https://learn.deeplearning.ai/courses/prompt-engineering-for-vision-models
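
With SAM you get masks and bounding boxes in the original image's pixel coordinates directly, and you can then pass the crops to GPT-4 Vision for labelling. A rough sketch using Meta's segment-anything package (the checkpoint has to be downloaded separately, and the file names here are placeholders):

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load a SAM checkpoint (downloaded separately from the SAM repository).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("items.png"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # one dict per detected region

# Each mask carries a 'bbox' in XYWH pixel coordinates on the original image.
for m in masks:
    x, y, w, h = m["bbox"]
    print(f"top left ({x}, {y}), bottom right ({x + w}, {y + h})")
```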

2 Likes

Hi, I can get back the coordinates of a logo in a PNG image just with prompts.


This is done in a chat session with gpt-4-vision-preview. One box is the ground truth and the other was computed by the AI (or the other way around); the match is perfect. But I don't know how to turn that chat-session experiment into a reliable Python script, because whatever I do, ChatGPT complains that it cannot analyze images through the API yet. Any ideas how to do it? Perhaps other frameworks (Azure OpenAI?) are more flexible or advanced? My goal is to loop over hundreds of images and collect the detected logo objects of many kinds (with coordinates), together with their parent images, into a dataframe. Any suggestions/recipes?
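
You don't need the ChatGPT UI for this: the same gpt-4-vision-preview model is callable through the chat completions API with base64-encoded images, so you can loop over files and collect the responses into a dataframe. A rough sketch (the prompt, the JSON shape, and the parsing are placeholder assumptions of mine, and the coordinates will only be as reliable as the model itself):

```python
import base64, glob, json
import pandas as pd
from openai import OpenAI

client = OpenAI()
rows = []

for path in glob.glob("images/*.png"):
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Return a JSON array of logos in this image, each as "
                         '{"label": ..., "top_left": [x, y], "bottom_right": [x, y]}.'},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )

    try:
        for obj in json.loads(response.choices[0].message.content):
            rows.append({"image": path, **obj})
    except (json.JSONDecodeError, TypeError):
        pass  # the model did not return clean JSON for this image

df = pd.DataFrame(rows)
print(df)
```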

A couple of good threads on making GPT-4o return coordinates! :fire:

  1. Reddit thread (main thread)

  2. Reddit thread (sub-thread with a great discussion)

Has there been any improvement in this area? From what I can find, it seems like it's still not quite reliable.
Though in my case I'm looking to identify the location of text rather than objects.

I found a really clever solution online for this kind of issue (using an LLM to find coordinates on images) that I would like to share with you. The repository is not mine and I have not tested it, but the idea is there, and from the examples presented it seems to work quite well. I'm planning on testing it in the near future…

(Apparently I cannot post external links here, but you can find a repository called GridGPT on GitHub, from a user called quinny1187…)

In summary, the idea is simply a way to give the LLM spatial context/awareness by layering a numbered grid over the image sent to the model.
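
The overlay itself is easy to reproduce. A minimal sketch with Pillow (the grid spacing and label placement are arbitrary choices of mine, not necessarily what GridGPT does):

```python
from PIL import Image, ImageDraw

def add_numbered_grid(path, out_path, step=100):
    """Overlay a labelled grid so the model can answer in grid coordinates
    instead of raw pixel positions."""
    img = Image.open(path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for x in range(0, img.width, step):
        draw.line([(x, 0), (x, img.height)], fill="red")
        draw.text((x + 2, 2), str(x), fill="red")
    for y in range(0, img.height, step):
        draw.line([(0, y), (img.width, y)], fill="red")
        draw.text((2, y + 2), str(y), fill="red")
    img.save(out_path)

add_numbered_grid("items.png", "items_grid.png")
```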

2 Likes