Getting GPT Vision To Return Coordinates

Vision has been doing a really good job of describing the images I provide, but I'd also like it to give me back the x/y coordinates of specific things in the image, so that I can use JavaScript to create a selection around the data the vision model found.

To test this I have an image that contains a list of items, with a small icon on the left side of each item. I've tried the following prompt; it gives me back coordinates, but they aren't what I'm looking for.

“Each item on the list on the left contains a small icon on the left side. Give me the list of x,y coordinates of the top left and bottom right corner of each of these icons and return it as a json array”

6 Likes

That is not something that GPT-4 Vision can do on its own.

3 Likes

I tried something similar, then learned that the Vision API does not actually receive the original image. It gets a twice-resized image, which throws the coordinates off horribly.

For more information on the resizing behavior, see the ‘Calculating costs’ section at https://platform.openai.com/docs/guides/vision.
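
If you still want to experiment, you will at least need to map any coordinates the model returns on the resized image back onto your original. A minimal sketch, assuming the high-detail resizing described in that section (fit within a 2048×2048 square, then scale the shortest side down to 768 px); the helper is my own approximation, not an official formula:

```python
from PIL import Image

def resized_dimensions(width, height):
    """Approximate the high-detail resize described in the vision guide:
    fit within a 2048x2048 square, then scale down so the shortest side is 768 px."""
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    scale = min(1.0, 768 / min(width, height))
    return round(width * scale), round(height * scale)

def to_original(x, y, original_size, resized_size):
    """Map a point from the resized image the model saw back onto the original."""
    (ow, oh), (rw, rh) = original_size, resized_size
    return x * ow / rw, y * oh / rh

img = Image.open("items.png")            # hypothetical test image
resized = resized_dimensions(*img.size)  # img.size is (width, height)
print(to_original(100, 250, img.size, resized))
```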

3 Likes

You might try the Segment Anything model instead: https://segment-anything.com/

There’s a great course on it here: https://learn.deeplearning.ai/courses/prompt-engineering-for-vision-models
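
With SAM you get masks and bounding boxes in the original image's pixel coordinates directly, and you can then pass the crops to GPT-4 Vision for labelling. A rough sketch using Meta's segment-anything package (the checkpoint has to be downloaded separately, and the file names here are placeholders):

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load a SAM checkpoint (downloaded separately from the SAM repository).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("items.png"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # one dict per detected region

# Each mask carries a 'bbox' in XYWH pixel coordinates on the original image.
for m in masks:
    x, y, w, h = m["bbox"]
    print(f"top left ({x}, {y}), bottom right ({x + w}, {y + h})")
```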

2 Likes

Hi, I can get back the coordinates of a logo in a PNG image just with prompts.


This is done in a chat session with gpt-4-vision-preview. One box is the ground truth and the other was computed by the AI (or the other way around); the match is perfect. But I don't know how to turn that chat-session experiment into a reliable Python script, because whatever I do, ChatGPT complains that it cannot analyze images through the API yet. Any ideas how to do it? Perhaps other frameworks (Azure OpenAI?) are more flexible or advanced? My goal is to loop over hundreds of images and collect the detected logo objects of many kinds (with coordinates), together with their parent images, into a dataframe. Any suggestions/recipes?
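
You don't need the ChatGPT UI for this: the same gpt-4-vision-preview model is callable through the chat completions API with base64-encoded images, so you can loop over files and collect the responses into a dataframe. A rough sketch (the prompt, the JSON shape, and the parsing are placeholder assumptions of mine, and the coordinates will only be as reliable as the model itself):

```python
import base64, glob, json
import pandas as pd
from openai import OpenAI

client = OpenAI()
rows = []

for path in glob.glob("images/*.png"):
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Return a JSON array of logos in this image, each as "
                         '{"label": ..., "top_left": [x, y], "bottom_right": [x, y]}.'},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )

    try:
        for obj in json.loads(response.choices[0].message.content):
            rows.append({"image": path, **obj})
    except (json.JSONDecodeError, TypeError):
        pass  # the model did not return clean JSON for this image

df = pd.DataFrame(rows)
print(df)
```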

A couple of good threads on making GPT-4o return coordinates! :fire:

  1. Reddit thread (main thread)

  2. Reddit thread (sub-thread with a great discussion)

Has there been any improvement in this area? From what I can find, it seems like it's still not quite reliable.
Though in my case I'm looking to identify the location of text rather than objects.

I found a really clever solution online for this kind of issue (using an LLM to find coordinates on images) that I would like to share with you. The repository is not mine and I have not tested it, but the idea is there, and from the examples presented it seems to work quite well. I'm planning on testing it in the near future…

(Apparently I cannot post external links here, but you can find a repository called GridGPT on GitHub, from a user called quinny1187…)

In summary, the idea is simply a way to give the LLM spatial context/awareness by layering a numbered grid over the image sent to the model.
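
The overlay itself is easy to reproduce. A minimal sketch with Pillow (the grid spacing and label placement are arbitrary choices of mine, not necessarily what GridGPT does):

```python
from PIL import Image, ImageDraw

def add_numbered_grid(path, out_path, step=100):
    """Overlay a labelled grid so the model can answer in grid coordinates
    instead of raw pixel positions."""
    img = Image.open(path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for x in range(0, img.width, step):
        draw.line([(x, 0), (x, img.height)], fill="red")
        draw.text((x + 2, 2), str(x), fill="red")
    for y in range(0, img.height, step):
        draw.line([(0, y), (img.width, y)], fill="red")
        draw.text((2, y + 2), str(y), fill="red")
    img.save(out_path)

add_numbered_grid("items.png", "items_grid.png")
```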

2 Likes