GPT API cannot get image coordinates right

Hey, so I am working on a project where I need to extract a part of an image. I am using the GPT models (4.1, 5) to achieve this. I send my base64-encoded image through the API and ask it to locate a certain part of the image (a photo) and send it back to me as coordinates (x1,y1,x2,y2).

When I try this in the web chat environment, it always returns perfect coordinates that contain exactly what I want.
When I try this through the API, the coordinates are always off. The sub-image always contains some extra text or just empty space on some sides; it doesn’t seem to be able to do it well.

Why is this? Why is there such a quality difference between the web chat environment and the API? Is the API somehow modifying the base64 image in a way that messes up the quality of the output?

Try pasting your question into ChatGPT - it gave some quite good answers. Here’s the best one, I think:

Ask for normalized coordinates and rescale yourself

  1. In your prompt, explicitly define the coordinate space and format:

“Return bounding box as normalized floats in [0,1]: {x_min, y_min, x_max, y_max} relative to the image width/height (origin top-left). No other text.”
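The "rescale yourself" half of that suggestion could look roughly like the sketch below. It assumes the model replies with a JSON object using the keys from the prompt above (x_min, y_min, x_max, y_max); the function name to_pixel_box and the pad parameter are made up for illustration and are not part of any API. The width and height should be those of the original file you crop from, not of any resized copy.

```python
import json
import math

def to_pixel_box(norm_box, width, height, pad=0.0):
    """Map a normalized box (values in [0, 1], origin top-left) to
    integer pixel coordinates (left, top, right, bottom)."""
    # Tolerate a reply where min and max were swapped.
    x0, x1 = sorted((norm_box["x_min"], norm_box["x_max"]))
    y0, y1 = sorted((norm_box["y_min"], norm_box["y_max"]))

    # Optional safety margin, expressed as a fraction of the image size.
    x0, y0, x1, y1 = x0 - pad, y0 - pad, x1 + pad, y1 + pad

    def clamp(v):
        # Keep values inside the image even if the model overshoots.
        return min(1.0, max(0.0, v))

    # Round outward so the crop never cuts into the requested region.
    left = math.floor(clamp(x0) * width)
    top = math.floor(clamp(y0) * height)
    right = math.ceil(clamp(x1) * width)
    bottom = math.ceil(clamp(y1) * height)
    return left, top, right, bottom

# Hypothetical model reply, for illustration only.
reply = '{"x_min": 0.25, "y_min": 0.125, "x_max": 0.75, "y_max": 0.5}'
box = to_pixel_box(json.loads(reply), width=1200, height=800)
print(box)  # (300, 100, 900, 400)
```

If you use Pillow, the resulting tuple is in the (left, upper, right, lower) order that Image.crop expects, so it can be passed straight through. A small pad (say 0.01) may help if the crops keep clipping the edges, at the cost of a little extra border.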

Sadly, this does not work very well either; it is still clipping the images. I suspect the ChatGPT web service has some behind-the-scenes vision processing that is simply not available through the API, which is a massive shame.