How are the computer use API coordinates calculated?

Hi there,

I’m playing around with incorporating the computer use API into a Chrome extension. I’m getting responses from the API, but I’m having issues where the coordinates I get back from e.g. a click event don’t correspond to my viewport.

As an example, I am sending a request as follows:

{
    "model": "computer-use-preview",
    "tools": [
        {
            "type": "computer_use_preview",
            "display_width": 810,
            "display_height": 812,
            "environment": "browser"
        }
    ],
    "truncation": "auto",
    "previous_response_id": "resp_67d...",
    "input": [
        {
            "call_id": "call_EUo...",
            "type": "computer_call_output",
            "output": {
                "type": "input_image",
                "image_url": "data:image/png;base64,iVBORw0KGgoAA..."
            }
        }
    ]
}

The viewport is 810x812, and I have confirmed that the image I am sending is only of the viewport (I have devtools open to the side, hence why it is roughly square).
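For reference, capturing just the viewport from the extension looks roughly like this (a simplified sketch; Manifest V3, promise-based API, error handling omitted):

// chrome.tabs.captureVisibleTab returns a data URL of the visible
// viewport of the active tab, which is what gets sent as input_image.
async function captureViewport(): Promise<string> {
  const dataUrl = await chrome.tabs.captureVisibleTab({ format: "png" });
  // Worth checking: on HiDPI displays the capture is in physical pixels,
  // which can be larger than the CSS viewport dimensions sent to the API.
  return dataUrl; // "data:image/png;base64,iVBORw0KGgoAA..."
}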

The action I am getting back in the response is then something like:

{
    "type": "computer_call",
    "id": "cu_67d5...",
    "call_id": "call_iKoO...",
    "action": {
        "type": "click",
        "button": "left",
        "x": 968,
        "y": 181
    },
    "pending_safety_checks": [],
    "status": "completed"
}

i.e. the x coordinate (968) is beyond the viewport width (810), and the y coordinate also does not correspond to the intended target.

Is there some additional translation that I need to take into account for the coordinates? Or am I doing something else wrong?

I don’t see anything documented about this, and the sample in the docs seems to just pass the x, y coordinates through as received.

Many thanks,

Emile

Welcome back!

Neither OpenAI nor Azure seems to deem my accounts worthy of access to computer use, so I can only armchair quarterback with earplugs and a blindfold here.

You’re right, the documentation doesn’t mention anything about having to normalize coordinates, unlike what’s common with other models (which report coordinates out of 1000, 100, or 1), so it’s even more difficult to debug.

Sometimes the models just hallucinate though.

Stuff you can try (based on experience with other pointing VLMs):

  1. Try to stick with the dimensions in the example (https://platform.openai.com/docs/guides/tools-computer-use#1-send-a-request-to-the-model):
       "display_width": 1024,
       "display_height": 768,
  2. Improve your prompt
  • Sometimes there’s ambiguity in what it needs to select, and a more coherent prompt can help here.
  3. Try to avoid illusions/distractors
  • Sometimes there are features in images that seem to be able to ‘shift’ a model’s perspective, for lack of a better word. This manifests in all coordinates being scaled and rotated in a way you wouldn’t expect. It sometimes happens with slightly tilted images, framed images, and images with features that induce some perspective. I don’t know how susceptible CU is to that, but its documented limitations (https://platform.openai.com/docs/guides/tools-computer-use#limitations), especially its unsuitability for real-world tasks, suggest it will likely struggle with this.

  • Sometimes you can get around this by sampling a bunch of known points (which you could perhaps extract with CV) and solving for the transformation; see the sketch below.
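
Here’s a rough sketch of that last idea, assuming the distortion is only a per-axis scale plus offset (a full affine fit would be needed if rotation is actually involved); all names here are illustrative:

// Fit x' = a*x + b per axis via least squares from sampled point pairs:
// model coordinates (what the API returned) vs. true coordinates
// (where the known UI elements actually are).
interface Point { x: number; y: number; }

function fitAxis(src: number[], dst: number[]): { a: number; b: number } {
  const n = src.length;
  const meanS = src.reduce((s, v) => s + v, 0) / n;
  const meanD = dst.reduce((s, v) => s + v, 0) / n;
  let cov = 0;
  let varS = 0;
  for (let i = 0; i < n; i++) {
    cov += (src[i] - meanS) * (dst[i] - meanD);
    varS += (src[i] - meanS) ** 2;
  }
  const a = cov / varS; // needs at least two distinct sample values
  return { a, b: meanD - a * meanS };
}

function fitTransform(model: Point[], truth: Point[]): (p: Point) => Point {
  const fx = fitAxis(model.map(p => p.x), truth.map(p => p.x));
  const fy = fitAxis(model.map(p => p.y), truth.map(p => p.y));
  return (p) => ({ x: fx.a * p.x + fx.b, y: fy.a * p.y + fy.b });
}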


Do you think you could share your screenshot and instruction though, so someone else can try to reproduce this particular bug?

Thanks for the help @Diet - unfortunately I’m not necessarily in control of the viewport dimensions in this case, so using the same dimensions from the sample won’t work.

However, I took a guess that the response coordinates were for a standardized 1024x1024 image, and it looks like that was correct: translating from those dimensions to my viewport is producing the expected result now.
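
In case it helps anyone else, the translation is just a linear rescale from the assumed 1024x1024 model space to the actual viewport. A minimal sketch (the 1024x1024 model space is my guess, not something documented):

// Map a click from the model's assumed 1024x1024 coordinate space
// onto the actual viewport (810x812 in my case).
const MODEL_SPACE = { width: 1024, height: 1024 }; // guessed, not documented

interface Point { x: number; y: number; }

function toViewport(p: Point, viewport: { width: number; height: number }): Point {
  return {
    x: Math.round(p.x * (viewport.width / MODEL_SPACE.width)),
    y: Math.round(p.y * (viewport.height / MODEL_SPACE.height)),
  };
}

// With the action above: toViewport({ x: 968, y: 181 }, { width: 810, height: 812 })
// gives { x: 766, y: 144 }, which lands inside the viewport as expected.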

Great stuff!

Unfortunately, it’s quite possible that this can change at any moment without notice :confused:
