How are the computer use API coordinates calculated?

Hi there,

I’m playing around with incorporating the computer use API into a Chrome extension. I’m getting responses from the API, but I’m having issues where the coordinates I get back from e.g. a click event don’t correspond to my viewport.

As an example, I am sending a request as follows:

{
    "model": "computer-use-preview",
    "tools": [
        {
            "type": "computer_use_preview",
            "display_width": 810,
            "display_height": 812,
            "environment": "browser"
        }
    ],
    "truncation": "auto",
    "previous_response_id": "resp_67d...",
    "input": [
        {
            "call_id": "call_EUo...",
            "type": "computer_call_output",
            "output": {
                "type": "input_image",
                "image_url": "data:image/png;base64,iVBORw0KGgoAA..."
            }
        }
    ]
}

The viewport is 810x812, and I have confirmed that the image I am sending is only of the viewport (I have devtools open to the side, hence why it is roughly square).
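For reference, capturing just the viewport from the extension looks roughly like this (a simplified sketch; Manifest V3, promise-based API, error handling omitted):

// chrome.tabs.captureVisibleTab returns a data URL of the visible
// viewport of the active tab, which is what gets sent as input_image.
async function captureViewport(): Promise<string> {
  const dataUrl = await chrome.tabs.captureVisibleTab({ format: "png" });
  // Worth checking: on HiDPI displays the capture is in physical pixels,
  // which can be larger than the CSS viewport dimensions sent to the API.
  return dataUrl; // "data:image/png;base64,iVBORw0KGgoAA..."
}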

The action I am getting back in the response is then something like:

{
    "type": "computer_call",
    "id": "cu_67d5...",
    "call_id": "call_iKoO...",
    "action": {
        "type": "click",
        "button": "left",
        "x": 968,
        "y": 181
    },
    "pending_safety_checks": [],
    "status": "completed"
}

i.e. the x coordinate (968) is beyond the viewport width (810), and the y coordinate also does not correspond to the intended target.

Is there some additional translation that I need to take into account for the coordinates? Or am I doing something else wrong?

I don’t see anything documented about this, and the sample in the docs seems to just pass the x, y coordinates through as received.

Many thanks,

Emile

Welcome back!

Neither OpenAI nor Azure seems to deem my accounts worthy of access to computer use, so I can only armchair quarterback with earplugs and a blindfold here.

You’re right, the documentation doesn’t mention anything about having to normalize coordinates, unlike what’s common with other models (which report coordinates out of 1000, 100, or 1), so it’s even more difficult to debug.

Sometimes the models just hallucinate though.

Stuff you can try (based on experience with other pointing VLMs):

  1. Try to stick with the dimensions in the example (https://platform.openai.com/docs/guides/tools-computer-use#1-send-a-request-to-the-model):
       "display_width": 1024,
       "display_height": 768,
  2. Improve your prompt
  • Sometimes there’s ambiguity in what it needs to select, and a more coherent prompt can help here.
  3. Try to avoid illusions/distractors
  • Sometimes there are features in images that seem to be able to ‘shift’ a model’s perspective, for lack of a better word. This manifests in all coordinates being scaled and rotated in a way you wouldn’t expect. It sometimes happens with slightly tilted images, framed images, and images with features that induce some perspective. I don’t know how susceptible CU is to that, but its documented limitations (https://platform.openai.com/docs/guides/tools-computer-use#limitations), especially its unsuitability for real-world tasks, suggest it will likely struggle with this.

  • Sometimes you can get around this by sampling a bunch of known points (which you could perhaps extract with CV) and solving for the transformation; see the sketch below.
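
Here’s a rough sketch of that last idea, assuming the distortion is only a per-axis scale plus offset (a full affine fit would be needed if rotation is actually involved); all names here are illustrative:

// Fit x' = a*x + b per axis via least squares from sampled point pairs:
// model coordinates (what the API returned) vs. true coordinates
// (where the known UI elements actually are).
interface Point { x: number; y: number; }

function fitAxis(src: number[], dst: number[]): { a: number; b: number } {
  const n = src.length;
  const meanS = src.reduce((s, v) => s + v, 0) / n;
  const meanD = dst.reduce((s, v) => s + v, 0) / n;
  let cov = 0;
  let varS = 0;
  for (let i = 0; i < n; i++) {
    cov += (src[i] - meanS) * (dst[i] - meanD);
    varS += (src[i] - meanS) ** 2;
  }
  const a = cov / varS; // needs at least two distinct sample values
  return { a, b: meanD - a * meanS };
}

function fitTransform(model: Point[], truth: Point[]): (p: Point) => Point {
  const fx = fitAxis(model.map(p => p.x), truth.map(p => p.x));
  const fy = fitAxis(model.map(p => p.y), truth.map(p => p.y));
  return (p) => ({ x: fx.a * p.x + fx.b, y: fy.a * p.y + fy.b });
}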


Do you think you could share your screenshot and instruction though, so someone else can try to reproduce this particular bug?

Thanks for the help @Diet - unfortunately I’m not necessarily in control of the viewport dimensions in this case, so using the same dimensions from the sample won’t work.

However, I took a guess that the response coordinates were for a standardized 1024x1024 image, and it looks like that was correct: translating from those dimensions to my viewport is producing the expected result now.
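
In case it helps anyone else, the translation is just a linear rescale from the assumed 1024x1024 model space to the actual viewport. A minimal sketch (the 1024x1024 model space is my guess, not something documented):

// Map a click from the model's assumed 1024x1024 coordinate space
// onto the actual viewport (810x812 in my case).
const MODEL_SPACE = { width: 1024, height: 1024 }; // guessed, not documented

interface Point { x: number; y: number; }

function toViewport(p: Point, viewport: { width: number; height: number }): Point {
  return {
    x: Math.round(p.x * (viewport.width / MODEL_SPACE.width)),
    y: Math.round(p.y * (viewport.height / MODEL_SPACE.height)),
  };
}

// With the action above: toViewport({ x: 968, y: 181 }, { width: 810, height: 812 })
// gives { x: 766, y: 144 }, which lands inside the viewport as expected.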

Great stuff!

Unfortunately, it’s quite possible that this can change at any moment without notice :confused:
