Hi there,
I’m playing around with incorporating the computer use API into a Chrome extension. I’m getting responses from the API, but the coordinates I get back (e.g. for a click action) don’t correspond to my viewport.
As an example, I am sending a request like the following:
{
  "model": "computer-use-preview",
  "tools": [
    {
      "type": "computer_use_preview",
      "display_width": 810,
      "display_height": 812,
      "environment": "browser"
    }
  ],
  "truncation": "auto",
  "previous_response_id": "resp_67d...",
  "input": [
    {
      "call_id": "call_EUo...",
      "type": "computer_call_output",
      "output": {
        "type": "input_image",
        "image_url": "data:image/png;base64,iVBORw0KGgoAA..."
      }
    }
  ]
}
The viewport is 810x812, and I have confirmed that the image I am sending contains only the viewport (I have DevTools docked to the side, which is why it is roughly square).
The action I am getting back in the response is then something like:
{
  "type": "computer_call",
  "id": "cu_67d5...",
  "call_id": "call_iKoO...",
  "action": {
    "type": "click",
    "button": "left",
    "x": 968,
    "y": 181
  },
  "pending_safety_checks": [],
  "status": "completed"
}
That is, the x coordinate (968) is beyond the viewport width (810), and the y coordinate also does not correspond to the intended target.
Is there some additional translation that I need to take into account for the coordinates? Or am I doing something else wrong?
I don’t see anything documented about this, and the sample in the docs seems to just pass the x, y coordinates through as received.
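To make the question concrete, this is the kind of translation I could add on my side. It is only a sketch and rests on an unconfirmed assumption: that the returned coordinates are expressed in the pixel space of the screenshot rather than in CSS pixels of the viewport. The names scaleToViewport, Size and Point are my own hypothetical helpers, not part of the API.

```typescript
// Hypothetical helper types (not from the API).
interface Size { width: number; height: number; }
interface Point { x: number; y: number; }

// Maps a point from screenshot pixel space to viewport (CSS pixel) space.
// ASSUMPTION: the model's coordinates refer to the screenshot's own pixel grid.
function scaleToViewport(point: Point, image: Size, viewport: Size): Point {
  if (image.width <= 0 || image.height <= 0) {
    throw new RangeError("image dimensions must be positive");
  }
  return {
    x: Math.round((point.x * viewport.width) / image.width),
    y: Math.round((point.y * viewport.height) / image.height),
  };
}

// Illustrative values only: a screenshot captured at twice the viewport size.
const scaled = scaleToViewport(
  { x: 1000, y: 200 },
  { width: 1620, height: 1624 },
  { width: 810, height: 812 },
);
console.log(scaled);
```

If the screenshot and the viewport have the same dimensions, this is a no-op, so it would only matter if the captured image is actually larger or smaller than the 810x812 I am reporting.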
Many thanks,
Emile