Building a Java desktop computer-use agent with the OpenAI Responses API

Hi everyone,

I’m developing a desktop agent using Java/JavaFX and the official openai-java library. My goal is to create a “self-operating computer” that can perform local tasks based on user prompts.

My Current Setup:

  • Architecture: My Java application takes a screenshot of the desktop, sends it to a vision model (e.g., GPT-4o), and receives the next action to execute.

  • Local Tools: I have a set of local tools built in Java (using java.awt.Robot) that execute the model’s decisions, such as CLICK_TOOL, TYPE_TOOL, and SCROLL_TOOL (a minimal sketch of this side follows the list).

  • Model: I am using the standard API models (e.g., gpt-4o). I haven’t found a way to access a specialized “computer use” model, so I’m trying to get the best results with the available tools.
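
For concreteness, here is a minimal sketch of the capture-and-execute side in plain JDK code (class and method names are illustrative, the actual model call is omitted, and TYPE_TOOL/SCROLL_TOOL would follow the same pattern via Robot.keyPress and Robot.mouseWheel):

```java
import java.awt.AWTException;
import java.awt.Dimension;
import java.awt.Rectangle;
import java.awt.Robot;
import java.awt.Toolkit;
import java.awt.event.InputEvent;
import java.awt.image.BufferedImage;

/** Minimal sketch of the capture -> model -> tool loop; the model call is stubbed out. */
public class AgentLoop {
    private final Robot robot;

    public AgentLoop() throws AWTException {
        this.robot = new Robot();
    }

    /** Grab the full desktop as an image to send to the vision model. */
    public BufferedImage captureScreen() {
        Dimension size = Toolkit.getDefaultToolkit().getScreenSize();
        return robot.createScreenCapture(new Rectangle(size));
    }

    /** CLICK_TOOL: move the pointer to (x, y) and left-click. */
    public void click(int x, int y) {
        robot.mouseMove(x, y);
        robot.mousePress(InputEvent.BUTTON1_DOWN_MASK);
        robot.mouseRelease(InputEvent.BUTTON1_DOWN_MASK);
    }
}
```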

The Core Problem: Unreliable Clicks

My biggest challenge is reliability. The agent often understands what it needs to do but fails in execution: the model returns inaccurate click locations.

To stop the model from guessing raw coordinates, I have already implemented a grid-overlay strategy (a code sketch of the overlay follows these steps):

  1. My Java code takes a screenshot.

  2. It draws a labeled grid (e.g., A1, B2, C3…) over the entire image.

  3. This modified, gridded image is sent to the model.

  4. The model is prompted to return a grid_id (e.g., “G12”) instead of x/y coordinates.

  5. My local code then translates this grid_id back into a precise pixel coordinate and performs the click (the reverse mapping is sketched below).
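
Here is a simplified sketch of the grid drawing from steps 2 and 3 (the 100 px cell size, red styling, and letter-row/number-column labeling are illustrative choices, and single-letter rows cap out at 26):

```java
import java.awt.BasicStroke;
import java.awt.Color;
import java.awt.Font;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

/** Sketch: copy the screenshot and draw a labeled grid (A1, A2, ..., B1, ...) over it. */
public class GridOverlay {
    static final int CELL = 100; // cell size in pixels (illustrative)

    public static BufferedImage drawGrid(BufferedImage screenshot) {
        BufferedImage copy = new BufferedImage(
                screenshot.getWidth(), screenshot.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = copy.createGraphics();
        g.drawImage(screenshot, 0, 0, null);
        g.setColor(Color.RED);
        g.setStroke(new BasicStroke(1f));
        g.setFont(new Font(Font.SANS_SERIF, Font.BOLD, 14));
        for (int row = 0; row * CELL < copy.getHeight(); row++) {
            for (int col = 0; col * CELL < copy.getWidth(); col++) {
                int x = col * CELL, y = row * CELL;
                g.drawRect(x, y, CELL, CELL);
                g.drawString(cellId(row, col), x + 4, y + 16); // label in the cell's corner
            }
        }
        g.dispose();
        return copy;
    }

    /** Row letter plus 1-based column number, e.g. row 6 / col 11 -> "G12". */
    static String cellId(int row, int col) {
        return (char) ('A' + row) + String.valueOf(col + 1);
    }
}
```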

While this has improved accuracy compared to raw coordinate guessing, the model still frequently identifies the wrong grid cell, causing it to click the wrong element or empty space.
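
And the reverse mapping from step 5 is essentially this (again a sketch; CELL must match the overlay). Note that it clicks the cell center, so even a correctly chosen cell can miss a target that is smaller than a cell or sits near its edge:

```java
import java.awt.Point;

/** Sketch: map a grid_id such as "G12" back to the pixel center of that cell. */
public class GridLookup {
    static final int CELL = 100; // must match the cell size used by the overlay

    public static Point cellCenter(String gridId) {
        int row = gridId.charAt(0) - 'A';                    // "G" -> row 6
        int col = Integer.parseInt(gridId.substring(1)) - 1; // "12" -> col 11
        return new Point(col * CELL + CELL / 2, row * CELL + CELL / 2);
    }
}
```

The resulting Point is what CLICK_TOOL ultimately receives, e.g. `agent.click(p.x, p.y)` with the AgentLoop sketch above.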

My Questions for the Community:

  1. For those who have built similar agents, what are the most effective strategies to ensure reliable and precise clicks? Are there more robust methods than a grid overlay, perhaps involving OCR-based element labeling or accessibility APIs (and are these feasible in Java)?

  2. How do you structure your prompts to force better decision-making and prevent the model from getting stuck in loops or choosing a suboptimal tool (e.g., using CLICK on a taskbar icon instead of OPEN_APP)?

  3. Is there a specific model version or prompting technique (e.g., specific chain-of-thought structure) that is known to perform better for this kind of precise UI interaction?

Any advice, links to similar projects (especially in Java), or prompt engineering tips would be greatly appreciated. This is a fascinating but challenging problem to solve.

Thanks for your time!