Hi everyone,
I’m developing a desktop agent using Java/JavaFX and the official openai-java library. My goal is to create a “self-operating computer” that can perform local tasks based on user prompts.
My Current Setup:

- **Architecture:** My Java application takes a screenshot of the desktop, sends it to a vision model (like GPT-4o), and receives instructions.
- **Local Tools:** I have a set of local tools built in Java (using `java.awt.Robot`) that execute the model's decisions, such as `CLICK_TOOL`, `TYPE_TOOL`, `SCROLL_TOOL`, etc. (simplified sketch after this list).
- **Model:** I am using the standard API models (e.g., `gpt-4o`). I haven't found a way to access a specialized "computer use" model, so I'm trying to get the best results with the available tools.
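For reference, the tool layer boils down to a thin wrapper around `java.awt.Robot`. Here's a simplified sketch (the class name and the bare-bones typing logic are illustrative, not my exact code):

```java
import java.awt.*;
import java.awt.event.InputEvent;
import java.awt.event.KeyEvent;
import java.awt.image.BufferedImage;

// Simplified tool layer: one Robot instance handles screen capture,
// CLICK_TOOL, and a bare-bones TYPE_TOOL.
public class DesktopTools {
    private final Robot robot;

    public DesktopTools() throws AWTException {
        this.robot = new Robot();
        this.robot.setAutoDelay(50); // give the OS time to register each event
    }

    // Screenshot that gets sent (after the grid overlay below) to the vision model.
    public BufferedImage captureScreen() {
        Rectangle screen = new Rectangle(Toolkit.getDefaultToolkit().getScreenSize());
        return robot.createScreenCapture(screen);
    }

    // CLICK_TOOL: move to (x, y) and perform a left click.
    public void click(int x, int y) {
        robot.mouseMove(x, y);
        robot.mousePress(InputEvent.BUTTON1_DOWN_MASK);
        robot.mouseRelease(InputEvent.BUTTON1_DOWN_MASK);
    }

    // TYPE_TOOL (bare-bones): fine for plain letters/digits; a real
    // implementation needs shift handling and a full char-to-keycode map.
    public void type(String text) {
        for (char c : text.toCharArray()) {
            int keyCode = KeyEvent.getExtendedKeyCodeForChar(c);
            robot.keyPress(keyCode);
            robot.keyRelease(keyCode);
        }
    }
}
```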
The Core Problem: Unreliable Clicks
My biggest challenge is reliability. The agent often understands what it needs to do but fails in execution: specifically, the model returns incorrect screen locations for clicks.
To solve the problem of the model guessing coordinates, I have already implemented a grid overlay strategy (simplified sketch after this list):

- My Java code takes a screenshot.
- It draws a labeled grid (e.g., A1, B2, C3…) over the entire image.
- This modified, gridded image is sent to the model.
- The model is prompted to return a `grid_id` (e.g., "G12") instead of x/y coordinates.
- My local code then translates this `grid_id` back into a precise pixel coordinate to click.
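In simplified form, the overlay and the `grid_id`-to-pixel translation look roughly like this (cell size, colors, and the spreadsheet-style column naming are illustrative; my real code differs in the details):

```java
import java.awt.*;
import java.awt.image.BufferedImage;

// Simplified grid overlay: draws labeled cells (A1, B1, ...) onto a
// screenshot and converts a grid_id back to the pixel at the cell center.
public class GridOverlay {
    private final int cellSize;

    public GridOverlay(int cellSize) {
        this.cellSize = cellSize;
    }

    // Draws grid lines and a label in the top-left corner of each cell.
    public BufferedImage drawGrid(BufferedImage screenshot) {
        Graphics2D g = screenshot.createGraphics();
        g.setColor(new Color(255, 0, 0, 160)); // semi-transparent red
        g.setFont(new Font(Font.SANS_SERIF, Font.BOLD, 12));
        for (int x = 0; x < screenshot.getWidth(); x += cellSize) {
            g.drawLine(x, 0, x, screenshot.getHeight());
        }
        for (int y = 0; y < screenshot.getHeight(); y += cellSize) {
            g.drawLine(0, y, screenshot.getWidth(), y);
        }
        for (int row = 0; row * cellSize < screenshot.getHeight(); row++) {
            for (int col = 0; col * cellSize < screenshot.getWidth(); col++) {
                g.drawString(columnName(col) + (row + 1),
                        col * cellSize + 2, row * cellSize + 14);
            }
        }
        g.dispose();
        return screenshot;
    }

    // Translates a grid_id like "G12" into the center pixel of that cell.
    public Point toPixel(String gridId) {
        int split = 0;
        while (split < gridId.length() && Character.isLetter(gridId.charAt(split))) {
            split++;
        }
        int col = 0; // spreadsheet-style letters: A=1, ..., Z=26, AA=27, ...
        for (int i = 0; i < split; i++) {
            col = col * 26 + (Character.toUpperCase(gridId.charAt(i)) - 'A' + 1);
        }
        int row = Integer.parseInt(gridId.substring(split)) - 1;
        return new Point((col - 1) * cellSize + cellSize / 2,
                row * cellSize + cellSize / 2);
    }

    // Inverse of the letter scheme in toPixel: 0 -> A, 25 -> Z, 26 -> AA, ...
    private String columnName(int col) {
        StringBuilder sb = new StringBuilder();
        for (int n = col + 1; n > 0; n = (n - 1) / 26) {
            sb.insert(0, (char) ('A' + (n - 1) % 26));
        }
        return sb.toString();
    }
}
```

The `Point` returned by `toPixel` feeds directly into the click tool shown earlier.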
While this has improved accuracy compared to guessing coordinates, the model still frequently identifies the wrong grid cell, causing it to click on the wrong element or an empty space.
My Questions for the Community:

- For those who have built similar agents, what are the most effective strategies to ensure reliable and precise clicks? Are there more robust methods than a grid overlay, perhaps involving OCR-based element labeling or accessibility APIs (and are these feasible in Java)?
- How do you structure your prompts to force better decision-making and prevent the model from getting stuck in loops or choosing a suboptimal tool (e.g., using `CLICK` on a taskbar icon instead of `OPEN_APP`)?
- Is there a specific model version or prompting technique (e.g., a specific chain-of-thought structure) that is known to perform better for this kind of precise UI interaction?
Any advice, links to similar projects (especially in Java), or prompt engineering tips would be greatly appreciated. This is a fascinating but challenging problem to solve.
Thanks for your time!