Are there any plans to improve GPT-4's object localization abilities?

Hello all,

I am a huge fan of OpenAI’s products and the GPT API, and have been really pleased with OpenAI’s latest developments: drastically improving the GPT-4 API’s context length limit and performance, as well as adding image features.

While I have been extremely impressed with GPT-4’s ability to understand the images it is provided with, it seems unable to localize any of the objects in the image.

Take a UI screenshot, for instance. If GPT-4 could help navigate a UI by itself, it could be groundbreaking for disabled people trying to navigate interfaces outside of a browser, as well as for UI automation, and it would generally strengthen GPT’s reputation as an assistant.

However, in testing I’ve observed that while GPT-4 is exceptionally good at recognizing what it’s looking at (“oh, that’s a Google Chrome browser on this website”, “oh, that’s the IntelliJ project structure settings”, etc.), which clearly provides value toward a potential solution, it faces an immense challenge when I give it control of a mouse via coordinate inputs and ask it to click something. For example, it might move the mouse a little too far to the left of its target.

Then, given a new screenshot with the mouse to the left of the target (even when the cursor is turned into a large reticle), it will say “oh, it looks like the mouse is to the RIGHT of the object. Let’s move it slightly to the left.” This is incorrect, so it keeps moving the mouse further to the left, away from its target, and never clicks it.
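To make the setup concrete, here is a minimal sketch of the kind of feedback loop I’m describing. It assumes pyautogui for screenshots and mouse control and the OpenAI Python SDK with an image attached to a chat completion; the model name, prompt, and JSON parsing are simplified placeholders rather than anything official.

```python
# Minimal sketch of the screenshot -> GPT-4 with vision -> mouse loop described above.
# Assumes pyautogui and the OpenAI Python SDK (v1+); the model name and prompt
# format are from the vision-preview era and are placeholders, not a recommendation.
import base64, io, json
import pyautogui
from openai import OpenAI

client = OpenAI()

def screenshot_b64() -> str:
    buf = io.BytesIO()
    pyautogui.screenshot().save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode()

def ask_for_mouse_step(target: str) -> dict:
    # Ask the model how far to move the cursor relative to where it is now.
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "The mouse cursor is shown as a large reticle. "
                         'Return JSON {"dx": int, "dy": int, "click": bool} '
                         f"to move the cursor onto: {target}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{screenshot_b64()}"}},
            ],
        }],
        max_tokens=100,
    )
    return json.loads(resp.choices[0].message.content)  # assumes a clean JSON reply

# The failure mode in practice: each iteration misjudges which side of the target
# the cursor is on, so the corrections drift away instead of converging.
for _ in range(5):
    step = ask_for_mouse_step("the OK button in the settings dialog")
    pyautogui.moveRel(step["dx"], step["dy"])
    if step.get("click"):
        pyautogui.click()
        break
```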

It recognizes what it’s looking at, it knows what it needs to do and often exactly what it needs to click; it just can’t pinpoint it.

Some people have created workarounds for this challenge. For instance, this person got GPT-4 to navigate browser UI by providing it with the direct makeup of the visible browser elements: (link redacted)meet-gpt-4v-act-a-multimodal-ai-assistant-that-harmoniously-combines-gpt-4vision-with-a-web-browser/

However, this only works in browsers. A lot of the most confusing UI is found elsewhere, such as IntelliJ/PyCharm or Visual Studio’s project configuration pages.

The other solution I’ve personally looked into is to use a secondary computer vision model to identify elements, and then couple that with GPT-4’s ability to understand what it’s looking at and strategize a plan, as this person has done in this article: (link-redacted)dino-gpt-4v/

This can work, and it’s probably my go-to solution in the meantime (albeit not with that specific computer vision model in my case, since there are others built specifically for finding UI elements). However, it becomes a communications game between two AI models: one that can localize targets but can’t necessarily label elements with the specificity GPT-4 has achieved, and GPT-4, which can label but can’t localize what it sees. It turns into an awkward tag-team operation, and I imagine that if one model could simply do both, like a human can, the two skills could inform and improve each other.
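To show what I mean by the tag team, here is a rough sketch of the flow. The detector is left as a placeholder (detect_ui_elements) since I’m not tied to any particular model, and the prompt and response parsing are simplified assumptions:

```python
# Rough sketch of the two-model "tag team": a detector proposes UI-element boxes,
# GPT-4 with vision picks which box matches the intended action, and we click its centre.
# detect_ui_elements() is a hypothetical stand-in for whatever detector you use.
import base64, io
import pyautogui
from openai import OpenAI

client = OpenAI()

def detect_ui_elements(image_bytes: bytes) -> list[dict]:
    """Placeholder: return [{'id': 0, 'label': 'button', 'box': (x1, y1, x2, y2)}, ...]."""
    raise NotImplementedError

def choose_element(goal: str, elements: list[dict], image_b64: str) -> int:
    listing = "\n".join(f"{e['id']}: {e['label']} at {e['box']}" for e in elements)
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",   # placeholder model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Goal: {goal}\nDetected elements:\n{listing}\n"
                         "Reply with only the id of the element to click."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        max_tokens=10,
    )
    return int(resp.choices[0].message.content.strip())  # assumes a bare id back

buf = io.BytesIO()
pyautogui.screenshot().save(buf, format="PNG")
png = buf.getvalue()

elements = detect_ui_elements(png)
chosen = choose_element("open the project structure settings", elements,
                        base64.b64encode(png).decode())
x1, y1, x2, y2 = next(e["box"] for e in elements if e["id"] == chosen)
pyautogui.click(int((x1 + x2) / 2), int((y1 + y2) / 2))  # detector localizes, GPT-4 decides
```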

So in any case, that’s my question: has this lack of object localization been observed on your (OpenAI’s) end? And are there any plans to improve GPT-4’s skill in that one area?


Hi,

I completely echo this concern. Accurate object localization in images is crucial for my use cases as well, particularly regarding UI automation and accessibility. This capability, if improved, would significantly broaden GPT-4’s practical applications. Hoping OpenAI will prioritize this in future updates.

Thank you.


I came here to ask the same question. I’ve tried a few things that are close, but not quite right. I’ll share in case someone finds them useful.

First, I ask GPT to localize in percentage units, since I don’t know how the image is resized internally. I’ve found that it can guess somewhere in the right area of the image, but not precisely enough for clicking. It can also be prompted to plot the image with Matplotlib, overlay a red mark, and revise its guess.
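In case it helps anyone, here is a minimal sketch of that overlay step. The percentages are placeholders standing in for whatever GPT returns, and the file names are made up:

```python
# Sketch of the percentage-coordinate check: convert the model's (x%, y%) guess to
# pixels and overlay a red mark with Matplotlib so the model can review and revise it.
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

img = mpimg.imread("screenshot.png")          # hypothetical screenshot file
h, w = img.shape[:2]

x_pct, y_pct = 62.0, 41.0                     # placeholder guess, percent of width/height
x_px, y_px = w * x_pct / 100, h * y_pct / 100

plt.imshow(img)
plt.scatter([x_px], [y_px], s=200, c="red", marker="+")  # red mark at the guessed point
plt.axis("off")
plt.savefig("guess_overlay.png", bbox_inches="tight")
# Feed guess_overlay.png back to GPT-4 and ask it to revise the percentages.
```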

Here I’ve asked it to identify where to click to rate this film as 2.5 stars. It knows to click on the left half of the middle star. For the coordinates, it guessed a rough location, and then I asked it to check its work:

It does the right thing, but gets the wrong answer.

Other things I’ve tried, to no avail:

  • Layering a grid, labelled like spreadsheet cells/columns, and asking GPT “which cell is the target in”. It frequently picks the wrong cell.
  • Layering a red dotted box over half the image, asking “is the target in this box?”, and then iterating: moving and halving the box and trying again. This works OK, and shows that GPT does ‘know’ where the target is, but it takes far too many iterations to get down to a clickable element (see the sketch after this list).
  • Showing it a dot and asking it to move it some amount. Unreliable.
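
Here is the sketch of the box-halving idea mentioned above. The GPT call itself is stubbed out as target_in_box(), so this only shows the search logic and why the round-trip count blows up:

```python
# Sketch of the box-halving search: draw a box over half the current region, ask
# "is the target inside this box?", then recurse into the half that contains it.
# target_in_box() is a placeholder for the actual GPT-4 vision call.
from PIL import Image

def target_in_box(image: Image.Image, box: tuple[int, int, int, int]) -> bool:
    """Placeholder: render the box on the image, send it to GPT-4, parse a yes/no."""
    raise NotImplementedError

def localize(image: Image.Image, min_size: int = 40) -> tuple[int, int]:
    left, top, right, bottom = 0, 0, image.width, image.height
    while (right - left) > min_size or (bottom - top) > min_size:
        # Split the current region along its longer side.
        if (right - left) >= (bottom - top):
            mid = (left + right) // 2
            half = (left, top, mid, bottom)
            left, right = (left, mid) if target_in_box(image, half) else (mid, right)
        else:
            mid = (top + bottom) // 2
            half = (left, top, right, mid)
            top, bottom = (top, mid) if target_in_box(image, half) else (mid, bottom)
    return (left + right) // 2, (top + bottom) // 2  # centre of the final region

# Each iteration costs one model round trip, so narrowing a 1080p screen down to a
# ~40 px region takes roughly a dozen calls, which is what makes this too slow.
```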

All these approaches are ultimately flawed anyway, because layering things on top of images will fail for some images (e.g. an image of a collection of red crosses in a spreadsheet).

So I guess for now the only thing is a two-part system with GPT describing what it wants to click on to perform an action, and some other model localizing that.

I could probably train a model using GPT-generated descriptions as the labels, but if someone knows of a particularly good model for doing this sort of thing I’d love to hear it.