I am a huge fan of OpenAI’s products and the GPT API, and I have been really pleased with OpenAI’s latest developments: drastically improving the GPT-4 API’s context length limit and performance, as well as adding image features.
While I have been extremely impressed with GPT-4’s ability to understand the images it is provided with, it seems unable to localize any of the objects in an image.
Take a UI as an example: if GPT-4 could navigate a UI by itself, it could be groundbreaking for disabled people trying to navigate UIs outside of a browser, as well as for UI automation, and it would be a general improvement to GPT’s reputation as an assistant.
However, in testing I’ve observed that while GPT-4 is exceptionally good at recognizing what it’s looking at, which is clearly valuable for a potential solution (“oh, that’s a Google Chrome browser on this website,” “oh, that’s the IntelliJ project structure settings,” etc.), it faces an immense challenge the moment I give it control of a mouse via coordinate inputs and ask it to click something. For example, it might move the mouse a little too far to the left of its target.
Then, given a new screenshot with the mouse to the left of the target (even when the mouse is turned into a large reticle), it will repeatedly say “oh, it looks like the mouse is to the RIGHT of the object; let’s move it slightly to the left.” This is incorrect, so it keeps moving the mouse further to the left, away from its target, and never clicks it.
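That failure mode can be simulated without any API at all. The sketch below is purely illustrative (the function names and numbers are mine, not anything from GPT-4): a stub stands in for the model’s spatial judgment and always reports the cursor on the wrong side of the target, so each “correction” pushes the cursor further away instead of converging.

```python
# Hypothetical simulation of the feedback loop described above. The cursor
# starts slightly LEFT of the target, but the stubbed-out judgment reports
# it as being RIGHT of the target, so every nudge moves it further left.

def misjudged_direction(cursor_x: float, target_x: float) -> float:
    """Stand-in for the model's (wrong) spatial call: it reports the cursor
    on the opposite side of the target from where it actually is."""
    return 1.0 if cursor_x < target_x else -1.0  # "it's to the right" when it's left

def correction_loop(cursor_x: float, target_x: float,
                    steps: int = 5, nudge: float = 5.0) -> list[float]:
    """Apply `steps` corrections, moving opposite the reported direction."""
    history = [cursor_x]
    for _ in range(steps):
        direction = misjudged_direction(cursor_x, target_x)
        cursor_x -= direction * nudge  # "move slightly to the left"
        history.append(cursor_x)
    return history

print(correction_loop(95.0, 100.0))  # → [95.0, 90.0, 85.0, 80.0, 75.0, 70.0]
```

The distance to the target grows on every step, which matches what I see in practice: the loop diverges instead of homing in.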
It recognizes what it’s looking at, it knows what it needs to do and often exactly what it needs to click; it just can’t pinpoint it.
Some people have created workarounds for this challenge. For instance, this person got GPT-4 to navigate browser UI by providing it with the direct makeup of the visible browser elements: (link redacted)meet-gpt-4v-act-a-multimodal-ai-assistant-that-harmoniously-combines-gpt-4vision-with-a-web-browser/
However, this only works in browsers. A lot of the most confusing UI lives elsewhere, such as IntelliJ/PyCharm or Visual Studio’s project configuration pages.
The other solution I’ve personally looked into is using a secondary computer vision model to identify elements, then coupling that with GPT-4’s ability to understand what it’s looking at and strategize a plan, like this person has done in this article: (link-redacted)dino-gpt-4v/
This can work, and it’s probably my go-to solution in the meantime (albeit not with that specific computer vision model in my case, since there are others built specifically for finding UI elements). However, this is now a communications game between two AI models: one which can localize targets but can’t label elements with the specificity GPT-4 has achieved, and GPT-4, which can label but can’t localize what it sees. It becomes an awkward tag-team operation, and I imagine that if one model could simply do both, like a human can, the two skills could even inform each other.
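For concreteness, the tag-team pipeline I have in mind looks roughly like this. Everything here is a stub of my own invention (the function names `detect_ui_elements` and `ask_vlm_which_box` and the box values are hypothetical, not a real detector or GPT-4V call): the detector localizes but only coarsely labels, the vision-language model picks which box matches the goal, and the glue code clicks the centre of the chosen box.

```python
# Minimal sketch of the two-model "tag team" described above, with both
# models stubbed out so the wiring between them is visible.

from dataclasses import dataclass

@dataclass
class Box:
    label: str  # coarse label from the detector ("button", "field", ...)
    x: int
    y: int
    w: int
    h: int

def detect_ui_elements(screenshot) -> list[Box]:
    # Stub for a UI-element detector: it can find boxes, but labels them
    # only generically. A real detector would process the screenshot.
    return [Box("button", 40, 300, 120, 32), Box("button", 200, 300, 120, 32)]

def ask_vlm_which_box(goal: str, boxes: list[Box]) -> int:
    # Stub for a GPT-4V-style call: a real version would send the
    # screenshot plus an indexed list of boxes and parse back an index.
    return 1  # pretend the model picked the second button

def click_target(goal: str, screenshot) -> tuple[int, int]:
    """Detector localizes, VLM chooses, glue code computes the click point."""
    boxes = detect_ui_elements(screenshot)
    chosen = boxes[ask_vlm_which_box(goal, boxes)]
    return (chosen.x + chosen.w // 2, chosen.y + chosen.h // 2)

print(click_target("press the OK button", screenshot=None))  # → (260, 316)
```

The awkwardness I described lives in `ask_vlm_which_box`: the coordinates never pass through the model that actually understands the screen, so each side is working with half the picture.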
So, in any case, that’s my question: has this lack of object localization been observed on your (OpenAI’s) end? And are there any plans to improve GPT-4’s skill in that one area?