It can*
*Sort of
An old screenshot from the lab:
Your job is to find the location of ‘the black talk to claude button’.
((gpt-4o-2024-05-13))
But it’s pretty tricky, and finicky to boot.
Some believe that once we have proper embodied models, this stuff will become easier. You can perhaps think of it as hand-eye coordination. Babies really struggle with it, and certain neurological conditions make it more difficult for adults. And the current models are definitely missing certain human faculties.
I’m not gonna shill my own products here - I do recommend following @anon10827405’s advice and going with Tesseract or similar, if your use case allows.
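For anyone who wants to try the OCR route, here's a rough sketch of what it could look like in Python. It assumes the `pytesseract` wrapper and Pillow are installed along with the Tesseract binary; the function names `find_phrase_box` and `locate_on_screenshot` are made up for illustration. Note that this only finds the button's label text, so it won't help with elements that have no readable text on them.

```python
from typing import Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def find_phrase_box(data: dict, phrase: str) -> Optional[Box]:
    """Find the first run of consecutive OCR words equal to `phrase`.

    `data` is expected to have the shape pytesseract.image_to_data returns
    with output_type=Output.DICT: parallel lists under "text", "left",
    "top", "width", and "height".
    """
    target = phrase.lower().split()
    if not target:
        return None
    # Skip empty entries (Tesseract emits them for blocks and lines),
    # but remember each word's index into the parallel lists.
    words = [(i, t.strip().lower()) for i, t in enumerate(data["text"]) if t.strip()]
    for start in range(len(words) - len(target) + 1):
        window = words[start:start + len(target)]
        if [w for _, w in window] == target:
            idx = [i for i, _ in window]
            # Union of the matched words' bounding boxes.
            left = min(data["left"][i] for i in idx)
            top = min(data["top"][i] for i in idx)
            right = max(data["left"][i] + data["width"][i] for i in idx)
            bottom = max(data["top"][i] + data["height"][i] for i in idx)
            return (left, top, right, bottom)
    return None


def locate_on_screenshot(path: str, phrase: str) -> Optional[Box]:
    # Imported lazily so the matching logic works without Tesseract installed.
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(path), output_type=pytesseract.Output.DICT
    )
    return find_phrase_box(data, phrase)
```

You'd call it with something like `locate_on_screenshot("screenshot.png", "talk to claude")` (hypothetical file name) and click the center of the returned box. The match is exact on lowercase words, so OCR misreads will make it return `None`; fuzzy matching would be the obvious next step.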