Why is native image markup still a hurdle for GPT models? And is OpenAI working on such capabilities?

Hi guys!

So, I posted this as a feature request exactly one year ago, asking if it would be possible to give GPT-4V the ability to mark up images. Since then, I have come to understand that GPT-4V wasn’t natively multimodal and would have needed a collection of additional tools/models to achieve that. However, I was just wondering: given that AI models have advanced so much and that GPT-4o is natively multimodal, why is this still a challenge for current models? And is this something we can expect OpenAI to incorporate into its next-generation models?

As an example, last year when I asked for this markup feature, I didn’t quite understand how to read a humidity chart (also known as a psychrometric chart), and I was hoping that instead of just telling me how to read it, GPT could show me by drawing lines and curves over the chart and then explaining it as a teacher would. This would mostly involve tracing over existing lines and curves, and obviously it wasn’t something GPT-4V was capable of. But despite the release of natively multimodal models and numerous "PhD-level" models, this still seems to be something AI struggles with. Why is that? And is this something we can expect OpenAI to address in any of its upcoming models?

I don’t know much about how these models work. However, something as simple as tracing a line over a curve and then moving it a few notches down seems like it should be pretty straightforward, given that even a child could do it. Moreover, can developers do anything to achieve the same thing using existing models such as GPT-4o or even GPT-4o-mini through the API?
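To make the developer side of my question concrete, here is a rough, untested sketch of the kind of workaround I imagine is possible today: ask GPT-4o to return approximate pixel coordinates for the overlay as JSON, then draw those lines client-side with Pillow. The file name chart.png, the prompt, and the JSON schema are placeholders I made up, and from what I've read the model returning usable pixel coordinates is exactly the shaky part.

```python
# Rough sketch (untested): ask GPT-4o for overlay coordinates as JSON,
# then draw them onto the chart locally with Pillow.
# Assumptions: "chart.png" exists, the model returns usable pixel
# coordinates (often the weak point), and the JSON schema below is
# invented purely for illustration.
import base64
import json

from openai import OpenAI
from PIL import Image, ImageDraw

# Load the chart and note its pixel size so the model knows which
# coordinate system to answer in.
img = Image.open("chart.png").convert("RGB")
width, height = img.size

with open("chart.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

prompt = (
    f"This psychrometric chart is {width}x{height} pixels. "
    'Return ONLY JSON of the form {"segments": [{"label": str, '
    '"points": [[x, y], ...]}]} describing the polylines a teacher '
    "would trace to show how to read the chart at 25 C dry-bulb "
    "temperature and 50% relative humidity."
)

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        }
    ],
    response_format={"type": "json_object"},
)

# Draw whatever coordinates come back onto a copy of the chart.
overlay = json.loads(response.choices[0].message.content)
draw = ImageDraw.Draw(img)
for seg in overlay.get("segments", []):
    pts = [tuple(p) for p in seg["points"]]
    if len(pts) >= 2:
        draw.line(pts, fill="red", width=3)
        draw.text(pts[0], seg.get("label", ""), fill="red")

img.save("chart_annotated.png")
```

Even then, getting the returned coordinates to line up accurately with the chart's actual curves seems to be where this falls apart, which I assume is related to the same limitation I'm asking about.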

Thank you so much to everyone who takes the time to read this!