Future API for context between images and completions

If the text and image capabilities are exposed as two separate endpoints, it could be challenging to maintain context between them.
Ideally, GPT-4 would understand context from both text and images within a single model instance. When OpenAI releases GPT-4's image capabilities, I hope they will provide a way to send and receive both text and image prompts in a single API call, maintaining the context between them. For instance, you could send screenshots of the display and then ask how to manipulate or maneuver around the interface in reaction to the updated information. Is there any news on when images will be available via GPT-4, and whether a single endpoint will be used to interact with them?
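To make the idea concrete, here is a rough sketch of what a single multimodal request payload might look like. This is purely hypothetical: the `content` array shape, the `"image_url"` part type, and the model name are assumptions on my part, not a published OpenAI schema.

```python
import json

# Hypothetical single-call payload mixing text and an image in one message.
# The message schema below is an assumption, not an official API spec.
payload = {
    "model": "gpt-4",  # assumed model identifier
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Here is a screenshot of my display. "
                            "What should I click next to reach the settings page?",
                },
                {
                    # Hypothetical content part carrying the image itself,
                    # e.g. as a base64 data URL of the screenshot.
                    "type": "image_url",
                    "image_url": {"url": "data:image/png;base64,<BASE64_SCREENSHOT>"},
                },
            ],
        }
    ],
}

# Serialize the payload as it might be sent in the body of one API call.
print(json.dumps(payload, indent=2))
```

Because both the text and the image live in the same `messages` turn, the model would see them together and could ground its answer in the current state of the screen, rather than losing context across two separate endpoint calls.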