What will be the final/full released capabilities of GPT-4o in the API?

Hi OpenAI team and community,

In the announcement blog post, OpenAI mentions that GPT-4o can accept combinations of text, audio, and image inputs and generate text, image, and audio outputs natively within a single model. Will all of these capabilities eventually be released in the API? Specifically, would you be able to do things like:

  1. Input text and output an image
  2. Input text and get spoken audio, akin to text-to-speech but more advanced (say, outputting spoken audio with multiple speakers)
  3. Input an image and output audio (either general sounds or speech)
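For concreteness, here is a rough sketch of what a request for capability 3 might look like if the existing Chat Completions content-part format were extended. The text and image-URL content parts follow the current API shape, but the `modalities` parameter and the `build_multimodal_request` helper are my own hypothetical illustration, not a documented interface:

```python
def build_multimodal_request(prompt_text, image_url=None, output_modalities=("text",)):
    """Build a Chat Completions-style payload.

    The content-part format for text and image inputs matches the current
    API; requesting image or audio *output* via "modalities" is purely
    speculative and used here only to illustrate the question.
    """
    content = [{"type": "text", "text": prompt_text}]
    if image_url is not None:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    return {
        "model": "gpt-4o",
        "messages": [{"role": "user", "content": content}],
        # Hypothetical parameter -- not part of the current API.
        "modalities": list(output_modalities),
    }

# Example: case 3 above -- image in, spoken audio out.
req = build_multimodal_request(
    "Describe this scene aloud.",
    image_url="https://example.com/scene.png",
    output_modalities=("audio",),
)
```

Today only the text-output path of such a request works; the question is whether the other output modalities will ever be accepted.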

If the answer is yes to all three of these examples, this would be an incredibly enticing product, and I would start brainstorming ideas around it. However, if the API is going to stay limited to audio-in/audio-out (voice chat) plus what we currently have, that would also be great to know, so we don't waste time waiting for something that isn't coming.

I'd appreciate any answer!