Truly multimodal gpt-4o ("omni") from OpenAI's May release: when and what?

When will gpt-4o be fully released and what will it look like?

I think the leap in capabilities shown by gpt-4o in OpenAI's May product release, not in pure “reasoning” but in the enhanced options for interaction and usability, and its consequences, is under-discussed. What was shown in the demo will radically disrupt many of the concepts I am working on, and I’m struggling to find discussion of the details of how it will work, when exactly it might be released, or any other speculation about the important open questions.

Mira Murati said one of the reasons gpt-4o will respond faster by voice is that it won’t waste time on speech-to-text and text-to-speech, since it can process and produce audio natively. The demo also showed AI assistants with more consistent ‘styles’, ‘personalities’, or, if that’s taking it too far, at least ‘levels of verbosity/wordiness’, than can be achieved today with available models and assistants. Additionally, the new omni model showed the ability to adjust, on demand and on the fly, the level of emotion and excitement it expresses. These are all major innovations.
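
For concreteness, the latency point is about collapsing the cascade that voice assistants need today: three sequential calls instead of one native audio call. A minimal sketch of that cascade, using the currently documented endpoints (the model and voice names below come from the existing docs, not from anything shown in the demo):

```python
# The pre-4o voice pipeline: three sequential round trips.
# Native audio in gpt-4o would collapse this into a single model call,
# which is where the latency savings Murati described come from.
from openai import OpenAI

client = OpenAI()

# 1. Speech-to-text (Whisper) — loses intonation, pacing, emotion
with open("user_question.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# 2. Text-in, text-out reasoning
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": transcript.text}],
)
reply_text = completion.choices[0].message.content

# 3. Text-to-speech on the reply
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply_text,
)
with open("assistant_reply.mp3", "wb") as out:
    out.write(speech.content)
```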

OpenAI said the full capabilities of gpt-4o would be available soon. Has anyone had access to any of these features already? Does anyone know when they will be available to the public, or whether there is any way for an aficionado to get early access?

How will the full gpt-4o model work? Will it be available in the API? If so, will it work similarly to the current Assistants API, with text threads and prompts? Will it accept threads and prompts as audio files, carrying information such as intonation and speed that can’t be communicated in text, to get a more convenient, natural and custom response style? Will it accept prompts in both audio and text format? Will it generate responses containing both audio and text if prompted?
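
To make these questions concrete, here is one purely speculative shape such a request could take, assuming OpenAI extends the existing chat completions endpoint rather than adding a new one. The `modalities`, `audio`, and `input_audio` fields are my guesses for illustration, not announced parameters:

```python
# Speculative sketch of an audio-in, audio-out chat completions request.
# The audio-related fields below are hypothetical; only the endpoint URL,
# auth header, "model" and "messages" fields are documented today.
import base64
import os

import requests

with open("question.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("ascii")

payload = {
    "model": "gpt-4o",                             # placeholder model name
    "modalities": ["text", "audio"],               # hypothetical: ask for both back
    "audio": {"voice": "alloy", "format": "wav"},  # hypothetical output controls
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Answer briefly and casually."},
                # hypothetical content part carrying the raw audio, so the model
                # could hear intonation and pacing instead of a flat transcript
                {"type": "input_audio",
                 "input_audio": {"data": audio_b64, "format": "wav"}},
            ],
        }
    ],
}

resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json=payload,
    timeout=60,
)
print(resp.json())
```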

How will emotion, speed of speech, and level of verbosity be managed? Through specific parameters, just like temperature, top_p, etc.? Will gpt-4o be able to use speech volume to communicate the way humans do (e.g. to emphasize, to highlight, to seduce…), or will the volume of the audio produced be fixed?
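
Again purely speculative: maybe style ends up as dedicated parameters alongside the sampling ones, or maybe it stays at the prompt level, which is the only lever that definitely exists today. A sketch with the hypothetical parameters left commented out:

```python
# Where might style controls live? Only the prompt-level approach below is
# known to work today; "emotion", "speaking_rate" and "verbosity" are
# hypothetical names, by analogy with temperature/top_p, not real parameters.
from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.7,          # real sampling parameter
    top_p=1.0,                # real sampling parameter
    # emotion="excited",      # hypothetical
    # speaking_rate=1.25,     # hypothetical
    # verbosity="terse",      # hypothetical
    messages=[
        # The grounded alternative: steer tone and wordiness via instructions.
        {"role": "system",
         "content": "Answer in at most two short sentences, in a calm, flat tone."},
        {"role": "user", "content": "How do transformers handle long inputs?"},
    ],
)
print(completion.choices[0].message.content)
```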

Will a model with these additional capabilities cost the same as, and maintain the speed of, the current, more limited gpt-4o?

Are these even the right questions? Any thoughts?

2 Likes

This was an update posted by OpenAI on LinkedIn regarding some of the further roll-out steps:

[Image: screenshot of OpenAI's LinkedIn update on the advanced Voice Mode roll-out]

Source: OpenAI on LinkedIn: We're sharing an update on the advanced Voice Mode we demoed during our…

1 Like

The video and screen-sharing capabilities, which they said would be released separately, will probably arrive within the next few weeks: the macOS ChatGPT app has recently started asking for permission to view your screen.

2 Likes

I’ve been wondering about this myself; it feels like forever since they showcased the model. Fairly disappointing, to be honest, especially when you hear they’re reportedly in pretty deep financial trouble.

1 Like

They started rolling out voice access this week.

2 Likes