When will gpt-4o be fully released and what will it look like?
I think the leap in capabilities gpt-4o showed in OpenAI's May product release, not in pure “reasoning” but in richer options for interaction and usability, and the consequences of that leap, are under-discussed. What was shown in the demo will radically disrupt many of the concepts I am working on, and I'm struggling to find discussion of the details of how it will work, when exactly it might be released, or any other speculation about the important open questions.
Mira Murati said one of the reasons gpt-4o will respond faster by voice is that it won't spend time on speech-to-text and text-to-speech, since it can process and produce audio natively. The demo also showed AI assistants with more consistent ‘styles’, ‘personalities’, or, if that's taking it too far, at least ‘levels of verbosity/wordiness’, than can be achieved today with the available models and assistants. Additionally, the new omni model showed the ability to adjust, on demand and on the fly, the level of emotion and excitement it expresses. These are all major innovations.
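To make the latency point concrete, here is roughly what a voice interaction looks like today with the publicly documented endpoints: three separate calls (Whisper transcription, a text completion, then text-to-speech), each adding a round trip. This sketch uses the current openai Python SDK; the filenames and prompt are placeholders, and the single native-audio call that would presumably replace it is not documented yet, so that part is only my guess.

```python
from openai import OpenAI

client = OpenAI()

# Today's cascaded pipeline: speech-to-text, then a text model, then text-to-speech.
# Each hop adds latency, and the intonation/emotion in the user's voice is lost at step 1.
with open("question.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_file)

reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": transcript.text}],
)

speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply.choices[0].message.content,
)
with open("answer.mp3", "wb") as out:
    out.write(speech.read())

# A fully released gpt-4o would presumably collapse these three calls into one
# audio-in/audio-out request, but that endpoint hasn't been documented yet.
```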
OpenAI said the full capabilities of gpt-4o would be available soon. Has anyone had access to any of these features already? Does anyone know when they will be available to the public, or whether there is any way for an aficionado to get early access?
How will the full gpt-4o model work? Will it be available in the API? If so, will it work like the current assistants API, with text threads and prompts? Will it accept threads and prompts as audio files, carrying information such as intonation and pace that can't be conveyed in text, to produce a more convenient, natural, and customized response style? Will it accept prompts that mix audio and text? And will it generate responses containing both audio and text if prompted?
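For what it's worth, here is the kind of request shape I'm imagining, written as a plain payload rather than real SDK calls, because none of this is documented: the "modalities", "audio", and "input_audio" fields are pure guesses on my part about how audio might be mixed with text in a chat-style API.

```python
import base64
import json

# Purely speculative request payload: every audio-related field name below
# ("modalities", "audio", "input_audio") is a guess, not a documented parameter.
with open("question.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

hypothetical_request = {
    "model": "gpt-4o",
    "modalities": ["text", "audio"],               # ask for both a transcript and spoken audio back?
    "audio": {"voice": "alloy", "format": "wav"},  # output-voice settings?
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Answer briefly and calmly."},
            # An audio content part, analogous to how image_url parts work today:
            {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    }],
}
print(json.dumps(hypothetical_request, indent=2)[:500])
```

If responses came back the same way, an audio part plus a text part per assistant message, the existing threads-and-messages model of the assistants API would carry over fairly naturally.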
How will emotion, speaking speed, and level of verbosity be managed? Through specific parameters, like temperature, top_p, etc.? Will gpt-4o be able to use speech volume to communicate the way humans do (e.g. to emphasize, to highlight, to seduce…), or will the volume of the audio it produces be fixed?
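On the control question, I can imagine two approaches: either everything is steered through the prompt itself (spoken or written instructions like “sound more excited”), or explicit knobs are exposed alongside the sampling parameters we already have. If it's the latter, maybe something like the following, where only temperature and top_p are real today and every other name is my invention:

```python
# Only temperature and top_p exist today; the rest are invented names for
# the kinds of audio-style controls I'm wondering about.
hypothetical_style_controls = {
    "temperature": 0.8,       # real: sampling randomness
    "top_p": 1.0,             # real: nucleus sampling
    "emotion": "excited",     # or a continuous 0.0-1.0 intensity?
    "speaking_rate": 1.25,    # relative to a neutral baseline?
    "verbosity": "concise",   # or handled purely via the system prompt?
    "volume_dynamics": True,  # allow loudness variation for emphasis, or keep volume fixed?
}
```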
Will a model with these additional capabilities cost the same as, and maintain the speed of, the current, more limited gpt-4o?
Are these even the right questions? Any thoughts?