When will gpt-4o be fully released and what will it look like?
I think the leap in capabilities gpt-4o showed in OpenAI's May product release, not in pure “reasoning” but in richer options for interaction and usability, and the consequences of that leap, are under-discussed. What was shown in the demo will radically disrupt many of the concepts I am working on, and I'm struggling to find discussion of the details of how it will work, when exactly it might be released, or any other speculation about the important open questions.
Mira Murati said one of the reasons gpt-4o will respond faster by voice is that it won't spend time on speech-to-text and text-to-speech, since it can process and produce audio natively. The demo also showed AI assistants with more consistent ‘styles’, ‘personalities’, or, if that's taking it too far, at least ‘levels of verbosity/wordiness’, than can be achieved today with the available models and assistants. Additionally, the new omni model showed the ability to adjust, on demand and on the fly, the level of emotion and excitement it expresses. These are all major innovations.
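To make the latency point concrete, here is roughly what a voice interaction looks like today with the publicly documented endpoints: three separate calls (Whisper transcription, a text completion, then text-to-speech), each adding a round trip. This sketch uses the current openai Python SDK; the filenames and prompt are placeholders, and the single native-audio call that would presumably replace it is not documented yet, so that part is only my guess.

```python
from openai import OpenAI

client = OpenAI()

# Today's cascaded pipeline: speech-to-text, then a text model, then text-to-speech.
# Each hop adds latency, and the intonation/emotion in the user's voice is lost at step 1.
with open("question.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_file)

reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": transcript.text}],
)

speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply.choices[0].message.content,
)
with open("answer.mp3", "wb") as out:
    out.write(speech.read())

# A fully released gpt-4o would presumably collapse these three calls into one
# audio-in/audio-out request, but that endpoint hasn't been documented yet.
```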
OpenAI said the full capabilities of gpt-4o would be available soon. Has anyone had access to any of these features already? Does anyone know when they will be available to the public, or whether there is any way for an aficionado to get early access?
How will the full gpt-4o model work? Will it be available in the API? If so, will it work like the current assistants API, with text threads and prompts? Will it accept threads and prompts as audio files, carrying information such as intonation and pace that can't be conveyed in text, to produce a more convenient, natural, and customized response style? Will it accept prompts that mix audio and text? And will it generate responses containing both audio and text if prompted?
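For what it's worth, here is the kind of request shape I'm imagining, written as a plain payload rather than real SDK calls, because none of this is documented: the "modalities", "audio", and "input_audio" fields are pure guesses on my part about how audio might be mixed with text in a chat-style API.

```python
import base64
import json

# Purely speculative request payload: every audio-related field name below
# ("modalities", "audio", "input_audio") is a guess, not a documented parameter.
with open("question.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

hypothetical_request = {
    "model": "gpt-4o",
    "modalities": ["text", "audio"],               # ask for both a transcript and spoken audio back?
    "audio": {"voice": "alloy", "format": "wav"},  # output-voice settings?
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Answer briefly and calmly."},
            # An audio content part, analogous to how image_url parts work today:
            {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    }],
}
print(json.dumps(hypothetical_request, indent=2)[:500])
```

If responses came back the same way, an audio part plus a text part per assistant message, the existing threads-and-messages model of the assistants API would carry over fairly naturally.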
How will emotion, speaking speed, and level of verbosity be managed? Through specific parameters, like temperature, top_p, etc.? Will gpt-4o be able to use speech volume to communicate the way humans do (e.g. to emphasize, to highlight, to seduce…), or will the volume of the audio it produces be fixed?
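On the control question, I can imagine two approaches: either everything is steered through the prompt itself (spoken or written instructions like “sound more excited”), or explicit knobs are exposed alongside the sampling parameters we already have. If it's the latter, maybe something like the following, where only temperature and top_p are real today and every other name is my invention:

```python
# Only temperature and top_p exist today; the rest are invented names for
# the kinds of audio-style controls I'm wondering about.
hypothetical_style_controls = {
    "temperature": 0.8,       # real: sampling randomness
    "top_p": 1.0,             # real: nucleus sampling
    "emotion": "excited",     # or a continuous 0.0-1.0 intensity?
    "speaking_rate": 1.25,    # relative to a neutral baseline?
    "verbosity": "concise",   # or handled purely via the system prompt?
    "volume_dynamics": True,  # allow loudness variation for emphasis, or keep volume fixed?
}
```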
Will a model with these additional capabilities cost the same as, and maintain the speed of, the current, more limited gpt-4o?
Are these even the right questions? Any thoughts?