gpt-4o-audio-preview: poor performance with audio input

We’re reaching out to discuss some observations and ideas regarding GPT’s voice functionality, along with potential integrations and improvements:

  1. Performance with Audio Input:
    While GPT’s voice feature is quite impressive, we’ve noticed that gpt-4o-audio-preview often misunderstands or mis-transcribes audio input, although it responds well once the input has been understood correctly (a minimal sketch of how we’re calling the API follows this list). Could you provide insight into why this might be happening, and whether improvements are planned in this area?

  2. Integration of Audio and Text Modes:
    Having separate sessions for voice and text creates a fragmented user experience. Integrating them into a single interface, as the ChatGPT app does, would significantly improve usability; the sketch after this list also shows how we’re mixing text and audio turns in one message history. Are there any ongoing developments or recommended approaches to achieve this?

  3. Idea for Gemini-GPT Synergy:
    We’ve been exploring a potential synergy in which Gemini serves as the speech-to-text front end, transcribing audio into text, with the transcript then passed to GPT to generate a spoken response. This approach could combine the strengths of both models (a rough pipeline sketch follows this list).

    Have there been discussions about integrations like this, and are there tools or APIs that could support them?
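For reference on points 1 and 2, here is a minimal sketch of how we’re exercising the model through the Chat Completions API. The file names and voice choice are placeholders; if we’re misusing the audio input format, that could explain some of what we’re seeing, and the example also shows text and audio turns sharing one message history:

```python
import base64
from openai import OpenAI  # assumes openai>=1.x and OPENAI_API_KEY in the environment

client = OpenAI()

# Read a local WAV file and base64-encode it for the API.
with open("question.wav", "rb") as f:  # placeholder input file
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],              # ask for a transcript and spoken audio back
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        # Plain-text turns and audio turns can share one message history,
        # so voice and text need not live in separate sessions at the API level.
        {"role": "user", "content": "Please answer the question in the next message."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Here is the recording:"},
                {
                    "type": "input_audio",
                    "input_audio": {"data": audio_b64, "format": "wav"},
                },
            ],
        },
    ],
)

reply = completion.choices[0].message
print(reply.audio.transcript)                  # text transcript of the spoken reply

# The spoken reply comes back base64-encoded; decode it and save to disk.
with open("answer.wav", "wb") as f:
    f.write(base64.b64decode(reply.audio.data))
```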
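To make point 3 concrete, here is a rough sketch of the pipeline we have in mind, not a working integration: Gemini digests the audio into text via its Files API, and GPT generates the spoken reply. The model names, file paths, and transcription prompt are our own placeholders:

```python
import base64
import os

import google.generativeai as genai  # assumes the google-generativeai package
from openai import OpenAI

genai.configure(api_key=os.environ["GEMINI_API_KEY"])  # placeholder env var
openai_client = OpenAI()                               # OPENAI_API_KEY from the environment


def transcribe_with_gemini(path: str) -> str:
    """Step 1: let Gemini digest the audio into text."""
    audio_file = genai.upload_file(path)  # the Files API accepts common audio formats
    model = genai.GenerativeModel("gemini-1.5-flash")
    result = model.generate_content([audio_file, "Transcribe this audio verbatim."])
    return result.text


def speak_with_gpt(transcript: str) -> bytes:
    """Step 2: have GPT generate a spoken response to the transcript."""
    completion = openai_client.chat.completions.create(
        model="gpt-4o-audio-preview",
        modalities=["text", "audio"],
        audio={"voice": "alloy", "format": "wav"},
        messages=[{"role": "user", "content": transcript}],
    )
    return base64.b64decode(completion.choices[0].message.audio.data)


if __name__ == "__main__":
    text = transcribe_with_gemini("question.wav")  # placeholder input file
    with open("answer.wav", "wb") as f:
        f.write(speak_with_gpt(text))
```

One trade-off worth noting: this two-hop design adds a round trip of latency and loses any prosody or tone in the original audio, since GPT only ever sees Gemini’s transcript.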

We’d love to hear your thoughts, suggestions, or any insights on these topics. Additionally, we’ll be coordinating with other teams, such as Phidata, to strategize further, and would appreciate any guidance you can provide.