Chained approach vs gpt-4o-audio-preview

Hello friends,

I am developing a voice agent to help staff at an auto repair shop book appointments for their clients.

I first used the Realtime API but soon realized it is not ready for production.
I am now considering two alternative approaches:

  • Chained approach, that is, sandwiching the LLM between a speech-to-text (STT) model and a text-to-speech (TTS) model (a rough sketch is just below).
  • Multimodal approach, that is, using gpt-4o-audio-preview within the Chat Completions API (also sketched below).
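
For reference, here is roughly what I have in mind for the chained approach, one turn at a time. This is only a sketch with the OpenAI Python SDK; whisper-1, gpt-4o and tts-1 are simply the models I would try first, and the system prompt is a placeholder:

```python
# Chained approach: STT -> LLM -> TTS, one call per stage.
from openai import OpenAI

client = OpenAI()

def handle_turn(caller_audio_path: str) -> str:
    # 1. Speech-to-text: transcribe the caller's audio.
    with open(caller_audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

    # 2. LLM: decide what the agent should say next.
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You book appointments for an auto repair shop."},
            {"role": "user", "content": transcript.text},
        ],
    )
    reply_text = completion.choices[0].message.content

    # 3. Text-to-speech: synthesize the reply for playback to the caller.
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply_text)
    speech.stream_to_file("reply.mp3")
    return "reply.mp3"
```

And this is roughly how I understand the multimodal approach: a single Chat Completions call that takes the caller's audio in and returns both a transcript and spoken audio. Again just a sketch going by the audio generation guide, so the exact parameters may be off:

```python
# Multimodal approach: one gpt-4o-audio-preview call handles audio in and audio out.
import base64

from openai import OpenAI

client = OpenAI()

def handle_turn(caller_audio_path: str) -> bytes:
    # Assumes the caller's turn is available as a .wav file.
    with open(caller_audio_path, "rb") as f:
        caller_audio_b64 = base64.b64encode(f.read()).decode("utf-8")

    completion = client.chat.completions.create(
        model="gpt-4o-audio-preview",
        modalities=["text", "audio"],
        audio={"voice": "alloy", "format": "wav"},
        messages=[
            {"role": "system", "content": "You book appointments for an auto repair shop."},
            {
                "role": "user",
                "content": [
                    {"type": "input_audio", "input_audio": {"data": caller_audio_b64, "format": "wav"}},
                ],
            },
        ],
    )

    message = completion.choices[0].message
    # message.audio.transcript holds the text of the reply;
    # message.audio.data holds the spoken reply, base64-encoded.
    return base64.b64decode(message.audio.data)
```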

Both approaches are briefly described in the docs, but I am still not clear on the pros and cons of each in terms of latency, transcription quality, and audio quality.

Cheers,
Guido


I’m facing the same situation. I’ve built an entire app on the Realtime API over the phone, but it just doesn’t behave as well. It seems to lose its instructions, gets caught up, and repeats itself. This is odd to me because the one I downloaded from the OpenAI agents examples, using WebRTC, works great. My only guess is that the audio compression is killing it. That, or it’s still in beta.

I am now leaning more toward the primitives, using GPT-4o or another model and piping in the audio agents where needed. The latency will definitely be higher, but I’m willing to sacrifice that for reliability. I can’t go to production the way it behaves now.

I also have not seen any info on the next round of updates for the Realtime API or when it might be brought out of beta.


Thanks for sharing your experience, @rob266!

It seems some people out there are having success with the Realtime API by pairing it with some “babysitting” agents that correct it when it gets too creative or goes off the rails.
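
I have not tried that myself, but I picture the babysitter as a cheap text-only check on each reply before it is spoken. Something along these lines, purely hypothetical; `proposed_reply` would come from whatever the Realtime session produces:

```python
# Hypothetical "babysitter" check: review each reply before it goes out,
# and fall back to a corrected version if it drifts off the rails.
from openai import OpenAI

client = OpenAI()

BABYSITTER_PROMPT = (
    "You review replies from a voice agent that books appointments for an "
    "auto repair shop. If the reply is on-topic and follows the shop's "
    "instructions, answer exactly APPROVE. Otherwise rewrite the reply so that it does."
)

def babysit(conversation_so_far: str, proposed_reply: str) -> str:
    review = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": BABYSITTER_PROMPT},
            {
                "role": "user",
                "content": f"Conversation so far:\n{conversation_so_far}\n\nProposed reply:\n{proposed_reply}",
            },
        ],
    )
    verdict = review.choices[0].message.content.strip()
    return proposed_reply if verdict == "APPROVE" else verdict
```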

For more details, have a look at this use case from a popular podcast.