Chained approach vs gpt-4o-audio-preview

Hello friends,

I am developing a voice agent to help staff at an auto repair shop book appointments for their clients.

I first used the Realtime API but soon realized it is not ready for production.
I am now considering two alternative approaches:

  • Chained approach, that is, sandwiching the LLM between a speech-to-text (STT) model and a text-to-speech (TTS) model (a rough sketch is just below).
  • Multimodal approach, that is, using gpt-4o-audio-preview within the Chat Completions API (also sketched below).
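
For reference, here is roughly what I have in mind for the chained approach, one turn at a time. This is only a sketch with the OpenAI Python SDK; whisper-1, gpt-4o and tts-1 are simply the models I would try first, and the system prompt is a placeholder:

```python
# Chained approach: STT -> LLM -> TTS, one call per stage.
from openai import OpenAI

client = OpenAI()

def handle_turn(caller_audio_path: str) -> str:
    # 1. Speech-to-text: transcribe the caller's audio.
    with open(caller_audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

    # 2. LLM: decide what the agent should say next.
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You book appointments for an auto repair shop."},
            {"role": "user", "content": transcript.text},
        ],
    )
    reply_text = completion.choices[0].message.content

    # 3. Text-to-speech: synthesize the reply for playback to the caller.
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply_text)
    speech.stream_to_file("reply.mp3")
    return "reply.mp3"
```

And this is roughly how I understand the multimodal approach: a single Chat Completions call that takes the caller's audio in and returns both a transcript and spoken audio. Again just a sketch going by the audio generation guide, so the exact parameters may be off:

```python
# Multimodal approach: one gpt-4o-audio-preview call handles audio in and audio out.
import base64

from openai import OpenAI

client = OpenAI()

def handle_turn(caller_audio_path: str) -> bytes:
    # Assumes the caller's turn is available as a .wav file.
    with open(caller_audio_path, "rb") as f:
        caller_audio_b64 = base64.b64encode(f.read()).decode("utf-8")

    completion = client.chat.completions.create(
        model="gpt-4o-audio-preview",
        modalities=["text", "audio"],
        audio={"voice": "alloy", "format": "wav"},
        messages=[
            {"role": "system", "content": "You book appointments for an auto repair shop."},
            {
                "role": "user",
                "content": [
                    {"type": "input_audio", "input_audio": {"data": caller_audio_b64, "format": "wav"}},
                ],
            },
        ],
    )

    message = completion.choices[0].message
    # message.audio.transcript holds the text of the reply;
    # message.audio.data holds the spoken reply, base64-encoded.
    return base64.b64decode(message.audio.data)
```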

Both approaches are briefly described in the docs, but I am still not clear on the pros and cons of each in terms of latency, transcription quality, and audio quality.

Cheers,
Guido


I’m facing the same situation. I’ve built an entire app on the Realtime API over the phone, but it just doesn’t behave as well. It seems to lose its instructions, gets caught up, and repeats itself. This is odd to me because the one I downloaded from the OpenAI agents examples, using WebRTC, works great. My only guess is that the audio compression is killing it. That, or it’s still in beta.

I am now leaning more toward the primitives, using GPT-4o or another model and piping in the audio agents where needed. The latency will definitely be higher, but I’m willing to sacrifice that for reliability. I can’t go to production the way it behaves now.

I also have not seen any info on the next round of updates for the Realtime API or when it might be brought out of beta.


Thanks for sharing your experience, @rob266!

It seems some people out there are having success with the Realtime API by pairing it with some “babysitting” agents that correct it when it gets too creative or goes off the rails.
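
I have not tried that myself, but I picture the babysitter as a cheap text-only check on each reply before it is spoken. Something along these lines, purely hypothetical; `proposed_reply` would come from whatever the Realtime session produces:

```python
# Hypothetical "babysitter" check: review each reply before it goes out,
# and fall back to a corrected version if it drifts off the rails.
from openai import OpenAI

client = OpenAI()

BABYSITTER_PROMPT = (
    "You review replies from a voice agent that books appointments for an "
    "auto repair shop. If the reply is on-topic and follows the shop's "
    "instructions, answer exactly APPROVE. Otherwise rewrite the reply so that it does."
)

def babysit(conversation_so_far: str, proposed_reply: str) -> str:
    review = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": BABYSITTER_PROMPT},
            {
                "role": "user",
                "content": f"Conversation so far:\n{conversation_so_far}\n\nProposed reply:\n{proposed_reply}",
            },
        ],
    )
    verdict = review.choices[0].message.content.strip()
    return proposed_reply if verdict == "APPROVE" else verdict
```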

For more details, have a look at this use case from a popular podcast.