Realtime speech-to-speech vs. chained architecture

I’ve been working on a mobile application that uses the Realtime API to let the user have a conversation with a tutor. I’m currently using the realtime speech-to-speech model, but I recently saw that OpenAI released a chained architecture for voice (voice → text → LLM → text → voice), which has more latency and uses text under the hood. I’d like to test this out, but I would need to completely refactor my code, since this architecture uses the agent framework while speech-to-speech uses WebRTC.
Is this something OpenAI is looking to add to their realtime framework? Or should I just bite the bullet and refactor the code? Latency is not a big concern for me, which is why I’d like to test whether I could get better responses with the chained architecture.

since this architecture uses the agent framework

You can set up this entire architecture without using the agent framework. I tried it for a web and mobile application that lets the user write about their work in a work journal. I haven’t tried the WebRTC setup though, so I can’t speak to the better-responses part.
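A minimal server-side sketch of that kind of chained setup, calling the API directly with the Python SDK, could look like the following. The model names (whisper-1, gpt-4o-mini, tts-1), the tutor system prompt, and the file-based audio I/O are placeholder assumptions, not exactly what I ran:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chained_turn(input_audio_path: str, output_audio_path: str) -> str:
    """One conversational turn: voice -> text -> LLM -> text -> voice."""
    # 1. Speech -> text: transcribe the user's recorded utterance
    with open(input_audio_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
        )

    # 2. Text -> LLM: generate the assistant's reply from the transcript
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a friendly tutor."},
            {"role": "user", "content": transcript.text},
        ],
    )
    reply_text = completion.choices[0].message.content

    # 3. Text -> speech: synthesize the reply as audio and save it
    speech = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=reply_text,
    )
    with open(output_audio_path, "wb") as out_file:
        out_file.write(speech.read())

    return reply_text
```

Because each turn is just three sequential API calls, you can swap the models, keep the conversation history in the messages list, and plug the whole thing behind whatever transport your app already uses, without the agent framework.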

OK, how did you do that? If you could point me to some documentation on that, it would be super helpful. Also, what was the latency like?

It is the same flow as the chained architecture you shared (voice → text → LLM → text → voice).

Sure, here are the links to the related documentation:

In my use case, the latency depends on how much work the user shares through voice. You should be able to try it out on the web app here.
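One way to see where that latency goes is to time each stage of the chain separately. A small helper along these lines (hypothetical, wrapping the stages from the sketch above) makes the per-stage breakdown visible:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(stage: str):
    # Print wall-clock time for one stage of the chained pipeline
    start = time.perf_counter()
    yield
    print(f"{stage}: {time.perf_counter() - start:.2f}s")

# Example usage around the three stages of the chained turn:
# with timed("transcription"):   ... audio.transcriptions.create(...)
# with timed("chat completion"): ... chat.completions.create(...)
# with timed("text to speech"):  ... audio.speech.create(...)
```

In practice the transcription and TTS stages grow with how long the user speaks and how long the reply is, which is why the total latency varies with the amount of voice input.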