Realtime speech-to-speech vs. chained architecture

I’ve been working on a mobile application that uses the Realtime API to let the user have a conversation with a tutor. I’m currently using the realtime speech-to-speech model, but I recently saw that OpenAI released a chained architecture for voice (voice → text → LLM → text → voice), which has more latency and uses text under the hood. I’d like to test this out, but I would need to completely refactor my code, since this architecture uses the agent framework while speech-to-speech uses WebRTC.
Is this something OpenAI is looking to add to their realtime framework? Or should I just bite the bullet and refactor the code? Latency is not a big concern for me, which is why I’d like to test whether I could get better responses with the chained architecture.

since this architecture uses the agent framework

You can set up this entire architecture without using the agent framework. I tried it for a web and mobile application that lets the user write about their work in a work journal. I haven’t tried the WebRTC setup though, so I can’t speak to the better-responses part.
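A minimal server-side sketch of that kind of chained setup, calling the API directly with the Python SDK, could look like the following. The model names (whisper-1, gpt-4o-mini, tts-1), the tutor system prompt, and the file-based audio I/O are placeholder assumptions, not exactly what I ran:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chained_turn(input_audio_path: str, output_audio_path: str) -> str:
    """One conversational turn: voice -> text -> LLM -> text -> voice."""
    # 1. Speech -> text: transcribe the user's recorded utterance
    with open(input_audio_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
        )

    # 2. Text -> LLM: generate the assistant's reply from the transcript
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a friendly tutor."},
            {"role": "user", "content": transcript.text},
        ],
    )
    reply_text = completion.choices[0].message.content

    # 3. Text -> speech: synthesize the reply as audio and save it
    speech = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=reply_text,
    )
    with open(output_audio_path, "wb") as out_file:
        out_file.write(speech.read())

    return reply_text
```

Because each turn is just three sequential API calls, you can swap the models, keep the conversation history in the messages list, and plug the whole thing behind whatever transport your app already uses, without the agent framework.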

OK, how did you do that? If you could point me to some documentation on that, it would be super helpful. Also, what was the latency like?

It is the same flow as the chained architecture you shared (voice → text → LLM → text → voice).

Sure, here are the links to the related documentation:

In my use case, the latency depends on how much work the user shares through voice. You should be able to try it out on the web app here.
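One way to see where that latency goes is to time each stage of the chain separately. A small helper along these lines (hypothetical, wrapping the stages from the sketch above) makes the per-stage breakdown visible:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(stage: str):
    # Print wall-clock time for one stage of the chained pipeline
    start = time.perf_counter()
    yield
    print(f"{stage}: {time.perf_counter() - start:.2f}s")

# Example usage around the three stages of the chained turn:
# with timed("transcription"):   ... audio.transcriptions.create(...)
# with timed("chat completion"): ... chat.completions.create(...)
# with timed("text to speech"):  ... audio.speech.create(...)
```

In practice the transcription and TTS stages grow with how long the user speaks and how long the reply is, which is why the total latency varies with the amount of voice input.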