As an enthusiastic OpenAI developer who has worked on numerous innovative projects leveraging OpenAI products, I am thrilled about the recent release of the end-to-end voice API. However, I have encountered a challenge that I hope to address:
Is it feasible to implement Retrieval-Augmented Generation (RAG) in an end-to-end voice mode?
Currently, utilizing RAG with a real-time voice API requires a multi-step process: transcribing audio to text, retrieving relevant content from a knowledge base, generating a response, and then synthesizing audio. This approach remains largely the same regardless of the tools used (e.g., Deepgram or ElevenLabs), differing mainly in how natural the output sounds.
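For reference, here is a minimal sketch of that multi-step pipeline using the standard OpenAI Python SDK. `search_knowledge_base` is a hypothetical stand-in for whatever vector search you already run; the STT/LLM/TTS calls are the usual SDK methods.

```python
# Sketch only: STT -> retrieve -> LLM -> TTS, one call per stage.
from openai import OpenAI

client = OpenAI()

def voice_rag_turn(audio_file_path: str) -> bytes:
    # 1. Transcribe the user's audio to text.
    with open(audio_file_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)
    query = transcript.text

    # 2. Retrieve relevant chunks from your own vector store (hypothetical helper).
    context = search_knowledge_base(query)

    # 3. Generate an answer grounded in the retrieved context.
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer using this context:\n{context}"},
            {"role": "user", "content": query},
        ],
    )
    answer = completion.choices[0].message.content

    # 4. Synthesize the answer back to audio.
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
    return speech.content
```

Each stage adds latency, which is exactly what an end-to-end voice mode is supposed to reduce.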
I am seeking guidance on how to effectively integrate RAG within an end-to-end voice mode to streamline and enhance the user experience. Are there any best practices or recommended approaches to achieve this seamless integration?
I am also working on RAG over telephony. From what I've read so far, it seems we will still need to transcribe the audio and perform embedding + vector search on it, so it wouldn't be voice-to-voice; it would be text-to-voice. This is still a benefit compared to the current system, since you would not need to wait for the whole response before running TTS, so we would still get a reduction in latency.
I'm waiting for access to the Realtime API so I can work out the exact code needed, but… The Realtime client lets you configure tools, so you'll want to define a tool like search_web or ask_document. The assistant should call this tool when it wants to look up information, and it's in that tool that you'll call the other model and run your standard RAG request. The tool can return the answer, which the assistant should read to the user.
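Something along these lines should work for registering the tool; this is a sketch based on the documented `session.update` event shape, so verify the field names against the Realtime API reference before relying on it.

```python
# Sketch: register a RAG lookup tool on the Realtime session over the websocket.
import json

session_update = {
    "type": "session.update",
    "session": {
        "instructions": "Acknowledge the user, then call ask_document for factual questions.",
        "tools": [
            {
                "type": "function",
                "name": "ask_document",
                "description": "Search the knowledge base and return relevant passages.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "query": {"type": "string", "description": "Normalized search query."}
                    },
                    "required": ["query"],
                },
            }
        ],
        "tool_choice": "auto",
    },
}

async def register_rag_tool(ws):
    # ws: an open websocket connection to the Realtime API.
    await ws.send(json.dumps(session_update))
```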
I still suspect you're going to want to use tools for this… The reason is that the voice assistant needs to know it's waiting on a lookup to complete, so it can properly notify the user that it's running a query and then read the results back to them. With tools you don't need to worry about the transcript, because the model can just pass the query as a param to the tool. It can also normalize the query in the process…
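The round trip would look roughly like the sketch below: catch the tool call, run your own retrieval, hand the result back as a function call output, and ask the model to keep speaking. The event names follow the Realtime docs as I understand them today, and `search_knowledge_base` is again a hypothetical helper.

```python
# Sketch: complete the ask_document tool call with our own RAG lookup.
import json

async def handle_event(ws, event: dict):
    if (
        event.get("type") == "response.function_call_arguments.done"
        and event.get("name") == "ask_document"
    ):
        args = json.loads(event["arguments"])
        passages = search_knowledge_base(args["query"])  # hypothetical RAG helper

        # Return the retrieved text to the assistant...
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": event["call_id"],
                "output": passages,
            },
        }))
        # ...and ask it to continue the spoken response with that context.
        await ws.send(json.dumps({"type": "response.create"}))
```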
You know, in some cases we have to run RAG for all user queries.
I think the use case for tools is acting as a parameter detector.
But I need RAG for every user query.
I would say that you should keep in mind that Realtime costs $100 per 1 million input tokens. That means doing your RAG as part of your Realtime call is 40x the cost of using gpt-4o separately.
With that said, I don't see any sort of posted limit for the instructions, so presumably it's just a system message. You can shove all of your RAG context in there.
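If you go that route, one option is to retrieve against the user's transcript yourself each turn and refresh the session instructions with the context, along these lines (a sketch, with `search_knowledge_base` as a hypothetical retrieval helper):

```python
# Sketch: stuff retrieved context into the session instructions before the next response.
import json

async def refresh_instructions(ws, user_transcript: str):
    context = search_knowledge_base(user_transcript)  # hypothetical RAG helper
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "instructions": (
                "Answer the user's questions using only the context below.\n\n"
                f"Context:\n{context}"
            )
        },
    }))
```

Just remember that every token you stuff in there is billed at the Realtime input rate, which is the cost concern above.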
"Get transcription first" of what content? If it is some cue to the user like "ok, let me look for that information", that is possible through the instructions, such as "acknowledge the user before using the tool". This is mentioned in the official documentation.