RAG with voice-to-voice (end-to-end) Realtime API

As an enthusiastic OpenAI developer who has worked on numerous innovative projects leveraging OpenAI products, I am thrilled about the recent release of the end-to-end voice API. However, I have encountered a challenge that I hope to address:

Is it feasible to implement Retrieval-Augmented Generation (RAG) in an end-to-end voice mode?

Currently, using RAG with a real-time voice API requires a multi-step process: transcribing audio to text, retrieving relevant content from a knowledge base, generating a text answer, and then synthesizing audio. This approach remains largely the same irrespective of the tools used, such as Deepgram or ElevenLabs, except for variations in natural sound quality.
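For reference, here is roughly what that multi-step pipeline looks like with the OpenAI Python SDK (a minimal sketch; `search_kb` stands in for whatever embedding/vector lookup you use):

```python
from openai import OpenAI

client = OpenAI()

def voice_rag_turn(audio_path: str) -> bytes:
    # 1. Speech-to-text
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1", file=f
        ).text

    # 2. Retrieve relevant chunks from the knowledge base (placeholder helper)
    context = search_kb(transcript)

    # 3. Generate a grounded text answer
    answer = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer using this context:\n{context}"},
            {"role": "user", "content": transcript},
        ],
    ).choices[0].message.content

    # 4. Text-to-speech
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
    return speech.content
```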

I am seeking guidance on how to effectively integrate RAG within an end-to-end voice mode to streamline and enhance the user experience. Are there any best practices or recommended approaches to achieve this seamless integration?

1 Like

You won’t want to do the RAG directly using the realtime model. Instead you’ll want to use tools to bridge to a lower-cost model.

3 Likes

I am also working on RAG over telephony. From what I’ve read so far, it seems we will still need to transcribe the audio and perform embedding + vector search on it, so it wouldn’t be voice-to-voice; it would be text-to-voice. This is still a benefit compared to the current system, since you would not need to wait for the whole response before running TTS, so we would still get a reduction in latency.
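Roughly, the transcript-side retrieval I have in mind looks like this (just a sketch; `kb_chunks` and `kb_vectors` stand in for a pre-embedded knowledge base):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def retrieve(transcript: str, kb_chunks: list[str],
             kb_vectors: np.ndarray, k: int = 3) -> list[str]:
    # Embed the transcribed user query
    q = np.array(
        client.embeddings.create(
            model="text-embedding-3-small", input=transcript
        ).data[0].embedding
    )

    # Cosine similarity against the pre-computed chunk embeddings
    sims = kb_vectors @ q / (np.linalg.norm(kb_vectors, axis=1) * np.linalg.norm(q))
    top = sims.argsort()[::-1][:k]
    return [kb_chunks[i] for i in top]
```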

2 Likes

Could you tell me more about those tools?

1 Like

I’m waiting for access to the realtime APIs so I can work out the exact code needed, but… the realtime client lets you configure tools, so you’ll want to define a tool like `search_web` or `ask_document`. The assistant should call this tool when it wants to look up information, and it’s in that tool that you’ll call the other model and do your standard RAG request. The tool can return the answer, which the assistant should read to the user.
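Something along these lines for the tool definition, sent in a `session.update` event (a sketch based on my reading of the docs; `ask_document` and the instructions text are just examples, so verify the exact schema against the API reference):

```python
import json

session_update = {
    "type": "session.update",
    "session": {
        "instructions": (
            "When the user asks about our docs, call ask_document "
            "and read its answer back to them."
        ),
        "tools": [
            {
                "type": "function",
                "name": "ask_document",
                "description": "Look up an answer in the company knowledge base.",
                "parameters": {
                    "type": "object",
                    "properties": {"query": {"type": "string"}},
                    "required": ["query"],
                },
            }
        ],
    },
}
# ws.send(json.dumps(session_update))  # sent over the Realtime websocket
```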

1 Like

Thanks for your response.

I will try it and let you know.

You get a transcription of both the model’s and the user’s audio from the server events.

EDIT

You have to update the current session I believe. I think it’s off by default?

https://platform.openai.com/docs/api-reference/realtime-client-events/session-update

"input_audio_transcription": {
            "enabled": true,
            "model": "whisper-1"
        }
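Once that is enabled, the user-side transcript should come back as a server event, something like this (a sketch; double-check the exact event names against the docs):

```python
import json

def on_server_event(raw: str) -> None:
    event = json.loads(raw)
    if event["type"] == "conversation.item.input_audio_transcription.completed":
        # What the user actually said, transcribed server-side
        print("User said:", event["transcript"])
    elif event["type"] == "response.audio_transcript.done":
        # Transcript of the assistant's spoken response
        print("Assistant said:", event["transcript"])
```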
1 Like

Ahh, if that’s the case then there should be no issues with RAG!

1 Like

Is it possible to control the flow?
I mean, get the transcription first, then retrieve data from the knowledge base, then generate the voice response?

1 Like

I still suspect you’re going to want to use tools for this… The reason is that the voice assistant needs to know it’s waiting on a lookup to complete, so it can properly notify the user that it’s running a query and then read the results back to them. With tools you don’t need to worry about the transcript, because the model can just pass the query as a param to the tool. It can also normalize the query in the process…
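A rough sketch of that round trip (event names are my reading of the Realtime docs, and `run_rag` is a placeholder for your own retrieval pipeline):

```python
import json

async def handle_event(ws, event: dict) -> None:
    # The model has finished emitting the tool call with its (normalized) query
    if event.get("type") == "response.function_call_arguments.done":
        args = json.loads(event["arguments"])
        answer = run_rag(args["query"])  # classic retrieval + cheaper model call

        # Return the tool result, then ask the model to respond (i.e. speak it)
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": event["call_id"],
                "output": answer,
            },
        }))
        await ws.send(json.dumps({"type": "response.create"}))
```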

2 Likes

You know, in some cases we have to run RAG for all user queries.
I think the main use case for tools is detecting parameters.
But I need RAG on every user query.

1 Like

I would say that you should keep in mind that Realtime costs $100 per 1 million input tokens. That means that doing your RAG as part of your realtime call is 40x the cost of using gpt-4o separately.

With that said, I don’t see any sort of posted limit for the instructions, so presumably it’s just a system message. You can shove all of your RAG context in there.
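As a sketch of that approach, you could refresh the instructions with freshly retrieved context before each response (the `retrieve` helper, `kb_chunks`, and `kb_vectors` here are the hypothetical pieces from earlier in the thread, not an official API):

```python
import json

async def refresh_context(ws, user_transcript: str) -> None:
    # Pull the top chunks for this turn and push them into the session prompt.
    # At Realtime input-token prices, keep this context short.
    context = "\n\n".join(retrieve(user_transcript, kb_chunks, kb_vectors))
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "instructions": (
                "Answer only from the context below; say so if it is missing.\n\n"
                + context
            ),
        },
    }))
```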

1 Like

Yes, but I’m not sure how to do it effectively yet :sweat_smile:

Agreed. It will take some time to get any sort of external information. The function-calling flow works pretty nicely.

1 Like

Yes, this is mostly speculation at this point. I hope to have some basic techniques worked out shortly after gaining access… Patiently waiting…

2 Likes

‘Get the transcription first’ of what content? If it is some cue to the user like “OK, let me look for that information,” that is possible through the instructions, e.g. “acknowledge the user before using the tool.” This is mentioned in the official documentation.