As an enthusiastic OpenAI developer who has worked on numerous innovative projects leveraging OpenAI products, I am thrilled about the recent release of the end-to-end voice API. However, I have encountered a challenge that I hope to address:
Is it feasible to implement Retrieval-Augmented Generation (RAG) in an end-to-end voice mode?
Currently, utilizing RAG with a real-time voice API requires a multi-step process: transcribing audio to text, retrieving relevant content from a knowledge base, generating a response, and then synthesizing audio. This approach remains largely the same regardless of the tools used (e.g., Deepgram or ElevenLabs), differing mainly in how natural the output sounds.
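For reference, here is a minimal sketch of that multi-step pipeline using the standard OpenAI Python SDK. `search_knowledge_base` is a hypothetical stand-in for whatever vector search you already run; the STT/LLM/TTS calls are the usual SDK methods.

```python
# Sketch only: STT -> retrieve -> LLM -> TTS, one call per stage.
from openai import OpenAI

client = OpenAI()

def voice_rag_turn(audio_file_path: str) -> bytes:
    # 1. Transcribe the user's audio to text.
    with open(audio_file_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)
    query = transcript.text

    # 2. Retrieve relevant chunks from your own vector store (hypothetical helper).
    context = search_knowledge_base(query)

    # 3. Generate an answer grounded in the retrieved context.
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer using this context:\n{context}"},
            {"role": "user", "content": query},
        ],
    )
    answer = completion.choices[0].message.content

    # 4. Synthesize the answer back to audio.
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
    return speech.content
```

Each stage adds latency, which is exactly what an end-to-end voice mode is supposed to reduce.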
I am seeking guidance on how to effectively integrate RAG within an end-to-end voice mode to streamline and enhance the user experience. Are there any best practices or recommended approaches to achieve this seamless integration?
I am also working on RAG over telephony. From what I've read so far, it seems we will still need to transcribe the audio and perform embedding + vector search on it, so it wouldn't be voice-to-voice; it would be text-to-voice. This is still a benefit compared to the current system, since you would not need to wait for the whole response before running TTS, so we would still get a reduction in latency.
I'm waiting for access to the Realtime API so I can work out the exact code needed, but… The Realtime client lets you configure tools, so you'll want to define a tool like search_web or ask_document. The assistant should call this tool when it wants to look up information, and it's in that tool that you'll call the other model and run your standard RAG request. The tool can return the answer, which the assistant should read to the user.
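Something along these lines should work for registering the tool; this is a sketch based on the documented `session.update` event shape, so verify the field names against the Realtime API reference before relying on it.

```python
# Sketch: register a RAG lookup tool on the Realtime session over the websocket.
import json

session_update = {
    "type": "session.update",
    "session": {
        "instructions": "Acknowledge the user, then call ask_document for factual questions.",
        "tools": [
            {
                "type": "function",
                "name": "ask_document",
                "description": "Search the knowledge base and return relevant passages.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "query": {"type": "string", "description": "Normalized search query."}
                    },
                    "required": ["query"],
                },
            }
        ],
        "tool_choice": "auto",
    },
}

async def register_rag_tool(ws):
    # ws: an open websocket connection to the Realtime API.
    await ws.send(json.dumps(session_update))
```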
I still suspect you're going to want to use tools for this… The reason is that the voice assistant needs to know it's waiting on a lookup to complete, so it can properly notify the user that it's running a query and then read the results back to them. With tools you don't need to worry about the transcript, because the model can just pass the query as a param to the tool. It can also normalize the query in the process…
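The round trip would look roughly like the sketch below: catch the tool call, run your own retrieval, hand the result back as a function call output, and ask the model to keep speaking. The event names follow the Realtime docs as I understand them today, and `search_knowledge_base` is again a hypothetical helper.

```python
# Sketch: complete the ask_document tool call with our own RAG lookup.
import json

async def handle_event(ws, event: dict):
    if (
        event.get("type") == "response.function_call_arguments.done"
        and event.get("name") == "ask_document"
    ):
        args = json.loads(event["arguments"])
        passages = search_knowledge_base(args["query"])  # hypothetical RAG helper

        # Return the retrieved text to the assistant...
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": event["call_id"],
                "output": passages,
            },
        }))
        # ...and ask it to continue the spoken response with that context.
        await ws.send(json.dumps({"type": "response.create"}))
```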
You know, in some cases we have to run RAG for all user queries.
I think the use case for tools is acting as a parameter detector.
But I need RAG for every user query.
I would say that you should keep in mind that Realtime costs $100 per 1 million input tokens. That means doing your RAG as part of your Realtime call is 40x the cost of using gpt-4o separately.
With that said, I don't see any sort of posted limit for the instructions, so presumably it's just a system message. You can shove all of your RAG context in there.
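If you go that route, one option is to retrieve against the user's transcript yourself each turn and refresh the session instructions with the context, along these lines (a sketch, with `search_knowledge_base` as a hypothetical retrieval helper):

```python
# Sketch: stuff retrieved context into the session instructions before the next response.
import json

async def refresh_instructions(ws, user_transcript: str):
    context = search_knowledge_base(user_transcript)  # hypothetical RAG helper
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "instructions": (
                "Answer the user's questions using only the context below.\n\n"
                f"Context:\n{context}"
            )
        },
    }))
```

Just remember that every token you stuff in there is billed at the Realtime input rate, which is the cost concern above.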
"Get transcription first" of what content? If it is some cue to the user like "ok, let me look for that information", that is possible through the instructions, such as "acknowledge the user before using the tool". This is mentioned in the official documentation.