I’d like to share this POC that connects the OpenAI Realtime API to a Twilio phone number and then augments it with RAG knowledge.
Something like this lets you “Talk to a Website”: you literally call a phone number and talk to the contents of a website (like a support helpdesk).
I’m super excited about this because it democratizes the concept of a fully-automated AI call center, one that your users can call to talk about anything related to your websites or documents.
What can you do with this: Automated Phone Support
Disclaimer: I will probably keep this demo number up for a few days. You can ask it questions like “Do you have a free trial?” or “Can I ingest documents from my Google Drive?”
How is this done?
The call management is handled by Twilio voice
The AI and voice interactions are handled by the OpenAI Realtime API
The RAG and knowledge management is handled by CustomGPT .ai (“fast mode”)
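To make the wiring concrete, here is a minimal sketch of the Twilio side, assuming a hypothetical voice webhook and WebSocket bridge URL (neither is from the original post): when a call comes in, Twilio POSTs to your webhook, and the TwiML you return tells Twilio to stream the call audio over a WebSocket to your server, which relays it to the Realtime API.

```python
# Sketch only: the URL below is a placeholder for your own media bridge,
# which would forward audio between Twilio and the OpenAI Realtime API.

def voice_webhook_twiml(stream_url: str) -> str:
    """Build the TwiML that connects an incoming call to a media stream."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<Response>"
        "<Connect>"
        f'<Stream url="{stream_url}" />'
        "</Connect>"
        "</Response>"
    )

# Your webhook endpoint would return this string as the HTTP response body.
print(voice_webhook_twiml("wss://your-server.example.com/media"))
```

The `<Connect><Stream>` TwiML verb is what gives you a bidirectional audio stream, so the bot can both hear the caller and speak back on the same call.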
Features:
Human-like interactions similar to ChatGPT AVM (Advanced Voice Mode)
Anti-hallucination (answers are based on your content - not generic ChatGPT knowledge)
Can understand and respond in multiple languages - even if content is in English
Limitations:
The tone in OpenAI’s realtime voice is still quite robotic
There are some odd pauses while the model is thinking and fetching information.
Sometimes phonetic transcriptions don’t work well …
Next Steps:
The biggest drawback right now is that it still sounds a little robotic; it would be great to hear if the community has some ideas.
I’d really like to know what features need to be implemented for this to have parity with traditional IVRs. (PS: I know one feature will be “Press 0 for operator,” to give users an OUT if they are frustrated.) What else?
More info and code (GitHub) – contributions welcome:
h/t @bill.french (you had talked about this a year ago)
No: the Realtime API either uses server-side voice activity detection, listening for silence before starting the response, or lets you trigger the start of language inference yourself. It still operates like a transformer language model: tokens are placed in a context window (or simply a buffer, not tied to a particular inference server), and having the AI generate after that is a deliberate action.
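For reference, the two turn-taking modes described above map to the Realtime API's session configuration: server-side VAD, or disabling it and committing the audio buffer and requesting a response yourself. A sketch of the client events involved (event names per OpenAI's Realtime API docs; exact session fields may vary by version):

```python
import json

# Mode 1: server-side VAD. The server listens for silence and starts
# the response on its own.
server_vad_session = {
    "type": "session.update",
    "session": {"turn_detection": {"type": "server_vad"}},
}

# Mode 2: manual turn-taking. Disable VAD, then explicitly commit the
# input audio buffer and ask for a response when YOU decide the user
# has finished speaking.
manual_session = {
    "type": "session.update",
    "session": {"turn_detection": None},
}
commit_and_respond = [
    {"type": "input_audio_buffer.commit"},
    {"type": "response.create"},
]

# These JSON events are what a client sends over the Realtime WebSocket.
for event in [server_vad_session, manual_session, *commit_and_respond]:
    print(json.dumps(event))
```

Either way, generation only begins when one of these triggers fires, which is the "deliberate action" point made above.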
OpenAI has essentially made a stateful chat interface, with less transparency and permanence than Assistants, that again follows the paradigm of server-encoded tokens instead of accepting low-bandwidth token numbers, in this case keeping the encoding of audio, as well as images, a trade secret.
The equivalent of streaming would be something I just invented for you (though I’m not the smartest person in the universe, so I can neither call it non-obvious nor implement it), called predictive closure: after a packet of tokens, an inference is run to see whether the AI emits a “they’re still talking” token that is discarded, or instead initiates a response to the user, following reinforcement learning on incomplete speech patterns. Then (needing advanced-level training) you would intersperse newly received audio in a user container with AI generation, to see if the AI wants to continue past noise or listening agreements, or wants to emit a stop sequence upon being interrupted. All just for the dubious application of pretending to be in a sci-fi movie.
Thanks Thomas. Yeah, there seems to be a 5-10 second lag after asking the question (to retrieve the information from the RAG). That is too big (it should be sub-second); we are investigating why that is happening.
PS: I’ve shared the code — you should be able to replace the CustomGPT part with your RAG API with just a few line changes.
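For anyone swapping in their own RAG backend, the shape of the change is roughly this sketch (the function, endpoint, and payload below are hypothetical, not the CustomGPT API): isolate the "fetch an answer for this transcript" step behind one function, and point it at your API. The transport is injected here so the sketch runs without a network call.

```python
import json
from typing import Callable

def query_rag(
    question: str,
    endpoint: str,
    transport: Callable[[str, bytes], dict],
) -> str:
    """Send the caller's transcribed question to a RAG API, return answer text."""
    payload = json.dumps({"query": question}).encode("utf-8")
    # In production, `transport` would be an HTTP POST to your RAG endpoint.
    response = transport(endpoint, payload)
    return response.get("answer", "")

# Fake transport standing in for the real HTTP call:
def fake_transport(endpoint: str, payload: bytes) -> dict:
    assert json.loads(payload)["query"]  # payload carries the question
    return {"answer": "Yes, there is a free trial."}

print(query_rag("Do you have a free trial?", "https://example.com/rag", fake_transport))
# → Yes, there is a free trial.
```

Replacing the backend then means changing only the endpoint and payload format inside this one function, which matches the "few line changes" claim above.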
UPDATE: I implemented a “typing sound” while it is “fetching” the information from the RAG, and some initial testers seemed to like the typing sound more than the actual responses, LOL. The slight delay is no longer an issue; users seem to think it’s a feature rather than a bug.
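A hypothetical sketch of that filler-audio trick (not the actual implementation from the repo): while the RAG lookup is in flight, push chunks of a short typing-sound clip back over the Twilio media stream. Twilio media streams carry 8 kHz mu-law audio, base64-encoded inside "media" messages.

```python
import base64
import json

def typing_sound_message(stream_sid: str, mulaw_chunk: bytes) -> str:
    """Wrap a raw mu-law audio chunk in a Twilio media-stream message."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(mulaw_chunk).decode("ascii")},
    })

# While awaiting the RAG answer, send these messages on the stream
# WebSocket at ~20 ms intervals. The bytes here are a stand-in; a real
# implementation would read chunks from a pre-encoded typing-sound file.
msg = typing_sound_message("MZ0123", b"\x7f" * 160)  # one 20 ms frame
print(msg[:60])
```

Once the RAG answer arrives, you stop sending filler frames and forward the Realtime API's audio instead.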
PS: The latency is temporary; I’m sure things will get faster soon, down to sub-second levels. Of the 6-second delay, 4 seconds is gpt-4o-mini. (I tried faster models via Groq, but they returned terrible results, so the speed/accuracy tradeoff was not worth it.)