OpenAI Realtime API w/ Twilio + RAG == AI Call Center

I’d like to share this POC that connects the OpenAI Realtime API to a Twilio phone number and then augments it with RAG knowledge.

So something like this lets you “Talk to a Website” – yes, literally calling a phone number and talking to the contents of a website (like a support helpdesk).

I’m super excited about this because it democratizes the concept of a fully-automated AI call center – one that your users can call to talk about anything related to your websites or documents.

What can you do with this: Automated Phone Support

Live Demo: Call +1 850-779-0711 or WhatsApp Call

Disclaimer: Will probably keep this demo number up for a few days. You can ask it questions like “Do you have a free trial?” or “Can I ingest documents from my Google Drive?”

How is this done?

  • The call management is handled by Twilio Voice
  • The AI and voice interactions are handled by the OpenAI Realtime API
  • The RAG and knowledge management is handled by CustomGPT.ai (“fast mode”) – a rough wiring sketch follows below
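For anyone curious about the plumbing, here is a minimal sketch of how the pieces wire together. The hostnames, route names, and structure are illustrative assumptions, not the exact code in the repo: Twilio opens a Media Streams WebSocket to your server, and the server bridges that audio to an OpenAI Realtime WebSocket session.

```python
# Minimal sketch of a Twilio <-> OpenAI Realtime bridge (URLs and routes are placeholders).
import asyncio, json, os

import websockets                       # pip install websockets
from fastapi import FastAPI, WebSocket  # pip install fastapi uvicorn
from fastapi.responses import Response

app = FastAPI()
OPENAI_WS = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

@app.post("/incoming-call")
async def incoming_call():
    # Twilio hits this webhook when the number is dialed; the TwiML tells it
    # to open a media stream to our WebSocket endpoint.
    twiml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<Response><Connect>"
        '<Stream url="wss://your-server.example.com/media-stream"/>'
        "</Connect></Response>"
    )
    return Response(content=twiml, media_type="application/xml")

@app.websocket("/media-stream")
async def media_stream(twilio_ws: WebSocket):
    await twilio_ws.accept()
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    async with websockets.connect(
        OPENAI_WS, extra_headers=headers  # 'additional_headers' on websockets >= 14
    ) as openai_ws:
        # Tell the Realtime session to hear and speak Twilio's 8 kHz mu-law audio.
        await openai_ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "input_audio_format": "g711_ulaw",
                "output_audio_format": "g711_ulaw",
                "turn_detection": {"type": "server_vad"},
            },
        }))
        stream_sid = None

        async def twilio_to_openai():
            nonlocal stream_sid
            async for message in twilio_ws.iter_text():
                data = json.loads(message)
                if data["event"] == "start":
                    stream_sid = data["start"]["streamSid"]
                elif data["event"] == "media":
                    await openai_ws.send(json.dumps({
                        "type": "input_audio_buffer.append",
                        "audio": data["media"]["payload"],  # already base64 mu-law
                    }))

        async def openai_to_twilio():
            async for message in openai_ws:
                event = json.loads(message)
                if event["type"] == "response.audio.delta":
                    await twilio_ws.send_text(json.dumps({
                        "event": "media",
                        "streamSid": stream_sid,
                        "media": {"payload": event["delta"]},  # base64 mu-law back to caller
                    }))

        await asyncio.gather(twilio_to_openai(), openai_to_twilio())
```

The RAG lookup plugs into this bridge as the step where the caller’s question is sent to the knowledge base before the model answers (a sketch of that call is further down the thread).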

Features:

  • Human-like interactions similar to ChatGPT AVM (Advanced Voice Mode)
  • Anti-hallucination (answers are based on your content - not generic ChatGPT knowledge)
  • Can understand and respond in multiple languages - even if content is in English

Limitations:

  • The tone in OpenAI’s realtime voice is still quite robotic
  • There are some odd pauses while the model is thinking and fetching information.
  • Sometimes phonetic transcriptions don’t work well …

Next Steps:

  • The biggest drawback right now is that it seems a little robotic – would be great to hear if the community has some ideas.
  • I’d really like to know what features need to be implemented in this to have parity with traditional IVRs (PS: I know one feature will be “Press 0 for operator” to give users an OUT if they are frustrated) – what else? (see the DTMF sketch below)
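On the “Press 0 for operator” idea, purely as a sketch of one possible approach (not something in the repo): Twilio Media Streams forwards keypad presses as dtmf events on the same WebSocket, and the live call can be handed to a human by updating it with new TwiML via the Twilio REST API. The call SID is available from the stream’s start event; the operator number below is a placeholder.

```python
# Hypothetical "press 0 for operator" escape hatch; the operator number and the
# call_sid plumbing are assumptions for illustration.
import os
from twilio.rest import Client  # pip install twilio

twilio_client = Client(os.environ["TWILIO_ACCOUNT_SID"], os.environ["TWILIO_AUTH_TOKEN"])
OPERATOR_NUMBER = "+15555550100"  # placeholder

def handle_twilio_event(data: dict, call_sid: str) -> bool:
    """Return True if the caller pressed 0 and was handed off to a human."""
    if data.get("event") == "dtmf" and data["dtmf"]["digit"] == "0":
        # Rewrite the live call's TwiML so Twilio dials the operator instead.
        twilio_client.calls(call_sid).update(
            twiml=f"<Response><Say>Transferring you now.</Say>"
                  f"<Dial>{OPERATOR_NUMBER}</Dial></Response>"
        )
        return True
    return False
```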

More info and code (GitHub) – contributions welcome:

h/t @bill.french (you had talked about this a year ago)


It’s good, but a bit slow in its responses. Does the Realtime API have something equivalent to “streaming”?


No: the Realtime API has either server-side voice activity detection, which listens for silence before starting the response, or you can trigger the start of language inference yourself. It still operates like a transformer language model: tokens are placed in a context window (or simply a buffer, not bound to a particular inference server), and having the AI generate after that is a deliberate action.
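In concrete Realtime API event terms, those two modes look roughly like this (a sketch; the parameter values are illustrative):

```python
import json

# Mode 1: server-side VAD decides when the caller has stopped talking.
server_vad = {
    "type": "session.update",
    "session": {"turn_detection": {"type": "server_vad", "silence_duration_ms": 500}},
}

# Mode 2: no VAD; the application decides when to close the turn and ask for a reply.
manual_turns = {
    "type": "session.update",
    "session": {"turn_detection": None},
}
commit_and_respond = [
    {"type": "input_audio_buffer.commit"},  # close out the user's audio turn
    {"type": "response.create"},            # explicitly start inference
]

# ws is an open WebSocket to the Realtime API, e.g.:
# await ws.send(json.dumps(server_vad))
```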

OpenAI has just made a stateful chat interface, with less transparency and permanence than Assistants, that again follows the paradigm of server-encoded tokens instead of accepting low-bandwidth token numbers – in this case keeping the audio tokenization a trade secret, as well as that of images.

The equivalent of streaming would be something I just invented for you (though I’m not the smartest person in the universe, so I won’t call it non-obvious, nor have I implemented it), called predictive closure: after a packet of tokens, an inference is run to see whether the AI emits a “they’re still talking” token that is discarded, or instead initiates a response to the user, following reinforcement learning on incomplete speech patterns. Then (requiring more advanced training) you would intersperse newly received audio in the user container with AI generation, to see whether the AI wants to continue past noise or listening acknowledgements, or wants to emit a stop sequence when interrupted. All just for the dubious application of pretending to be in a sci-fi movie.


Thanks Thomas – yeah, there seems to be a 5–10 second lag after the question is asked (to retrieve the information from the RAG). That is too big (it should be sub-second); we are investigating why that is happening.

PS: I’ve shared the code – you should be able to replace the CustomGPT part with your own RAG API with just a few line changes (a hypothetical sketch is below).
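For illustration only – the endpoint, auth header, payload, and response field below are placeholders, not the actual CustomGPT.ai API or the repo’s code – the RAG hook boils down to one async HTTP call that the voice agent makes with the caller’s question:

```python
# Hypothetical drop-in RAG lookup; swap the placeholder endpoint, payload, and
# response field for whatever your retrieval service actually expects.
import os
import httpx  # pip install httpx

async def query_rag(question: str) -> str:
    async with httpx.AsyncClient(timeout=10) as client:
        resp = await client.post(
            "https://your-rag-service.example.com/query",   # placeholder endpoint
            headers={"Authorization": f"Bearer {os.environ['RAG_API_KEY']}"},
            json={"query": question},
        )
        resp.raise_for_status()
        return resp.json()["answer"]                         # placeholder field name
```

The Realtime session would expose something like this as a tool/function call and then speak the returned answer text.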

UPDATE: I implemented a “typing sound” (sketch below) while it is fetching the information from the RAG – and some initial testers seemed to like the typing sound more than the actual responses - LOL … the slight delay is no longer an issue; users seem to think it’s a feature rather than a bug.
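In case anyone wants to replicate the typing-sound trick, one way to do it (a sketch; the clip file name and frame size are assumptions) is to stream a pre-encoded 8 kHz mu-law clip back over the Twilio media stream while the RAG request is in flight, and stop as soon as the answer arrives:

```python
# Sketch of the "typing sound" filler: stream a pre-encoded 8 kHz mu-law clip
# to the caller while the RAG lookup is pending (file name is a placeholder).
import asyncio, base64, json

async def play_typing_sound(twilio_ws, stream_sid: str, stop: asyncio.Event):
    with open("typing_8khz.ulaw", "rb") as f:   # raw mu-law audio, 8000 bytes/second
        clip = f.read()
    while not stop.is_set():
        for i in range(0, len(clip), 160):      # 160 bytes = 20 ms per frame
            if stop.is_set():
                return
            await twilio_ws.send_text(json.dumps({
                "event": "media",
                "streamSid": stream_sid,
                "media": {"payload": base64.b64encode(clip[i:i + 160]).decode()},
            }))
            await asyncio.sleep(0.02)           # pace the frames in real time
```

The bridge sets the stop event as soon as the RAG answer comes back, then lets the model speak it.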

PS: The latency is temporary – I’m sure things will get faster soon, down to sub-second levels. Of the 6-second delay, 4 seconds is gpt-4o-mini. (I tried faster models via Groq, but they returned terrible results, so the speed/accuracy tradeoff was not worth it.)