Realtime API - What events should be handled? (e.g. for call centers)

The Realtime API is a game changer and the reference implementations are quite good to get started.

However, I wish there was some cookbook that shows the events and UX cases that OpenAI handled in the Advanced Voice Mode – this will help us create similar apps.

For example, I just got a POC done for Realtime API + RAG and it is working great as far as functionality is concerned (the automated call center is answering questions based on the RAG knowledgebase) – but the UX is terrible and very un-humanlike.

Issues I am having:

  1. What is a good human-like start to the interaction and how to implement it? (how did AVM do it?)
  2. How to handle silence?
  3. How and when to terminate the call? (e.g. if the user fell asleep or got distracted – can’t have the voice call go on while Twilio and OpenAI charges accumulate)
  4. How to implement interruptions correctly? (sometimes the Realtime API keeps talking for a minute)

These are just the quick UX issues I see in 5 mins of testing – wish there was a reference implementation.

2 Likes

One thing to note is that the cost of output is based on generated tokens, not minutes. That matters because the audio is generated a lot faster than it is played back. You can terminate the audio playback, but it’s not going to save you any money because the audio has already been generated.
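For what it’s worth, here is a rough sketch of what terminating playback on an interruption could look like. It assumes a Twilio Media Streams bridge and that you react to the Realtime server event `input_audio_buffer.speech_started`. The helper name `barge_in_messages` and the `played_ms` bookkeeping are hypothetical; you would have to track how much audio the caller actually heard yourself.

```python
import json


def barge_in_messages(stream_sid, item_id, played_ms):
    """Build the messages to send when the caller talks over the assistant.

    Returns (to_openai, to_twilio), each a list of JSON strings.
    """
    to_openai = [
        # Stop the response if it is still being generated.
        json.dumps({"type": "response.cancel"}),
        # Tell the model how much of its audio the caller actually heard,
        # so the conversation history matches what was played.
        json.dumps({
            "type": "conversation.item.truncate",
            "item_id": item_id,
            "content_index": 0,
            "audio_end_ms": played_ms,
        }),
    ]
    # Ask Twilio to drop audio it has buffered but not yet played.
    to_twilio = [json.dumps({"event": "clear", "streamSid": stream_sid})]
    return to_openai, to_twilio
```

This stops the assistant from talking on, but as noted it will not refund tokens that were already generated.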

I suspect that the best practice for keeping output costs in check is just going to be to use max_tokens.
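A minimal sketch of setting such a cap at the session level. The field name `max_response_output_tokens` is an assumption taken from the beta Realtime API reference and may differ in the version you are using, so check it against the current docs. The helper name and the default of 512 are made up for illustration.

```python
import json


def session_update_with_cap(max_output_tokens=512):
    """Return a session.update event that caps output tokens per response."""
    return json.dumps({
        "type": "session.update",
        "session": {
            # Assumed field name; verify against the current API reference.
            "max_response_output_tokens": max_output_tokens,
        },
    })
```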

Another thing that looks like a best practice is to add a “push-to-talk” feature to your app. That has a couple of benefits:

  • Keeps background noise from accidentally triggering an interruption or response which helps keep cost down.
  • Avoids sensitivity issues with the server_vad threshold settings. The user has to explicitly initiate an interruption.
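
Roughly, push-to-talk means turning off server-side turn detection and telling the API yourself when the user has finished speaking. A sketch of the events involved, with hypothetical helper names:

```python
import json


def disable_server_vad():
    """session.update that switches off automatic turn detection."""
    return json.dumps({
        "type": "session.update",
        "session": {"turn_detection": None},  # serialized as JSON null
    })


def on_talk_button_released():
    """Events to send once the user lets go of the talk button."""
    return [
        # Finalize the audio appended while the button was held down.
        json.dumps({"type": "input_audio_buffer.commit"}),
        # Explicitly ask the model to respond to the committed audio.
        json.dumps({"type": "response.create"}),
    ]
```

While the button is held, the client keeps sending `input_audio_buffer.append` events as usual.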
3 Likes

What’s the issue with silence? I’ve verified that you’re not charged by OpenAI for silence. You are charged by Twilio, of course. Standard practice with voice calls is to time out after a period of inactivity, since it’s easy enough to measure how long the silence has lasted.

Also worth noting that you can employ many of the techniques you would in traditional IVR. You can pre-record certain responses like “are you still there?” and play them back for free.

3 Likes

You’re spot on! The issue with silence isn’t from OpenAI’s side—they don’t charge for silence since they bill based on the tokens generated. The problem is with services like Twilio, which charge for the time the call is active, whether anyone’s speaking or not. So, if there’s dead air on the line, it’s Twilio that’s quietly racking up the bill, not OpenAI.

Why It’s an Issue in Voice Calls:

  • Twilio charges by the minute for active calls, so if the user is just sitting there silently, you’re still getting billed. That’s why timeout practices are so common in voice services—after a certain period of inactivity, you play a pre-recorded message or just terminate the call to stop those charges from piling up.

Pre-recorded Responses:

This is a smart move! You can use pre-recorded messages for common prompts like “Are you still there?” or “Press any key if you’d like to continue.” Since they’re pre-recorded, you’re not sending any new requests to OpenAI, so there’s no extra cost for generating responses, but it still keeps the call flowing naturally.
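
As a sketch of how a pre-recorded prompt could be played without touching OpenAI, assuming a bidirectional Twilio Media Stream and a clip already stored as raw 8 kHz mu-law bytes with no file header. The function name and the 20 ms framing are illustrative choices.

```python
import base64
import json

FRAME_BYTES = 160  # 20 ms of 8 kHz, 8-bit mu-law audio


def prerecorded_media_messages(stream_sid, mulaw_audio):
    """Split a raw mu-law clip into Twilio "media" messages."""
    messages = []
    for start in range(0, len(mulaw_audio), FRAME_BYTES):
        frame = mulaw_audio[start:start + FRAME_BYTES]
        messages.append(json.dumps({
            "event": "media",
            "streamSid": stream_sid,
            # Twilio expects the audio payload base64-encoded.
            "media": {"payload": base64.b64encode(frame).decode("ascii")},
        }))
    return messages
```

Each returned string is sent over the same WebSocket that Twilio opened for the call.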

Best Practice for Silence:

  • It’s pretty standard to set a timeout after, say, 20-30 seconds of silence. Right before the timeout, you can play a pre-recorded prompt asking the user if they’re still there. If there’s no response, you end the call to save on Twilio charges.
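
The prompt-then-hang-up flow above can be sketched as a small timer. The class name, the 20 and 30 second thresholds, and the `"prompt"` / `"hangup"` return values are all illustrative assumptions, not anything from the reference implementations.

```python
class SilenceMonitor:
    """Decide when to play "are you still there?" and when to end the call."""

    def __init__(self, start, prompt_after=20.0, hangup_after=30.0):
        self.prompt_after = prompt_after
        self.hangup_after = hangup_after
        self.last_activity = start
        self.prompted = False

    def on_activity(self, now):
        """Call whenever the caller speaks or the assistant finishes talking."""
        self.last_activity = now
        self.prompted = False

    def tick(self, now):
        """Return "hangup", "prompt", or None for the current moment."""
        idle = now - self.last_activity
        if idle >= self.hangup_after:
            return "hangup"
        if idle >= self.prompt_after and not self.prompted:
            self.prompted = True  # only prompt once per silent stretch
            return "prompt"
        return None
```

In a real bridge you would call `tick(time.monotonic())` about once a second, play the pre-recorded clip on `"prompt"`, and end the Twilio call on `"hangup"`.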

All in all, managing silence is mostly about cost-saving on the Twilio side and ensuring that the user experience stays smooth and doesn’t feel like they’re just hanging on a dead line.

1 Like

I worked on IVR systems back in the 90’s before I joined Microsoft in 2003. The Microsoft Bot Framework started from my hackathon project so it feels like I’ve been working on chat bots for almost 30 years :)

There are a lot of tricks, like pre-recorded responses, that are still very much relevant in this generative AI world…

3 Likes

One thing to note is that the cost of output is based on generated tokens, not minutes. That matters because the audio is generated a lot faster than it is played back. You can terminate the audio playback, but it’s not going to save you any money because the audio has already been generated.

Good insights … that helps on the OpenAI end, though it might add to the Twilio costs.

Another thing that looks like a best practice is to add a “push-to-talk” feature to your app. That has a couple of benefits:

Thanks, good suggestion, but that is very un-humanlike; the hope is to replicate the AVM experience. (PS: In my particular case, the user is interacting over the phone, as with a call center. I’m implementing a fully automated “OpenAI Realtime API + RAG” call center, so push-to-talk is not an option. It is a great suggestion for smartphone apps, though.)

3 Likes

Awesome – will implement that (PS: I was hoping to see a ready-to-go cookbook from OpenAI if it is standard practice)

Also worth noting that you can employ many of the techniques you would in traditional IVR. You can pre-record certain responses like “are you still there?” and play them back for free.

Good suggestion. (I typically want to avoid the pre-recording because it would prevent voice and language customization, no?)

By the way, in my testing, I was able to request the voice to switch language – that is pretty cool to see. By saying “Please respond in Hindi”, it started responding in Hindi (accent was pretty good too!)

2 Likes

Nice. Yeah, it feels like we need to dust off some of those IVR practices and marry them with the OpenAI Realtime API to create the 2024 UX of phone calls.

We got the Realtime+RAG working two days ago – and have been struggling with UX since (silence, interruptions, anti-hallucination, initial message UX, things like that)

3 Likes

UPDATE: Just posted the source code for the POC: OpenAI Realtime API w/ Twilio + RAG == AI Call Center

2 Likes