Hi OpenAI forum,
We are experimenting with the Realtime API to make outbound phone calls. Some calls involve a long wait (20–40 minutes) during which there is either hold music or silence, and I want to make sure we are not accumulating a huge Realtime API cost during that time.
What are some ideas for optimizing this scenario? Much appreciated!
The Realtime API does not bill you per unit of time, but per input/output tokens. There is also a 15-minute idle connection limit. Can you please elaborate on the exact issue?
Here’s a typical call scenario that I want to automate using the Realtime API:
AI: makes the outbound call
Human: answers the call
AI: describes the issue
Human: asks the AI to hold for 20–30 minutes
(hold music lasting 20–30 minutes)
Human: tells the AI the next step and ends the call.
In this scenario, the actual communication between the human and the AI is just 1–2 minutes and very minimal, but the hold time is very long and filled with noise and music. I’m wondering what the best way is to automate this call while avoiding high Realtime API costs.
I may be wrong, but this could be achieved with just async TTS and STT.
If you want to go realtime, though, the most sensible and cost-optimal way is to have two separate realtime sessions (a sketch follows at the end of this post).
The first one ends when the hold starts, and you save the context of that conversation somewhere in your system.
The second one starts when the hold ends, and you initialize it with the context from the first.
However, you would also have to introduce a smaller model or a VAD (voice activity detection) system to detect when human speech resumes, so that you know when to initialize the second session.
There isn’t much in the way of alternatives because of the 15-minute idle limit. You could perhaps emulate activity by sending arbitrary events, but that’s questionable.
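To make the two-session idea concrete, here is a minimal Python sketch. It assumes the WebSocket endpoint and the `conversation.item.create` client event from the Realtime API beta docs; treat the exact URL, headers, and payload shapes as assumptions and check the current reference. The transcript string is a stand-in for however you persist the first session’s context.

```python
import asyncio
import json
import os

import websockets  # pip install websockets

REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

async def start_session(prior_transcript=None):
    """Open a realtime session, optionally seeding it with a saved transcript."""
    # additional_headers is the websockets>=14 name; older releases call it extra_headers.
    ws = await websockets.connect(REALTIME_URL, additional_headers=HEADERS)
    if prior_transcript:
        # Re-inject the first session's conversation as a context item so the
        # model can pick up where the call left off after the hold.
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "message",
                "role": "user",
                "content": [{
                    "type": "input_text",
                    "text": "Context from before the hold:\n" + prior_transcript,
                }],
            },
        }))
    return ws

async def main():
    # Session 1: live conversation until the human puts the call on hold.
    session1 = await start_session()
    transcript = "AI described the issue; human asked us to hold 20-30 minutes."
    await session1.close()  # stop streaming audio in during the hold

    # ... a cheap VAD loop watches the line for real speech (see later in this thread) ...

    # Session 2: resumes with the saved context once the human returns.
    session2 = await start_session(prior_transcript=transcript)
    await session2.close()

asyncio.run(main())
```

Since billing is per token, closing the socket during the hold means no audio is being streamed in while the music plays, which is where the savings come from.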
Appreciate the response. I was thinking about a similar approach.
Since you mentioned async TTS and STT, do you think their response speed is as good as realtime? I’m also wondering whether they are more suitable than realtime for handling calls like this. (I only started playing with the Realtime API today, so I don’t have a strong opinion on which one to use, but I’d like it to sound as close to a human conversation as possible.)
It depends on your exact requirements, mainly what you want to do when you get a follow-up from the human. If latency is a concern, you will either have to look for realtime solutions or bootstrap a hybrid approach along the lines of “play this one pre-generated part while I send a request to generate the rest of the response” (a sketch follows below), but that doesn’t mean OpenAI’s new Realtime API is the ultimate go-to.
If multiple conversation turns are expected, though, then the OpenAI Realtime API is the best bet.
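To illustrate the hybrid idea, here is a rough asyncio sketch. `synthesize()` and `play_audio()` are hypothetical stand-ins for your async TTS call and your telephony playback, not real library functions:

```python
import asyncio

# Hypothetical stand-ins for your telephony playback and your async TTS endpoint.
async def play_audio(path: str) -> None: ...
async def synthesize(text: str) -> str: ...

CANNED_ACK = "ack.wav"  # pre-generated filler, e.g. "Sure, let me check on that."

async def respond(user_utterance: str) -> None:
    # Kick off generation of the real answer immediately...
    answer_task = asyncio.create_task(synthesize(user_utterance))
    # ...and mask the latency by playing a canned acknowledgement meanwhile.
    await play_audio(CANNED_ACK)
    # By the time the filler finishes, the real answer should be ready or close.
    await play_audio(await answer_task)
```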
Appreciate the reminder. This isn’t the robocall scenario you’re alluding to; we are automating some customer support workflows that involve outbound calls from one department to another.
You might have to run your own voice activity detector, such as webrtcVAD. Gather statistics over the stream of audio buffers the library reports on, and conclude that someone is actually talking when, over four seconds or more, a very high percentage of frames is classified as speech with high certainty.
These detectors are tuned to trigger on human speech, and they will also adapt to background noise levels (although they need an adaptation period, for example when first listening to a noisy environment).
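As a minimal sketch of that trigger logic with the `webrtcvad` package (pip install webrtcvad): the 16 kHz / 30 ms framing is one of the formats the library accepts, and the four-second window and 90% threshold are assumptions you would tune.

```python
import collections

import webrtcvad  # pip install webrtcvad

SAMPLE_RATE = 16000                       # webrtcvad accepts 8/16/32/48 kHz
FRAME_MS = 30                             # frames must be 10, 20, or 30 ms
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit mono PCM
WINDOW_FRAMES = 4000 // FRAME_MS          # ~4 seconds' worth of frames
SPEECH_RATIO = 0.9                        # "very high percentage" threshold

vad = webrtcvad.Vad(3)                    # aggressiveness 0-3; 3 filters non-speech hardest
window = collections.deque(maxlen=WINDOW_FRAMES)

def human_returned(frame: bytes) -> bool:
    """Feed one 30 ms PCM frame; True once ~4 s of the window is flagged as speech."""
    assert len(frame) == FRAME_BYTES
    window.append(vad.is_speech(frame, SAMPLE_RATE))
    return (len(window) == WINDOW_FRAMES
            and sum(window) / WINDOW_FRAMES >= SPEECH_RATIO)
```

One caveat: hold music with vocals can look like speech to a VAD, which is why the sustained-window statistic matters more than any single frame.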