Hi OpenAI forum,
We are experimenting with using the Realtime API to make outbound phone calls. For calls that can involve a long (20-40 minute) hold, with either hold music playing or just silence, I want to make sure we are not racking up crazy Realtime API costs.
I’m wondering what some ideas to optimize this scenario might be? Much appreciated!
The Realtime API does not bill you per unit of time, but for input/output tokens. Also, there is a 15-minute idle connection limit. Can you please elaborate on what the exact issue is?
Here’s a typical call scenario that I want to automate using the Realtime API:
AI: makes the outbound call
Human: answers the call
AI: describes the issue
Human: asks the AI to hold for 20-30 minutes
(hold music lasting 20-30 minutes)
Human: tells the AI the next step, ends the call
In this scenario, the actual communication between the human and the AI is just 1-2 minutes and very minimal, but the hold time is very long, with noise and music. I’m wondering what the best way to automate this call is while avoiding high Realtime API costs.
I may be wrong, but this could be achieved with just async TTS and STT.
If you want to do realtime though, the most sensible and optimal way is to have two separate realtime sessions.
The first one ends when the hold starts, and you save the context of that conversation somewhere in your system.
The second one starts when the hold ends, and you initialize it with the context from the first (see the sketch below).
However, you would also have to introduce a smaller model or a VAD system to detect when human speech resumes, so that you know when to initialize the second session.
There isn’t much in the way of alternatives because of the 15-minute idle limit. Maybe you could emulate activity by sending arbitrary events, but that’s questionable.
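To make the handoff concrete, here is a rough sketch, assuming hypothetical helpers: connect_realtime() opens a new Realtime WebSocket, wait_for_human_speech() is your VAD trigger, and transcript_so_far is the context you saved when the first session ended:

import json

async def resume_after_hold(connect_realtime, wait_for_human_speech, transcript_so_far):
    # Session 1 was closed when the hold music started; wait until the
    # VAD reports that a human is actually talking again.
    await wait_for_human_speech()

    # Open session 2 and seed it with the saved context.
    openai_ws = await connect_realtime()
    await openai_ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "instructions": (
                "You are resuming a phone call that was placed on hold. "
                "Conversation so far:\n" + transcript_so_far
            ),
        },
    }))
    return openai_ws

Stuffing the transcript into the instructions is just one option; you could also replay it turn by turn with conversation.item.create events.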
Appreciate the response. I was thinking about a similar approach.
Since you mentioned async TTS and STT, do you think the response speed is as good as realtime? I’m also wondering whether they are more suitable than realtime for handling calls like this? (I only started playing with the Realtime API today, so I don’t have a strong opinion on which one to use, but I would like it to sound as close to a human conversation as possible.)
It depends on your exact requirements, mainly what you want to do when you get a follow-up from the human. If latency is a concern, then you will either have to look at realtime solutions or bootstrap a hybrid approach along the lines of “play this pre-generated part while I send a request to generate the rest of the response” (sketched below), but that doesn’t mean OpenAI’s new Realtime API is the ultimate go-to.
If multiple conversation turns are expected, though, then the OpenAI Realtime API is the best bet.
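A minimal sketch of that hybrid idea, assuming a play_audio() callback that streams bytes to the caller and a pre-generated filler clip; the model, voice, and file names are just examples:

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()
# A short clip like "One moment, please" rendered ahead of time (hypothetical file).
FILLER = open("one_moment_please.wav", "rb").read()

async def respond(user_text: str, play_audio):
    # Kick off generation of the real answer immediately...
    answer_task = asyncio.create_task(client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_text}],
    ))
    # ...and mask the latency with the canned filler in the meantime.
    await play_audio(FILLER)

    answer = (await answer_task).choices[0].message.content
    speech = await client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
    await play_audio(speech.content)  # raw bytes of the synthesized reply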
Appreciate the reminder, but this is not the robocall scenario you are suggesting. We are automating some customer support workflows that involve outbound calls from one department to another.
You might have to look at running your own voice activity detector (VAD), such as webrtcvad. Then gather statistics on the stream of audio buffers the library reports on, and check whether someone is actually talking, e.g. a very high percentage of packets flagged as high-certainty speech over four seconds or more (see the sketch below).
These detectors are tuned to trigger on human speech and will also adapt to background noise levels (although they need an adaptation period, for example when listening to a noisy environment).
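A minimal sketch of that with the webrtcvad package. Twilio media frames are 8 kHz mu-law in 20 ms chunks, so each payload is decoded to 16-bit PCM first; the 80% threshold and 4-second window are starting points to tune, not exact values (audioop is stdlib through Python 3.12; use the audioop-lts backport on 3.13+):

import audioop
import collections
import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness 0-3; higher filters more noise
SAMPLE_RATE = 8000
FRAME_MS = 20  # webrtcvad accepts 10, 20, or 30 ms frames
window = collections.deque(maxlen=4000 // FRAME_MS)  # ~4 s rolling window

def human_is_talking(mulaw_frame: bytes) -> bool:
    pcm = audioop.ulaw2lin(mulaw_frame, 2)  # mu-law -> 16-bit linear PCM
    window.append(vad.is_speech(pcm, SAMPLE_RATE))
    # Only trigger on sustained speech across the whole window.
    return len(window) == window.maxlen and sum(window) / len(window) > 0.8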
Tools are the way to go; you probably didn’t implement them right. Feel free to make a new post so we don’t go off topic here. Mention me in it and I’ll help you with your issue when I have time.
I’m integrating OpenAI tool calls with Twilio, and I’m trying to implement a “hangup_call” function. I want to ensure I’m handling both Twilio’s WebSocket for media streaming and OpenAI’s tool calls correctly. Below are the relevant snippets for both the WebSocket handling and the OpenAI function call invocation.
WebSocket for Media Streaming from Twilio:
import asyncio
import base64
import json
import logging

import websockets
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

# AZURE_OPENAI_* settings and initialize_session() are defined elsewhere in the app.
logger = logging.getLogger(__name__)
app = FastAPI()
active_connections: dict = {}

@app.websocket("/media-stream")
async def handle_media_stream(websocket: WebSocket):
    logger.info("WebSocket connection opened.")
    await websocket.accept()

    stream_sid = None
    call_sid = None

    # Connecting to OpenAI WebSocket
    streaming_endpoint = AZURE_OPENAI_API_ENDPOINT.rstrip("/")
    streaming_endpoint += f"/openai/realtime?api-version=2024-10-01-preview&deployment={AZURE_OPENAI_DEPLOYMENT_NAME}"
    logger.info(f"Connecting to Azure OpenAI WebSocket at: {streaming_endpoint}")

    async with websockets.connect(
        streaming_endpoint,
        additional_headers={"api-key": AZURE_OPENAI_API_KEY},
    ) as openai_ws:
        # Send session instructions to OpenAI
        await initialize_session(openai_ws)

        async def receive_from_twilio():
            nonlocal stream_sid, call_sid
            try:
                async for message in websocket.iter_text():
                    data = json.loads(message)
                    logger.debug(f"📥 Raw Twilio Input: {data}")
                    if data.get("event") == "start":
                        stream_sid = data["start"].get("streamSid")
                        # The call SID only arrives with the "start" event,
                        # so the active connections are registered here.
                        call_sid = data["start"].get("callSid")
                        active_connections[call_sid] = {
                            "twilio_ws": websocket,
                            "openai_ws": openai_ws,
                        }
                        logger.info(f"🚀 Stream started | SID: {stream_sid}")
                    elif data.get("event") == "media":
                        logger.debug("🔊 Received audio chunk from Twilio")
                        audio_append = {
                            "type": "input_audio_buffer.append",
                            "audio": data["media"]["payload"],
                        }
                        await openai_ws.send(json.dumps(audio_append))
                        logger.debug("⬆️ Forwarded audio to OpenAI")
            except WebSocketDisconnect:
                logger.warning("⚠️ Twilio WebSocket disconnected")
                await close_connections(call_sid)
            except Exception as e:
                logger.error(f"🔴 Twilio Receive Error: {str(e)}")

        async def send_to_twilio():
            try:
                async for openai_message in openai_ws:
                    response_data = json.loads(openai_message)
                    if response_data.get("type") == "response.audio.delta":
                        logger.info("🔊 Sending audio response to Twilio")
                        # Forwarding audio to Twilio; outbound media messages
                        # must carry the streamSid or Twilio drops them.
                        audio_payload = base64.b64encode(
                            base64.b64decode(response_data["delta"])
                        ).decode("utf-8")
                        outgoing = {
                            "event": "media",
                            "streamSid": stream_sid,
                            "media": {"payload": audio_payload},
                        }
                        await websocket.send_json(outgoing)
                        logger.debug("⬇️ Sent audio packet to Twilio")
            except websockets.exceptions.ConnectionClosedOK:
                logger.info("✅ OpenAI Connection Closed Normally")
            except Exception as e:
                logger.error(f"🔴 OpenAI Send Error: {str(e)}")
                await close_connections(call_sid)

        # Start receiving and sending data between Twilio and OpenAI
        await asyncio.gather(receive_from_twilio(), send_to_twilio())
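For context, initialize_session has to declare the hangup_call tool on the session, otherwise the model will never emit a function call for it. Roughly what that session.update looks like in our setup (the instructions text here is illustrative):

async def initialize_session(openai_ws):
    await openai_ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "instructions": (
                "You are a phone agent. When the conversation is over, "
                "call the hangup_call function."
            ),
            "tools": [{
                "type": "function",
                "name": "hangup_call",
                "description": "End the phone call once the conversation is finished.",
                "parameters": {
                    "type": "object",
                    "properties": {"reason": {"type": "string"}},
                },
            }],
            "tool_choice": "auto",
        },
    }))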
Handling Function Call (“hangup_call”):
# Handle the OpenAI function call to hang up the call. Note that the
# Realtime API emits "response.function_call_arguments.done" (function
# name in "name", arguments as a JSON string in "arguments"); there is
# no bare "function_call" event type, so the original check never matched.
elif response_data.get("type") == "response.function_call_arguments.done":
    logger.info("🛠️ Function Call Detected")
    if response_data.get("name") == "hangup_call":
        args = json.loads(response_data.get("arguments") or "{}")
        reason = args.get("reason", "completed")
        logger.info(f"⏹️ Ending call. Reason: {reason}")
        await hangup_call(call_sid)  # Call the hangup function
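For reference, hangup_call ends the live call through Twilio’s REST API; closing the WebSocket alone does not terminate the phone call. A sketch of what it might look like (TWILIO_ACCOUNT_SID and TWILIO_AUTH_TOKEN assumed from config):

from twilio.rest import Client

twilio_client = Client(TWILIO_ACCOUNT_SID, TWILIO_AUTH_TOKEN)

async def hangup_call(call_sid: str):
    # Ending the call makes Twilio tear down the media stream,
    # which in turn closes the WebSocket on our side.
    twilio_client.calls(call_sid).update(status="completed")
    await close_connections(call_sid)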
Closing Twilio WebSocket:
async def close_connections(call_sid: str):
    """Close the Twilio WebSocket connection for a specific call."""
    connections = active_connections.get(call_sid)
    if not connections:
        return
    try:
        twilio_ws = connections.get("twilio_ws")
        if twilio_ws:
            try:
                await twilio_ws.close()  # Close the WebSocket connection
            except Exception as e:
                logger.warning(f"Error closing twilio_ws: {e}")
        del active_connections[call_sid]
        logger.info(f"Closed Twilio WebSocket for call {call_sid}")
    except Exception as e:
        logger.error(f"Error closing Twilio WebSocket for call {call_sid}: {e}")
So the thing is, it doesn’t close the socket when it’s done with the call, and I don’t know what’s wrong.