Hi OpenAI forum,
We are experimenting with the Realtime API to make outbound phone calls. Some calls involve a long wait (20–40 minutes) during which there is either hold music or silence, and I want to make sure we are not accumulating a huge Realtime API cost during that time.
What are some ideas for optimizing this scenario? Much appreciated!
The Realtime API does not bill you per unit of time, but per input/output tokens. There is also a 15-minute idle connection limit. Can you please elaborate on the exact issue?
Here’s a typical call scenario that I want to automate using the Realtime API:
AI: makes the outbound call
Human: answers the call
AI: describes the issue
Human: asks the AI to hold for 20–30 minutes
(hold music lasting 20–30 minutes)
Human: tells the AI the next step and ends the call.
In this scenario, the actual communication between the human and the AI is just 1–2 minutes and very minimal, but the hold time is very long and filled with noise and music. I’m wondering what the best way is to automate this call while avoiding high Realtime API costs.
I may be wrong, but this could be achieved with just async TTS and STT.
If you want to go realtime, though, the most sensible and cost-optimal way is to have two separate realtime sessions (a sketch follows at the end of this post).
The first one ends when the hold starts, and you save the context of that conversation somewhere in your system.
The second one starts when the hold ends, and you initialize it with the context from the first.
However, you would also have to introduce a smaller model or a VAD (voice activity detection) system to detect when human speech resumes, so that you know when to initialize the second session.
There isn’t much in the way of alternatives because of the 15-minute idle limit. You could perhaps emulate activity by sending arbitrary events, but that’s questionable.
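To make the two-session idea concrete, here is a minimal Python sketch. It assumes the WebSocket endpoint and the `conversation.item.create` client event from the Realtime API beta docs; treat the exact URL, headers, and payload shapes as assumptions and check the current reference. The transcript string is a stand-in for however you persist the first session’s context.

```python
import asyncio
import json
import os

import websockets  # pip install websockets

REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

async def start_session(prior_transcript=None):
    """Open a realtime session, optionally seeding it with a saved transcript."""
    # additional_headers is the websockets>=14 name; older releases call it extra_headers.
    ws = await websockets.connect(REALTIME_URL, additional_headers=HEADERS)
    if prior_transcript:
        # Re-inject the first session's conversation as a context item so the
        # model can pick up where the call left off after the hold.
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "message",
                "role": "user",
                "content": [{
                    "type": "input_text",
                    "text": "Context from before the hold:\n" + prior_transcript,
                }],
            },
        }))
    return ws

async def main():
    # Session 1: live conversation until the human puts the call on hold.
    session1 = await start_session()
    transcript = "AI described the issue; human asked us to hold 20-30 minutes."
    await session1.close()  # stop streaming audio in during the hold

    # ... a cheap VAD loop watches the line for real speech (see later in this thread) ...

    # Session 2: resumes with the saved context once the human returns.
    session2 = await start_session(prior_transcript=transcript)
    await session2.close()

asyncio.run(main())
```

Since billing is per token, closing the socket during the hold means no audio is being streamed in while the music plays, which is where the savings come from.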
Appreciate the response. I was thinking about a similar approach.
Since you mentioned async TTS and STT, do you think their response speed is as good as realtime? I’m also wondering whether they are more suitable than realtime for handling calls like this. (I only started playing with the Realtime API today, so I don’t have a strong opinion on which one to use, but I’d like it to sound as close to a human conversation as possible.)
It depends on your exact requirements, mainly what you want to do when you get a follow-up from the human. If latency is a concern, you will either have to look for realtime solutions or bootstrap a hybrid approach along the lines of “play this one pre-generated part while I send a request to generate the rest of the response” (a sketch follows below), but that doesn’t mean OpenAI’s new Realtime API is the ultimate go-to.
If multiple conversation turns are expected, though, then the OpenAI Realtime API is the best bet.
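To illustrate the hybrid idea, here is a rough asyncio sketch. `synthesize()` and `play_audio()` are hypothetical stand-ins for your async TTS call and your telephony playback, not real library functions:

```python
import asyncio

# Hypothetical stand-ins for your telephony playback and your async TTS endpoint.
async def play_audio(path: str) -> None: ...
async def synthesize(text: str) -> str: ...

CANNED_ACK = "ack.wav"  # pre-generated filler, e.g. "Sure, let me check on that."

async def respond(user_utterance: str) -> None:
    # Kick off generation of the real answer immediately...
    answer_task = asyncio.create_task(synthesize(user_utterance))
    # ...and mask the latency by playing a canned acknowledgement meanwhile.
    await play_audio(CANNED_ACK)
    # By the time the filler finishes, the real answer should be ready or close.
    await play_audio(await answer_task)
```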
Appreciate the reminder. This isn’t the robocall scenario you’re alluding to; we are automating some customer support workflows that involve outbound calls from one department to another.
You might have to run your own voice activity detector, such as webrtcVAD. Gather statistics over the stream of audio buffers the library reports on, and conclude that someone is actually talking when, over four seconds or more, a very high percentage of frames is classified as speech with high certainty.
These detectors are tuned to trigger on human speech, and they will also adapt to background noise levels (although they need an adaptation period, for example when first listening to a noisy environment).
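As a minimal sketch of that trigger logic with the `webrtcvad` package (pip install webrtcvad): the 16 kHz / 30 ms framing is one of the formats the library accepts, and the four-second window and 90% threshold are assumptions you would tune.

```python
import collections

import webrtcvad  # pip install webrtcvad

SAMPLE_RATE = 16000                       # webrtcvad accepts 8/16/32/48 kHz
FRAME_MS = 30                             # frames must be 10, 20, or 30 ms
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit mono PCM
WINDOW_FRAMES = 4000 // FRAME_MS          # ~4 seconds' worth of frames
SPEECH_RATIO = 0.9                        # "very high percentage" threshold

vad = webrtcvad.Vad(3)                    # aggressiveness 0-3; 3 filters non-speech hardest
window = collections.deque(maxlen=WINDOW_FRAMES)

def human_returned(frame: bytes) -> bool:
    """Feed one 30 ms PCM frame; True once ~4 s of the window is flagged as speech."""
    assert len(frame) == FRAME_BYTES
    window.append(vad.is_speech(frame, SAMPLE_RATE))
    return (len(window) == WINDOW_FRAMES
            and sum(window) / WINDOW_FRAMES >= SPEECH_RATIO)
```

One caveat: hold music with vocals can look like speech to a VAD, which is why the sustained-window statistic matters more than any single frame.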