Ensuring AI Speech Completes Before Executing Function Calls

123qws · March 13, 2025, 5:22am

I have the following prompt:

… If the user requests a human agent, respond to them first and then call forward_call.

Here’s the response I received:

'response': {'object': 'realtime.response', 'status': 'completed', 'status_details': None, 'output': [{'id': 'item_BAVAtNrdOG1PWy9N3HqkD', 'object': 'realtime.item', 'type': 'message', 'status': 'completed', 'role': 'assistant', 'content': [{'type': 'audio', 'transcript': "Sure, I'll forward your call to a human agent now. Please hold on a moment."}]}, {'id': 'item_BAVAuBlPcad8CIZAteA4J', 'object': 'realtime.item', 'type': 'function_call', 'status': 'completed', 'name': 'forward_call', 'call_id': 'call_ur8h6tqAhcZMYraQ', 'arguments': '{}'}...

However, the function call (forward_call) happens at the same time as the response. This means if I run the function immediately, it will interrupt the AI’s speech, which is not ideal.

Is there a way to ensure the AI finishes speaking before executing the function call? Or is there a better approach to handling this scenario?

Thanks in advance!

_j · March 13, 2025, 6:00am

Hi!

Sounds like the basic issue here is in your code.

Your prompt gets you output audio that announces the planned function call, followed by the tool call in the same response (instead of the typical function call alone). If announcing the upcoming function call is not useful and you just want the call itself, you can change the prompting approach or the function description.

However, the AI can generate audio faster than its playback rate. Realtime is event-driven and not actually real-time except for input buffering speed.
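To make the gap between generation speed and playback speed concrete, here is a minimal sketch that converts a received audio chunk into the time it takes to play. It assumes mono audio, pcm16 at 24 kHz or G.711 at 8 kHz, and that chunks arrive base64-encoded; `playback_seconds` is a hypothetical helper name, not part of any SDK.

```python
import base64

# Bytes of audio per second of playback, assuming mono output.
# pcm16 at 24 kHz: 24000 samples * 2 bytes. G.711 at 8 kHz: 8000 samples * 1 byte.
BYTES_PER_SECOND = {"pcm16": 48_000, "g711_ulaw": 8_000, "g711_alaw": 8_000}


def playback_seconds(b64_chunk: str, audio_format: str = "pcm16") -> float:
    """Return how long one base64-encoded audio chunk takes to play (hypothetical helper)."""
    num_bytes = len(base64.b64decode(b64_chunk))
    return num_bytes / BYTES_PER_SECOND[audio_format]
```

If the chunks for a sentence arrive within a fraction of a second but add up to several seconds of playback, the function call item that follows them in the same response arrives long before the listener has heard the sentence.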

This needs consideration of exactly what must be held back. If programmed correctly, the only pauses needed would be simply for user experience, inserting a delay between the spoken announcement of the function call to come and the spoken results after function return.

You’d want to stream your audio to a client app that can play a loaded buffer until it is done. Or, if you have a monolithic personal app, spin off async or threaded function call execution that can run while playback is happening, so that playback and function calling do not block each other.
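A rough sketch of the "spin off async execution" idea, using only `asyncio`. Everything here is hypothetical: `play_audio` and `run_tool` are stand-ins whose sleeps simulate real audio output and a real tool, and the names are invented for illustration.

```python
import asyncio


async def play_audio(chunks, seconds_per_chunk, log):
    """Stand-in for a player that drains a loaded buffer at playback speed."""
    for chunk in chunks:
        await asyncio.sleep(seconds_per_chunk)  # simulated playback time
        log.append(f"played {chunk}")


async def run_tool(name, log):
    """Stand-in for a tool function that takes a moment to return."""
    await asyncio.sleep(0.01)  # simulated work
    log.append(f"tool {name} done")
    return {"ok": True}


async def handle_response(chunks, tool_name):
    """Start playback and the tool call together so neither blocks the other."""
    log = []
    playback = asyncio.create_task(play_audio(chunks, 0.05, log))
    tool = asyncio.create_task(run_tool(tool_name, log))
    result = await tool   # the tool can finish while audio is still playing
    await playback        # the announcement still plays to the end
    return result, log
```

This pattern suits tools whose result is spoken afterwards. For a call that ends the session, such as `forward_call`, the ordering would instead need to be "await playback first, then run the tool".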

Hope that helps with what you already know about your programming and backend.

Then: do you want to block voice activity and input stream, so hearing the tool results is mandatory?

123qws · March 13, 2025, 9:38pm

Thanks for the thorough response. That makes perfect sense. I see three ways we could tackle this:

1. We could add a simple 5-second delay before triggering the function. This is the easiest, but it’s definitely a bit of a workaround.
2. We could wait for the current audio to finish playing before running the function to transfer the call. This is a bit more involved, as we’d need to track the audio state in the WebSocket.
3. Alternatively, we could prompt to only trigger the function call without the voice response and then trigger the transfer immediately.
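The second option, waiting for the audio to finish, could be sketched roughly as below. This is an illustration under assumptions, not a definitive implementation: it assumes pcm16 output at 24 kHz mono, that audio is played as soon as it arrives, and that each base64 audio chunk from the WebSocket (the `delta` field of `response.audio.delta` events in the beta API, worth checking against the current docs) is passed to the tracker. `PlaybackTracker`, `pending_function_calls`, `run_calls_after_speech`, and `handlers` are hypothetical names.

```python
import asyncio
import base64
import time

PCM16_BYTES_PER_SECOND = 48_000  # assumption: 24 kHz mono pcm16 output


class PlaybackTracker:
    """Estimates when audio that is played as it arrives will finish (hypothetical)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._ends_at = 0.0

    def on_audio_delta(self, b64_delta: str) -> None:
        seconds = len(base64.b64decode(b64_delta)) / PCM16_BYTES_PER_SECOND
        # New audio queues behind whatever is still playing.
        start = max(self._ends_at, self._clock())
        self._ends_at = start + seconds

    def remaining(self) -> float:
        """Seconds of audio received but not yet played, by this estimate."""
        return max(0.0, self._ends_at - self._clock())


def pending_function_calls(response: dict) -> list:
    """Pick the function_call items out of a completed response's output list."""
    return [i for i in response.get("output", []) if i.get("type") == "function_call"]


async def run_calls_after_speech(response, tracker, handlers, margin=0.0):
    """Wait out the estimated remaining playback, then run each function call."""
    await asyncio.sleep(tracker.remaining() + margin)
    return [
        handlers[call["name"]](call["arguments"])
        for call in pending_function_calls(response)
    ]
```

Compared with a fixed 5-second delay, the wait here scales with the length of the announcement. A small `margin` can absorb buffering in the telephony or client layer, which this estimate does not see.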
