… If the user requests a human agent, respond to them first and then call forward_call.
Here’s the response I received:
'response': {'object': 'realtime.response', 'status': 'completed', 'status_details': None, 'output': [{'id': 'item_BAVAtNrdOG1PWy9N3HqkD', 'object': 'realtime.item', 'type': 'message', 'status': 'completed', 'role': 'assistant', 'content': [{'type': 'audio', 'transcript': "Sure, I'll forward your call to a human agent now. Please hold on a moment."}]}, {'id': 'item_BAVAuBlPcad8CIZAteA4J', 'object': 'realtime.item', 'type': 'function_call', 'status': 'completed', 'name': 'forward_call', 'call_id': 'call_ur8h6tqAhcZMYraQ', 'arguments': '{}'}...
However, the function call (forward_call) happens at the same time as the response. This means if I run the function immediately, it will interrupt the AI’s speech, which is not ideal.
Is there a way to ensure the AI finishes speaking before executing the function call? Or is there a better approach to handling this scenario?
Prompting gets you the user output audio generation announcing the function call plan, followed by tool function call in the same response, (instead of the typical function call alone). Or if talking about a function call to come is not useful, and you just want that function call, you can change the prompting approach or the function description.
However, the AI can generate audio faster than its playback rate. Realtime is event-driven and not actually real-time except for input buffering speed.
This needs consideration of exactly what must be held back. If programmed correctly, the only pauses needed would be simply for user experience, inserting a delay between the spoken announcement of the function call to come and the spoken results after function return.
You’d want to stream your audio to your client app that can play a loaded buffer until done, or if you have a monolithic personal app, spin off async or threaded function call execution that can run even while playback is happening instead of blocking the function calling execution.
Hope that helps with what you already know about your programming and backend.
Then: do you want to block voice activity and input stream, so hearing the tool results is mandatory?
Thanks for the thorough response. That makes perfect sense. I see three ways we could tackle this:
We could add a simple 5-second delay before triggering the function. This is the easiest, but it’s definitely a bit of a workaround.
We could wait for the current audio to finish playing before running the function to transfer the call. This is a bit more involved, as we’d need to track the audio state in the WebSocket.
Alternatively, we could prompt to only trigger the function call without the voice response and then trigger the transfer immediately.