I've previously worked with the OpenAI API, and handling text I/O there was much easier: when the model returned a function call, I would simply prepend the function's output as text to the user's next message.
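For context, this is roughly what that text-based flow looked like. A minimal sketch using the Chat Completions API; `get_weather`, the model name, and the message contents are just illustrative:

```python
from openai import OpenAI

client = OpenAI()

def get_weather(city: str) -> str:
    # Hypothetical function the model asked me to "call".
    return f"It is 18°C and sunny in {city}."

# I run the function myself and prepend its result as plain text
# to the user's next message.
function_output = get_weather("Prague")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": (
                f"[Function result: {function_output}]\n\n"
                "Should I take an umbrella?"
            ),
        },
    ],
)
print(response.choices[0].message.content)
```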
With the Realtime API, though, this seems much more difficult. The options I can think of are:
- I can convert the text returned from the function to speech and send it after the user finishes speaking, but this can cause problems with VAD (Voice Activity Detection); sketched below, after this list.
- Before sending a message via voice, I can first send a text message and then hope that, thanks to caching, the voice response from OpenAI will be influenced by that text response (also sketched below). Honestly, this feels a bit like gambling to me.
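To make the first option concrete, here's a rough sketch of the TTS approach, assuming `ws` is an already-open WebSocket to the Realtime session (e.g. via the `websockets` library) and that the session uses the default pcm16 input format; all the wiring around it is illustrative:

```python
import base64
import json

from openai import OpenAI

client = OpenAI()

def inject_function_output_as_audio(ws, function_output: str) -> None:
    # Synthesize the function's text output as raw PCM16 (24 kHz),
    # which matches the Realtime API's default input audio format.
    speech = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=function_output,
        response_format="pcm",
    )

    # Append the synthesized audio to the input buffer, then commit it
    # so it is treated as a finished turn. With server VAD enabled, the
    # timing of this injection is exactly where things get fragile.
    ws.send(json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(speech.content).decode("ascii"),
    }))
    ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
```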
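And the second option would amount to something like this: creating a plain text conversation item just before the next voice turn (again, `ws` and the bracketed wording are illustrative):

```python
import json

def send_text_before_voice(ws, function_output: str) -> None:
    # Add the function's result to the conversation as a text item and
    # hope the model weighs it when answering the upcoming voice turn.
    ws.send(json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [
                {
                    "type": "input_text",
                    "text": f"[Function result: {function_output}]",
                },
            ],
        },
    }))
```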
Both of these methods seem quite unreliable to me. Am I missing something?