What is the best practice for making database queries with the OpenAI Realtime (voice) API?

I have previously worked with the OpenAI API, and text I/O was much easier to handle: when the model returned a function call, I would run the function and prepend its output as text to the user's message.
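For context, my old text-only flow was roughly like the sketch below (simplified; the model name, the tool schema, and `run_db_query` are placeholders for my actual setup):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def run_db_query(arguments_json: str) -> str:
    """Placeholder for my real database lookup."""
    return "Customer 42 has 3 open orders."

tools = [{
    "type": "function",
    "function": {
        "name": "query_orders",
        "description": "Look up a customer's orders in the database",
        "parameters": {
            "type": "object",
            "properties": {"customer_id": {"type": "integer"}},
            "required": ["customer_id"],
        },
    },
}]

messages = [{"role": "user", "content": "How many orders does customer 42 have?"}]

first = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
tool_calls = first.choices[0].message.tool_calls

if tool_calls:
    result = run_db_query(tool_calls[0].function.arguments)
    # My old trick: prepend the function output as plain text to the user's
    # message instead of sending a proper tool-result message.
    messages[-1]["content"] = f"[DB RESULT] {result}\n\n" + messages[-1]["content"]
    second = client.chat.completions.create(model="gpt-4o", messages=messages)
    print(second.choices[0].message.content)
```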

With the Realtime API, though, this seems much more difficult. The options I can think of are:

  1. I can convert the text returned from the function to speech and send it after the user finishes speaking, but this can cause problems with VAD (Voice Activity Detection). See the sketch after this list.

  2. Before sending the message by voice, I can send it as a text message first and hope that, thanks to caching, the voice response from OpenAI will be influenced by the text response. Honestly, though, this feels like gambling to me.
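To make option 1 concrete, this is roughly what I have in mind (a rough sketch only; the model name, `run_db_query`, the TTS voice, and the event handling are placeholders, and I have not verified this end to end):

```python
import asyncio
import base64
import json
import os

import websockets
from openai import OpenAI

client = OpenAI()
URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

def run_db_query(arguments_json: str) -> str:
    """Placeholder for my real database lookup."""
    return "Customer 42 has 3 open orders."

def text_to_pcm16(text: str) -> bytes:
    # Separate TTS request; "pcm" is raw 24 kHz 16-bit mono, which should
    # match the Realtime session's default pcm16 input format.
    speech = client.audio.speech.create(
        model="tts-1", voice="alloy", input=text, response_format="pcm"
    )
    return speech.content

async def main() -> None:
    # websockets >= 14 uses additional_headers; older versions use extra_headers.
    async with websockets.connect(URL, additional_headers=HEADERS) as ws:
        async for raw in ws:
            event = json.loads(raw)

            if event["type"] == "response.function_call_arguments.done":
                result = run_db_query(event["arguments"])
                audio = text_to_pcm16(result)

                # The fragile part: the synthesized audio goes into the *input*
                # buffer, so server-side VAD treats it as user speech, and the
                # manual commit/response.create may clash with automatic turns.
                await ws.send(json.dumps({
                    "type": "input_audio_buffer.append",
                    "audio": base64.b64encode(audio).decode("ascii"),
                }))
                await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
                await ws.send(json.dumps({"type": "response.create"}))

asyncio.run(main())
```

The obvious weak point is that the synthesized result is treated as user speech by server-side VAD, which is exactly the problem I mentioned above.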

Both of these methods seem quite unreliable to me. Am I missing something?