Multiturn conversation format using gpt-4o-audio-preview with audio input

I am confused about how to format multi-turn conversations with gpt-4o-audio-preview across both the text and audio modalities.

I can correctly send in audio and get back audio and text, as long as I don't include any tools. (Including a large number of tools causes a 500 error, reported in another thread.)

However, when I try to continue the conversation, the API complains that I didn't include a transcript for the input_audio from the first turn.

Assuming the user is interacting only with their voice, how is a multi-turn, audio-in/audio-out conversation supposed to be structured?
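For context, here is the shape I'd expect a multi-turn exchange to take, where the prior assistant audio is carried forward by id rather than re-sent. This is only a sketch in Python; `build_followup_messages` is my own helper name, and the audio id and base64 strings are placeholders:

```python
def build_followup_messages(first_audio_b64, assistant_audio_id, second_audio_b64):
    """Message list for the second user turn of an audio conversation.

    All ids and base64 payloads here are placeholders, not real values.
    """
    return [
        # Turn 1: the user's original spoken input, as base64 audio.
        {
            "role": "user",
            "content": [
                {"type": "input_audio",
                 "input_audio": {"data": first_audio_b64, "format": "wav"}},
            ],
        },
        # The assistant's previous audio response, referenced by id only
        # instead of echoing the transcript or audio bytes back.
        {"role": "assistant", "audio": {"id": assistant_audio_id}},
        # Turn 2: the user's next spoken input.
        {
            "role": "user",
            "content": [
                {"type": "input_audio",
                 "input_audio": {"data": second_audio_b64, "format": "wav"}},
            ],
        },
    ]

messages = build_followup_messages("<b64 turn 1>", "audio_abc123", "<b64 turn 2>")
```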

Part of the issue is that I need to maintain a text transcript of both the assistant and user messages, so that our users can swap between voice mode and text mode (similar to ChatGPT).

I ended up using Whisper to transcribe the user's message and sending that text instead of the audio, since the multimodal Chat Completions endpoint doesn't return the user's text. This doesn't increase the latency all THAT much.
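Roughly, the workaround looks like this (Python sketch; `build_text_turn` and `request_params` are hypothetical helper names, and the voice/format values are just examples — in practice the transcript comes from `client.audio.transcriptions.create(model="whisper-1", ...)` and the reply from `client.chat.completions.create(**request_params(history))` with the official SDK):

```python
def build_text_turn(transcript):
    """User turn sent as plain text (the Whisper transcript), not audio.

    This keeps a text copy of the user's side of the conversation for free.
    """
    return {"role": "user", "content": transcript}

def request_params(history):
    """Kwargs for client.chat.completions.create(...).

    The model still answers with audio because of the `modalities` and
    `audio` parameters, even though the user turns are plain text.
    """
    return {
        "model": "gpt-4o-audio-preview",
        "modalities": ["text", "audio"],
        "audio": {"voice": "alloy", "format": "wav"},  # example values
        "messages": history,
    }
```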

Further, I realized that after I play back the audio response, I just needed to:

  1. copy the transcription to the content (so text mode would work)
  2. nil the audio.transcript
  3. nil the audio.data
  4. leave the audio.id

Then it seems to work properly (still can't use tools, though!)
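In code, those four steps amount to something like this (Python sketch; `archive_assistant_turn` is my own name for the helper, and the sample message values are placeholders):

```python
def archive_assistant_turn(msg):
    """Rewrite an assistant message for the conversation history after
    its audio has been played back: keep the text (so text mode works)
    and the audio id (so the API can still reference it), drop the rest.
    """
    audio = msg.get("audio") or {}
    return {
        "role": "assistant",
        # 1. copy the transcription into content so text mode can render it
        "content": audio.get("transcript", msg.get("content")),
        # 2-3. drop audio.transcript and audio.data; 4. keep only audio.id
        "audio": {"id": audio["id"]},
    }

turn = {
    "role": "assistant",
    "content": None,
    "audio": {"id": "audio_abc123", "transcript": "Hello!", "data": "<b64>"},
}
archived = archive_assistant_turn(turn)
# archived == {"role": "assistant", "content": "Hello!", "audio": {"id": "audio_abc123"}}
```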