Multiturn conversation format using gpt-4o-audio-preview with audio input

I am confused about how to format multi-turn conversations with gpt-4o-audio-preview across both the text and audio modalities.

I can correctly send in audio and get back audio and text, as long as I don't include any tools. (Including a large number of tools causes a 500 error, reported in another thread.)

However, when I try to continue the conversation, the API complains that I didn't include a transcript for the input_audio from the first turn.

Assuming the user is interacting only with their voice, how is a multi-turn, audio-in/audio-out conversation supposed to be structured?
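For context, here is the shape I'd expect a multi-turn exchange to take, where the prior assistant audio is carried forward by id rather than re-sent. This is only a sketch in Python; `build_followup_messages` is my own helper name, and the audio id and base64 strings are placeholders:

```python
def build_followup_messages(first_audio_b64, assistant_audio_id, second_audio_b64):
    """Message list for the second user turn of an audio conversation.

    All ids and base64 payloads here are placeholders, not real values.
    """
    return [
        # Turn 1: the user's original spoken input, as base64 audio.
        {
            "role": "user",
            "content": [
                {"type": "input_audio",
                 "input_audio": {"data": first_audio_b64, "format": "wav"}},
            ],
        },
        # The assistant's previous audio response, referenced by id only
        # instead of echoing the transcript or audio bytes back.
        {"role": "assistant", "audio": {"id": assistant_audio_id}},
        # Turn 2: the user's next spoken input.
        {
            "role": "user",
            "content": [
                {"type": "input_audio",
                 "input_audio": {"data": second_audio_b64, "format": "wav"}},
            ],
        },
    ]

messages = build_followup_messages("<b64 turn 1>", "audio_abc123", "<b64 turn 2>")
```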

Part of the issue is that I need to maintain a text transcript of both the assistant and user messages, so that our users can swap between voice mode and text mode (similar to ChatGPT).

I ended up using Whisper to transcribe the user's message and sending that text instead of the audio, since the multimodal Chat Completions endpoint doesn't return the user's text. This doesn't increase the latency all THAT much.
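Roughly, the workaround looks like this (Python sketch; `build_text_turn` and `request_params` are hypothetical helper names, and the voice/format values are just examples — in practice the transcript comes from `client.audio.transcriptions.create(model="whisper-1", ...)` and the reply from `client.chat.completions.create(**request_params(history))` with the official SDK):

```python
def build_text_turn(transcript):
    """User turn sent as plain text (the Whisper transcript), not audio.

    This keeps a text copy of the user's side of the conversation for free.
    """
    return {"role": "user", "content": transcript}

def request_params(history):
    """Kwargs for client.chat.completions.create(...).

    The model still answers with audio because of the `modalities` and
    `audio` parameters, even though the user turns are plain text.
    """
    return {
        "model": "gpt-4o-audio-preview",
        "modalities": ["text", "audio"],
        "audio": {"voice": "alloy", "format": "wav"},  # example values
        "messages": history,
    }
```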

Further, I realized that after I play back the audio response, I just needed to:

  1. copy the transcription to the content (so text mode would work)
  2. nil the audio.transcript
  3. nil the audio.data
  4. leave the audio.id

Then it seems to work properly (still can't use tools, though!)
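In code, those four steps amount to something like this (Python sketch; `archive_assistant_turn` is my own name for the helper, and the sample message values are placeholders):

```python
def archive_assistant_turn(msg):
    """Rewrite an assistant message for the conversation history after
    its audio has been played back: keep the text (so text mode works)
    and the audio id (so the API can still reference it), drop the rest.
    """
    audio = msg.get("audio") or {}
    return {
        "role": "assistant",
        # 1. copy the transcription into content so text mode can render it
        "content": audio.get("transcript", msg.get("content")),
        # 2-3. drop audio.transcript and audio.data; 4. keep only audio.id
        "audio": {"id": audio["id"]},
    }

turn = {
    "role": "assistant",
    "content": None,
    "audio": {"id": "audio_abc123", "transcript": "Hello!", "data": "<b64>"},
}
archived = archive_assistant_turn(turn)
# archived == {"role": "assistant", "content": "Hello!", "audio": {"id": "audio_abc123"}}
```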