How to download audio from gpt-4o-audio-preview

Here is a sample conversation with audio-preview:

curl "https://api.openai.com/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -d '{
        "model": "gpt-4o-audio-preview",
        "modalities": ["text", "audio"],
        "audio": { "voice": "alloy", "format": "wav" },
        "messages": [
            {
                "role": "user",
                "content": "Is a golden retriever a good family dog?"
            },
            {
                "role": "assistant",
                "audio": {
                    "id": "audio_abc123"
                }
            },
            {
                "role": "user",
                "content": "Why do you say they are loyal?"
            }
        ]
    }'

How can I download the audio referenced by the id above?

You can do that by base64-decoding the data attribute of the audio object in the returned chat completion message and writing the bytes to a file.

import base64
from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": "Is a golden retriever a suitable family dog?"
        }
    ]
)

print(completion.choices[0])

wav_bytes = base64.b64decode(completion.choices[0].message.audio.data)
with open("dog.wav", "wb") as f:
    f.write(wav_bytes)
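For completeness, the same message.audio object also carries an id (for continuing the conversation), an expires_at timestamp, and a transcript. A minimal sketch of building a follow-up request that references the audio by id rather than resending the bytes (the helper name is illustrative, not part of the SDK):

```python
# Illustrative helper: construct a messages list that references the
# assistant's previous audio turn by id instead of raw audio data.
def build_follow_up_messages(audio_id, first_question, follow_up_question):
    return [
        {"role": "user", "content": first_question},
        # Reference the stored server-side audio by its id.
        {"role": "assistant", "audio": {"id": audio_id}},
        {"role": "user", "content": follow_up_question},
    ]

messages = build_follow_up_messages(
    "audio_abc123",
    "Is a golden retriever a suitable family dog?",
    "Why do you say they are loyal?",
)
```

This messages list can then be passed to client.chat.completions.create() exactly as in the curl example above.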

The audio ID itself cannot be used to replay or recover the sound. The assistant output it refers to is stored server-side solely for continuing a conversation, and it expires.

The likely reason for this ID system for chat-history audio is that OpenAI doesn't want developers to be able to inject their own audio into API requests as the assistant's voice or messages, which could steer output through in-context learning. The expiration also breaks long-term chat continuations.

You'll have to save the original response message and its generated audio part yourself, if the desired application is a chat UI that can replay what was previously spoken.
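A minimal sketch of that persistence, assuming you keep the base64 data and transcript from each assistant turn (the save_turn helper, file names, and directory layout are illustrative only):

```python
import base64
import json
import pathlib

# Illustrative: write each assistant turn's decoded wav plus a small
# metadata file, so a chat UI can replay the audio later.
def save_turn(turn_index, audio_obj, out_dir="chat_audio"):
    path = pathlib.Path(out_dir)
    path.mkdir(exist_ok=True)
    wav_path = path / f"turn_{turn_index}.wav"
    wav_path.write_bytes(base64.b64decode(audio_obj["data"]))
    meta = {"transcript": audio_obj.get("transcript", ""), "wav": wav_path.name}
    (path / f"turn_{turn_index}.json").write_text(json.dumps(meta))
    return wav_path
```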

As an example of collecting that response data, I just tacked an audio extractor onto my existing handling of streaming tool, function, and other object collection from a Python httpx request (not the OpenAI SDK).

    if response.status_code != 200:
        print(f"HTTP error {response.status_code}: {response.text}")
        # retry/reprompt
        continue
    else:
        print("API request: success")
        response_content = b''
        for chunk in response.iter_bytes(chunk_size=8192):
            if chunk:
                response_content += chunk
        response_data = json.loads(response_content.decode('utf-8'))

        if 'choices' in response_data and response_data['choices']:
            print("-- choices list received --")
            choice = response_data['choices'][0]['message']
            reply = choice.get('content', "")
            audio_data = choice.get('audio', {})
            audio_base64 = audio_data.get('data', "")
            transcript = audio_data.get('transcript', "")
            print(reply if reply is not None else '', transcript if transcript is not None else '')

            print("\n", response_data.get('usage', {}))
            
            if audio_base64:
                save_and_play_audio(audio_base64, VOICE)
            # use the ID if you really want, I don't
            chat.append({"role": "assistant", "content": reply or transcript or ""})
            user_input = input("\nPrompt: ")
            user_message = {"role": "user", "content": user_input}
            chat.append(user_message)
        else:
            print("No valid response received.")
            ...

save_and_play_audio() does what it says.
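The helper itself isn't shown above; a hedged guess at what save_and_play_audio might look like (the player commands and the play flag are assumptions, not from the original post):

```python
import base64
import subprocess
import sys
import time

# Hypothetical reconstruction: decode the base64 wav to disk, then hand
# it to a platform audio player. The voice name is used only for the file name.
def save_and_play_audio(audio_base64, voice, play=True):
    filename = f"{voice}_{int(time.time())}.wav"
    with open(filename, "wb") as f:
        f.write(base64.b64decode(audio_base64))
    if play:
        # Platform-specific playback: 'afplay' on macOS, 'aplay' on Linux.
        player = "afplay" if sys.platform == "darwin" else "aplay"
        subprocess.Popen([player, filename])
    return filename
```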

CURL is not the right tool…
