Is there a way to prevent gpt-4o-audio-preview from returning audio?

Hi there!

I know it’s kind of an odd question, but I am actually using gpt-4o-audio-preview as follows:

  • Single audio file input
  • Streaming text output

Here is the relevant part of my code:

        response = await aclient.chat.completions.create(
            model="gpt-4o-audio-preview",
            modalities=["text", "audio"],
            audio={"voice": "alloy", "format": "pcm16"},
            messages=[
                {"role": "system", "content": [{"type": "text", "text": "REDACTED PROMPT"}]},
                {"role": "user", "content": [
                    {"type": "input_audio", "input_audio": {
                        "data": encoded_audio,
                        "format": "wav"
                    }}
                ]}
            ],
            stream=True
        )

By examining the chunks returned by the API, I noticed that it returns both transcript (text) chunks AND audio chunks.

Since I have no use for the audio chunks, is there a way to prevent the API from returning them so I don’t pay for them? I’m only interested in the text response.

Thank you!

You can give it a few multi-shot turns of user spoken input, and text assistant output. That should be enough to continue on that pattern and get assistant text tokens again.

Thanks for your answer! Didn’t think about that.

If I feed previous exchanges that are only text outputs from the model into the « multi turn » conversation, it will then stop returning audio and I won’t be billed for it? Seems too good to be true lol.

I will give it a try. Thanks again

Hi again @_j 🙂

Unfortunately, this solution does not seem to be working.

I added a couple of fake text exchanges to my completion request:

response = await aclient.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "pcm16"},
    messages=[
        {"role": "system", "content": [{"type": "text", "text": "You are a smart assistant who responds to questions in a natural and intuitive manner. You use a cheerful and warm tone. Your name is Sage. You are female. You respond only with text. You do not generate audio."}]},
        {"role": "user", "content": [
            {"type": "text", "text": "Hello, how are you?"}
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": "Hello! I'm happy to see you again. How can I assist you today?"}
        ]},
        {"role": "user", "content": [
            {"type": "input_audio", "input_audio": {
                "data": encoded_audio,
                "format": "wav"
            }}
        ]}
    ],
    stream=True
)

Unfortunately, the model still returns both text AND audio chunks.

This results in me being billed for the audio (super expensive) even though I don’t want to use it lol.

I really think there’s no way yet to use the completions endpoint with audio input and text-only output. That was my idea to save costs: take the text output, feed it to ElevenLabs, and it would still be much cheaper than the new audio models from OpenAI.

For more context, I wanted to use the completions endpoint with audio capabilities because I feel like it’s great at handling the audio I send it (pauses, background noise, etc.), unlike Whisper.

I guess I have no choice but to process the user’s audio locally with Whisper first, then feed the text to the completions endpoint with the text capability only, without audio?

Thanks.

Hold on. By removing its audio capabilities, I was able to get only text back, while it still accepted my audio input.

Working on this. Will get back here ASAP.

The idea would be in-context training. This is with chat completions.

The model will respond with audio once if you have only text input. This indicates that there is some priming of the model as assistant to speak in its selected voice.

However, with a text-only system, user, assistant, user sequence, it’s much harder to get audio out. You’ve shown the AI a pattern of responding to text with text, so it’s going to follow that.

The next step, since there is little understanding of “speak” or “write”, is to place user audio recordings and assistant text as prior turns. That’s the pattern you want.
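
Something like this shape of prior turns, roughly (a sketch; encoded_example_question here is just a placeholder for a short base64-encoded recording you would supply yourself):

# Sketch: prime the model with prior turns of user audio in, assistant text out.
# encoded_example_question is a placeholder for your own base64-encoded WAV clip.
messages = [
    {"role": "system", "content": [{"type": "text", "text": "You respond only with text."}]},
    {"role": "user", "content": [
        {"type": "input_audio", "input_audio": {
            "data": encoded_example_question,
            "format": "wav"
        }}
    ]},
    {"role": "assistant", "content": [
        {"type": "text", "text": "Sure! Here is my answer, in plain text."}
    ]},
    {"role": "user", "content": [
        {"type": "input_audio", "input_audio": {
            "data": encoded_audio,
            "format": "wav"
        }}
    ]}
]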

Then you can further pursue a system prompt that instructs that JSON output is mandatory, and text output is only received by an API that validates JSON schema, or similar.

Seems like a lot of work when you can just transcribe with whisper first, though.

Hey @_j

Actually, I have just confirmed that, by querying the completions endpoint with only the text capability, I am still able to send audio as input. It then responds with text only.

So I’m guessing that, since audio is no longer generated in the response, I am not paying for it. I’m only paying the $100/million tokens for audio input; I got rid of the $200/million tokens for audio output.

That cuts about two thirds of the price.
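
For reference, here’s roughly what the working call looks like now. It’s the same request as before, just without the audio modality and the audio parameter:

# Sketch: same request as before, minus the audio modality and the audio param.
# Audio input is still accepted, but only text comes back.
response = await aclient.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text"],
    messages=[
        {"role": "system", "content": [{"type": "text", "text": "REDACTED PROMPT"}]},
        {"role": "user", "content": [
            {"type": "input_audio", "input_audio": {
                "data": encoded_audio,
                "format": "wav"
            }}
        ]}
    ],
    stream=True
)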

Thing is though, I think you’re right: it still makes more sense to use whisper, as it’s so much cheaper.

I’m writing some test code right now to see how slow it is to first use whisper and then completion with text, versus using completion with input audio directly.
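
Something like this for the Whisper path (just a rough sketch; I’m assuming whisper-1 via the transcriptions endpoint, a plain text model for the completion, and a placeholder file path):

import time

# Sketch of the timing test for the Whisper path:
# transcribe first with whisper-1, then do a plain text completion.
# The file path and model names here are my own assumptions.
t0 = time.monotonic()

with open("user_audio.wav", "rb") as audio_file:
    transcript = await aclient.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

response = await aclient.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "REDACTED PROMPT"},
        {"role": "user", "content": transcript.text},
    ],
    stream=True,
)

async for chunk in response:
    pass  # consume the stream; timing ends when the last chunk arrives

print(f"Whisper + text completion took {time.monotonic() - t0:.2f}s")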

My guess is that it’ll be faster when using input audio directly since the audio processing happens on OpenAI’s side and they must have optimized the shit out of it.

I’ll let you know what I find out.

Whisper is actually a much better choice. The delay isn’t too bad, about the same as when I use the completion with audio capability. I think OpenAI just runs Whisper on their end when I do that lol.

Thanks a lot for helping out @_j <3 Going to close this thread as solved, pointing to the message where I realized you can input audio and force text-only output by removing the audio capability (which is counter-intuitive imo).

@_j <3 How about a realtime audio analytics use case where two callers are talking on a call, realtime sentiment is needed, and a summary of the call is needed when the call is terminated? Isn’t it a better choice, since Whisper does not support realtime?