Is there a way to prevent gpt-4o-audio-preview from returning audio?

Hi there!

I know it’s kind of an odd question, but I am actually using gpt-4o-audio-preview as follows:

  • Single audio file input
  • Streaming text output

Here is the relevant part of my code:

        response = await aclient.chat.completions.create(
            model="gpt-4o-audio-preview",
            modalities=["text", "audio"],
            audio={"voice": "alloy", "format": "pcm16"},
            messages=[
                {"role": "system", "content": [{"type": "text", "text": "REDACTED PROMPT"}]},
                {"role": "user", "content": [
                    {"type": "input_audio", "input_audio": {
                        "data": encoded_audio,
                        "format": "wav"
                    }}
                ]}
            ],
            stream=True
        )

By examining the chunks returned by the API, I noticed that it returns both transcript (text) chunks AND audio chunks.

Since I have no use for the audio chunks, is there a way to prevent the API from returning them so I don’t pay for them? I’m only interested in the text response.

Thank you!

You can give it a few multi-shot turns of user spoken input, and text assistant output. That should be enough to continue on that pattern and get assistant text tokens again.

Thanks for your answer! Didn’t think about that.

If I feed previous exchanges that are only text outputs from the model into the « multi turn » conversation, it will then stop returning audio and I won’t be billed for it? Seems too good to be true lol.

I will give it a try. Thanks again

Hi again @_j 🙂

Unfortunately, this solution does not seem to be working.

I added a couple of fake text exchanges to my completion request:

response = await aclient.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "pcm16"},
    messages=[
        {"role": "system", "content": [{"type": "text", "text": "You are a smart assistant who responds to questions in a natural and intuitive manner. You use a cheerful and warm tone. Your name is Sage. You are female. You respond only with text. You do not generate audio."}]},
        {"role": "user", "content": [
            {"type": "text", "text": "Hello, how are you?"}
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": "Hello! I'm happy to see you again. How can I assist you today?"}
        ]},
        {"role": "user", "content": [
            {"type": "input_audio", "input_audio": {
                "data": encoded_audio,
                "format": "wav"
            }}
        ]}
    ],
    stream=True
)

Unfortunately, the model still returns both text AND audio chunks.

This results in me being billed for the audio (super expensive) even though I don’t want to use it lol.

I really think there’s no way yet to use the completions endpoint with audio input and text-only output. That was my idea to save costs: take the text output, feed it to ElevenLabs, and it would still be much cheaper than the new audio models from OpenAI.

For more context, I wanted to use the completions endpoint with audio capabilities because I feel like it’s great at handling the audio I send it (pauses, background noise, etc.), unlike Whisper.

I guess I have no choice but to process the user’s audio locally with Whisper first, then feed the text to the completions endpoint with the text capability only, without audio?

Thanks.

Hold on. By removing its audio capabilities, I was able to get only text back, while it still accepted my audio input.

Working on this. Will get back here ASAP.

The idea would be in-context training. This is with chat completions.

The model will respond with audio once if you have only text input. This indicates that there is some priming of the model as assistant to speak in its selected voice.

However, with a text-only system, user, assistant, user sequence, it’s much harder to get audio out. You’ve shown the AI a pattern of responding to text with text, so it’s going to follow that.

The next step, since there is little understanding of “speak” or “write”, is to place user audio recordings and assistant text as prior turns. That’s the pattern you want.
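
Something like this shape of prior turns, roughly (a sketch; encoded_example_question here is just a placeholder for a short base64-encoded recording you would supply yourself):

# Sketch: prime the model with prior turns of user audio in, assistant text out.
# encoded_example_question is a placeholder for your own base64-encoded WAV clip.
messages = [
    {"role": "system", "content": [{"type": "text", "text": "You respond only with text."}]},
    {"role": "user", "content": [
        {"type": "input_audio", "input_audio": {
            "data": encoded_example_question,
            "format": "wav"
        }}
    ]},
    {"role": "assistant", "content": [
        {"type": "text", "text": "Sure! Here is my answer, in plain text."}
    ]},
    {"role": "user", "content": [
        {"type": "input_audio", "input_audio": {
            "data": encoded_audio,
            "format": "wav"
        }}
    ]}
]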

Then you can further pursue a system prompt that instructs that JSON output is mandatory, and text output is only received by an API that validates JSON schema, or similar.

Seems like a lot of work when you can just transcribe with whisper first, though.

Hey @_j

Actually, I have just confirmed that, by querying the completions endpoint with only the text capability, I am still able to send audio as input. It then responds with text only.

So I’m guessing that, since audio is no longer generated in the response, I am not paying for it. I’m only paying the $100/million tokens for audio input; I got rid of the $200/million tokens for audio output.

That cuts about two thirds of the price.
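
For reference, here’s roughly what the working call looks like now. It’s the same request as before, just without the audio modality and the audio parameter:

# Sketch: same request as before, minus the audio modality and the audio param.
# Audio input is still accepted, but only text comes back.
response = await aclient.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text"],
    messages=[
        {"role": "system", "content": [{"type": "text", "text": "REDACTED PROMPT"}]},
        {"role": "user", "content": [
            {"type": "input_audio", "input_audio": {
                "data": encoded_audio,
                "format": "wav"
            }}
        ]}
    ],
    stream=True
)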

Thing is though, I think you’re right: it still makes more sense to use whisper, as it’s so much cheaper.

I’m writing some test code right now to see how slow it is to first use whisper and then completion with text, versus using completion with input audio directly.
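
Something like this for the Whisper path (just a rough sketch; I’m assuming whisper-1 via the transcriptions endpoint, a plain text model for the completion, and a placeholder file path):

import time

# Sketch of the timing test for the Whisper path:
# transcribe first with whisper-1, then do a plain text completion.
# The file path and model names here are my own assumptions.
t0 = time.monotonic()

with open("user_audio.wav", "rb") as audio_file:
    transcript = await aclient.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

response = await aclient.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "REDACTED PROMPT"},
        {"role": "user", "content": transcript.text},
    ],
    stream=True,
)

async for chunk in response:
    pass  # consume the stream; timing ends when the last chunk arrives

print(f"Whisper + text completion took {time.monotonic() - t0:.2f}s")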

My guess is that it’ll be faster when using input audio directly since the audio processing happens on OpenAI’s side and they must have optimized the shit out of it.

I’ll let you know what I find out.

Whisper is actually a much better choice. The delay isn’t too bad, about the same as when I use the completion with audio capability. I think OpenAI just runs Whisper on their end when I do that lol.

Thanks a lot for helping out @_j <3 Going to close this thread as solved, pointing to the message where I realized you can input audio and force text-only output by removing the audio capability (which is counter-intuitive imo).

@_j <3 How about a realtime audio analytics use case where two callers are talking on a call, realtime sentiment is needed, and a summary of the call is needed when the call is terminated? Isn’t it a better choice, since Whisper does not support realtime?