Audio support in the Chat Completions API

The Chat Completions API now supports audio inputs and outputs using a new model snapshot: gpt-4o-audio-preview. Based on the same advanced voice model powering the Realtime API, audio support in the Chat Completions API lets you:

  • Handle any combination of text and audio: Pass in text, audio, or text and audio and receive responses in both audio and text.
  • Use natural, steerable voices: Similar to the Realtime API, you can use prompting to shape the language, pronunciation, emotional range, and other aspects of the generated audio.
  • Use tool calling: Pass tool definitions and include instructions on tool use in the system prompt, similar to how you would with text in Chat Completions. The output of the tool call will be delivered via text + audio.
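The tool-calling bullet above can be sketched as a request payload. This is a sketch, not the exact API surface: the `get_weather` tool name and its schema are made up for illustration, while the overall shape follows the standard Chat Completions tools format plus the new `modalities` and `audio` parameters.

```python
# Sketch of a Chat Completions request combining audio output with tool
# calling. The tool name and schema are hypothetical; the request shape
# follows the standard Chat Completions tools format.
def build_audio_tool_request() -> dict:
    return {
        "model": "gpt-4o-audio-preview",
        "modalities": ["text", "audio"],
        "audio": {"voice": "alloy", "format": "wav"},
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",  # hypothetical tool
                    "description": "Look up the current weather for a city.",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
            }
        ],
        "messages": [
            {"role": "system",
             "content": "Use get_weather when asked about the weather."},
            {"role": "user", "content": "What's the weather in Denver?"},
        ],
    }

request = build_audio_tool_request()
```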

This feature is well suited for asynchronous use cases that don’t require extremely low latency. For more dynamic, real-time interactions, you should use the Realtime API. To get started, see the guide on audio support in our docs.

I’m excited to hear what you build! :speaking_head: :ear:

24 Likes

I saw a hint of it coming earlier, with the new usage report fields in the API response object!

CompletionUsage(completion_tokens=18, prompt_tokens=891, total_tokens=909, completion_tokens_details=CompletionTokensDetails(audio_tokens=None, reasoning_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=None, cached_tokens=0))

And model-specific pricing is up:

gpt-4o-audio-preview-2024-10-01

Text

in: $2.50 / 1M tokens
out: $10.00 / 1M tokens

Audio

in: $100.00 / 1M tokens
out: $200.00 / 1M tokens

with per-minute rates that might adhere more closely to the pricing guide…


Python usage cost metacode
# Define cost constants (prices per token)
TEXT_INPUT_COST_PER_TOKEN = 2.50 / 1_000_000
AUDIO_INPUT_COST_PER_TOKEN = 100.00 / 1_000_000
TEXT_OUTPUT_COST_PER_TOKEN = 10.00 / 1_000_000
AUDIO_OUTPUT_COST_PER_TOKEN = 200.00 / 1_000_000

# Total input and output tokens
prompt_tokens = ...  # Total number of input tokens
completion_tokens = ...  # Total number of output tokens

# Number of audio tokens in input and output
prompt_audio_tokens = ...  # Number of audio input tokens
completion_audio_tokens = ...  # Number of audio output tokens

# Calculate number of text tokens
prompt_text_tokens = prompt_tokens - prompt_audio_tokens
completion_text_tokens = completion_tokens - completion_audio_tokens

# Calculate costs for input tokens
input_text_cost = prompt_text_tokens * TEXT_INPUT_COST_PER_TOKEN
input_audio_cost = prompt_audio_tokens * AUDIO_INPUT_COST_PER_TOKEN

# Calculate costs for output tokens
output_text_cost = completion_text_tokens * TEXT_OUTPUT_COST_PER_TOKEN
output_audio_cost = completion_audio_tokens * AUDIO_OUTPUT_COST_PER_TOKEN

# Total cost calculation
total_cost = (
    input_text_cost +
    input_audio_cost +
    output_text_cost +
    output_audio_cost
)
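Plugging the usage numbers from the CompletionUsage object earlier in the post into that metacode, with one wrinkle made explicit: `audio_tokens` can come back as `None` (as in the all-text example above), so it has to be coerced to 0 before the text-token subtraction.

```python
# Compute a request's cost from its usage report, at the preview pricing
# listed above. audio_tokens may be None when no audio was sent or
# received, so it is coerced to 0 before the text-token subtraction.
TEXT_INPUT_COST_PER_TOKEN = 2.50 / 1_000_000
AUDIO_INPUT_COST_PER_TOKEN = 100.00 / 1_000_000
TEXT_OUTPUT_COST_PER_TOKEN = 10.00 / 1_000_000
AUDIO_OUTPUT_COST_PER_TOKEN = 200.00 / 1_000_000

def request_cost(prompt_tokens, completion_tokens,
                 prompt_audio_tokens=None, completion_audio_tokens=None):
    pa = prompt_audio_tokens or 0
    ca = completion_audio_tokens or 0
    return ((prompt_tokens - pa) * TEXT_INPUT_COST_PER_TOKEN
            + pa * AUDIO_INPUT_COST_PER_TOKEN
            + (completion_tokens - ca) * TEXT_OUTPUT_COST_PER_TOKEN
            + ca * AUDIO_OUTPUT_COST_PER_TOKEN)

# The all-text usage above (891 prompt, 18 completion, no audio tokens):
cost = request_cost(891, 18)  # → 0.0024075, i.e. about a quarter of a cent
```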

And all the voices:

Specifies the voice type. Supported voices are alloy, echo, fable, onyx, nova, and shimmer.


Context Cache support status?

4 Likes

OMG you guys! OMG. LOL

I tried it out.

completion = client.chat.completions.create(
    model='gpt-4o-audio-preview',
    modalities=["text","audio"],
    audio={"voice": "onyx", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": "In a jaunty American Colorado Mountain Region accent, I'd like you to please introduce yourself. Then, in a slow, brittish drawl, please go full-on thesbian and declaim a paragraph from a famous Shakespeare play."
        }
    ]
)

I made a YouTube short from the response, which was about 30 seconds long. (How can we share audio here?) Blew my mind, check it out.
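For anyone wanting to save the reply locally the way I did: the response carries base64-encoded audio (in my request above, `completion.choices[0].message.audio.data`, in the `wav` format I asked for), which you can decode and write straight to disk. A minimal sketch:

```python
import base64

def save_audio(b64_data: str, path: str) -> None:
    """Decode base64-encoded audio data and write it to a file."""
    with open(path, "wb") as f:
        f.write(base64.b64decode(b64_data))

# e.g. save_audio(completion.choices[0].message.audio.data, "reply.wav")
```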

7 Likes

Chat Completions supports alloy, echo, fable, onyx, nova, and shimmer, but RealtimeClient only supports alloy, echo, and shimmer. Are you planning to support the additional voices in RealtimeClient?

1 Like

Is it possible to do audio-in and audio-out? The docs don’t suggest so, but they also say the only difference from the Realtime API is that it has lower latency.

1 Like

Is it possible to do audio-in and audio-out

Yes. You can pass in audio via the input_audio parameter, and set modalities=["text","audio"].
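A sketch of what that request body looks like: the audio is base64-encoded and passed as an `input_audio` content part alongside any text. The local file path here is hypothetical, and the voice choice is arbitrary.

```python
import base64

def build_audio_in_request(wav_path: str) -> dict:
    # Read a local WAV file and base64-encode it for the input_audio part.
    with open(wav_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return {
        "model": "gpt-4o-audio-preview",
        "modalities": ["text", "audio"],
        "audio": {"voice": "alloy", "format": "wav"},
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What is said in this recording?"},
                    {"type": "input_audio",
                     "input_audio": {"data": encoded, "format": "wav"}},
                ],
            }
        ],
    }
```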

4 Likes

Also available for Swift developers in Release SwiftOpenAI v3.9.0 · jamesrochabrun/SwiftOpenAI · GitHub

1 Like

I keep getting this error when using the new model gpt-4o-audio-preview:
TypeError: Completions.create() got an unexpected keyword argument 'modalities'

1 Like

You will need to upgrade your openai python module to the latest, and use the correct methods of the library, as shown in the API reference.

The python library blocks parameters it doesn’t know about.

>>> from openai import Client; c=Client()
>>> r = c.chat.completions.create(model="xx",
...     messages=[{"role": "user", "content": "xx"}],
...     invalidparameter=["text", "audio"])
Traceback (most recent call last):
  File "<pyshell#14>", line 1, in <module>
    r = c.chat.completions.create(model="xx",
  File "C:\Program Files\Python311\Lib\site-packages\openai\_utils\_utils.py", line 274, in wrapper
    return func(*args, **kwargs)
TypeError: Completions.create() got an unexpected keyword argument 'invalidparameter'

1 Like

This functionality is also available in Java.

How do you just pull out the text from the response?
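When audio output is enabled with `modalities=["text","audio"]`, the assistant message's `content` field is typically `None` and the text comes back as the audio transcript instead. A sketch of a helper that handles both cases (attribute names assume the Python SDK's response objects):

```python
def response_text(completion) -> str:
    # With audio output enabled, message.content is usually None and the
    # spoken text is returned as message.audio.transcript instead.
    message = completion.choices[0].message
    if getattr(message, "audio", None) is not None:
        return message.audio.transcript
    return message.content
```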

A post was split to a new topic: Issues with gpt-4o-audio-preview when using tools/functions