Audio support in the Chat Completions API

The Chat Completions API now supports audio inputs and outputs using a new model snapshot: gpt-4o-audio-preview. Based on the same advanced voice model powering the Realtime API, audio support in the Chat Completions API lets you:

  • Handle any combination of text and audio: Pass in text, audio, or text and audio and receive responses in both audio and text.
  • Use natural, steerable voices: Similar to the Realtime API, you can use prompting to shape the language, pronunciation, emotional range, and other aspects of the generated audio.
  • Use tool calling: Pass tool definitions and include instructions on tool use in the system prompt, similar to how you would with text in Chat Completions. The output of the tool call will be delivered via text + audio.
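The tool-calling bullet above can be sketched as a request payload. This is a sketch, not the exact API surface: the `get_weather` tool name and its schema are made up for illustration, while the overall shape follows the standard Chat Completions tools format plus the new `modalities` and `audio` parameters.

```python
# Sketch of a Chat Completions request combining audio output with tool
# calling. The tool name and schema are hypothetical; the request shape
# follows the standard Chat Completions tools format.
def build_audio_tool_request() -> dict:
    return {
        "model": "gpt-4o-audio-preview",
        "modalities": ["text", "audio"],
        "audio": {"voice": "alloy", "format": "wav"},
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",  # hypothetical tool
                    "description": "Look up the current weather for a city.",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
            }
        ],
        "messages": [
            {"role": "system",
             "content": "Use get_weather when asked about the weather."},
            {"role": "user", "content": "What's the weather in Denver?"},
        ],
    }

request = build_audio_tool_request()
```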

This feature is well suited for asynchronous use cases that don’t require extremely low latency. For more dynamic, real-time interactions, you should use the Realtime API. To get started, see the guide on audio support in our docs.

I’m excited to hear what you build! :speaking_head: :ear:

24 Likes

I saw a hint of it coming earlier, with the new usage report fields in the API response object!

CompletionUsage(completion_tokens=18, prompt_tokens=891, total_tokens=909, completion_tokens_details=CompletionTokensDetails(audio_tokens=None, reasoning_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=None, cached_tokens=0))

And model-specific pricing is up:

gpt-4o-audio-preview-2024-10-01

Text

in: $2.50 / 1M tokens
out: $10.00 / 1M tokens

Audio

in: $100.00 / 1M tokens
out: $200.00 / 1M tokens

with per-minute rates that might adhere more closely to the pricing guide…


Python usage cost metacode
# Define cost constants (prices per token)
TEXT_INPUT_COST_PER_TOKEN = 2.50 / 1_000_000
AUDIO_INPUT_COST_PER_TOKEN = 100.00 / 1_000_000
TEXT_OUTPUT_COST_PER_TOKEN = 10.00 / 1_000_000
AUDIO_OUTPUT_COST_PER_TOKEN = 200.00 / 1_000_000

# Total input and output tokens
prompt_tokens = ...  # Total number of input tokens
completion_tokens = ...  # Total number of output tokens

# Number of audio tokens in input and output
prompt_audio_tokens = ...  # Number of audio input tokens
completion_audio_tokens = ...  # Number of audio output tokens

# Calculate number of text tokens
prompt_text_tokens = prompt_tokens - prompt_audio_tokens
completion_text_tokens = completion_tokens - completion_audio_tokens

# Calculate costs for input tokens
input_text_cost = prompt_text_tokens * TEXT_INPUT_COST_PER_TOKEN
input_audio_cost = prompt_audio_tokens * AUDIO_INPUT_COST_PER_TOKEN

# Calculate costs for output tokens
output_text_cost = completion_text_tokens * TEXT_OUTPUT_COST_PER_TOKEN
output_audio_cost = completion_audio_tokens * AUDIO_OUTPUT_COST_PER_TOKEN

# Total cost calculation
total_cost = (
    input_text_cost +
    input_audio_cost +
    output_text_cost +
    output_audio_cost
)
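Plugging the usage numbers from the CompletionUsage object earlier in the post into that metacode, with one wrinkle made explicit: `audio_tokens` can come back as `None` (as in the all-text example above), so it has to be coerced to 0 before the text-token subtraction.

```python
# Compute a request's cost from its usage report, at the preview pricing
# listed above. audio_tokens may be None when no audio was sent or
# received, so it is coerced to 0 before the text-token subtraction.
TEXT_INPUT_COST_PER_TOKEN = 2.50 / 1_000_000
AUDIO_INPUT_COST_PER_TOKEN = 100.00 / 1_000_000
TEXT_OUTPUT_COST_PER_TOKEN = 10.00 / 1_000_000
AUDIO_OUTPUT_COST_PER_TOKEN = 200.00 / 1_000_000

def request_cost(prompt_tokens, completion_tokens,
                 prompt_audio_tokens=None, completion_audio_tokens=None):
    pa = prompt_audio_tokens or 0
    ca = completion_audio_tokens or 0
    return ((prompt_tokens - pa) * TEXT_INPUT_COST_PER_TOKEN
            + pa * AUDIO_INPUT_COST_PER_TOKEN
            + (completion_tokens - ca) * TEXT_OUTPUT_COST_PER_TOKEN
            + ca * AUDIO_OUTPUT_COST_PER_TOKEN)

# The all-text usage above (891 prompt, 18 completion, no audio tokens):
cost = request_cost(891, 18)  # → 0.0024075, i.e. about a quarter of a cent
```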

And all the voices:

Specifies the voice type. Supported voices are alloy, echo, fable, onyx, nova, and shimmer.


Context Cache support status?

4 Likes

OMG you guys! OMG. LOL

I tried it out.

completion = client.chat.completions.create(
    model='gpt-4o-audio-preview',
    modalities=["text","audio"],
    audio={"voice": "onyx", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": "In a jaunty American Colorado Mountain Region accent, I'd like you to please introduce yourself. Then, in a slow, brittish drawl, please go full-on thesbian and declaim a paragraph from a famous Shakespeare play."
        }
    ]
)

I made a YouTube short from the response, which was about 30 seconds long. (How can we share audio here?) Blew my mind, check it out.
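For anyone wanting to save the reply locally the way I did: the response carries base64-encoded audio (in my request above, `completion.choices[0].message.audio.data`, in the `wav` format I asked for), which you can decode and write straight to disk. A minimal sketch:

```python
import base64

def save_audio(b64_data: str, path: str) -> None:
    """Decode base64-encoded audio data and write it to a file."""
    with open(path, "wb") as f:
        f.write(base64.b64decode(b64_data))

# e.g. save_audio(completion.choices[0].message.audio.data, "reply.wav")
```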

7 Likes

Chat Completions supports alloy, echo, fable, onyx, nova, and shimmer, but RealtimeClient only supports alloy, echo, and shimmer. Are you planning to support the additional voices in RealtimeClient?

1 Like

Is it possible to do audio-in and audio-out? The docs don’t suggest so, but they also say the only difference from the Realtime API is that it has lower latency.

1 Like

Is it possible to do audio-in and audio-out

Yes. You can pass in audio via the input_audio parameter, and set modalities=["text","audio"].
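A sketch of what that request body looks like: the audio is base64-encoded and passed as an `input_audio` content part alongside any text. The local file path here is hypothetical, and the voice choice is arbitrary.

```python
import base64

def build_audio_in_request(wav_path: str) -> dict:
    # Read a local WAV file and base64-encode it for the input_audio part.
    with open(wav_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return {
        "model": "gpt-4o-audio-preview",
        "modalities": ["text", "audio"],
        "audio": {"voice": "alloy", "format": "wav"},
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What is said in this recording?"},
                    {"type": "input_audio",
                     "input_audio": {"data": encoded, "format": "wav"}},
                ],
            }
        ],
    }
```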

4 Likes

Also available for Swift developers in Release SwiftOpenAI v3.9.0 · jamesrochabrun/SwiftOpenAI · GitHub

1 Like

I keep getting this error when using the new model gpt-4o-audio-preview:
TypeError: Completions.create() got an unexpected keyword argument 'modalities'

1 Like

You will need to upgrade your openai python module to the latest, and use the correct methods of the library, as shown in the API reference.

The python library blocks parameters it doesn’t know about.

>>> from openai import Client; c=Client()
>>> r = c.chat.completions.create(model="xx",
...     messages=[{"role": "user", "content": "xx"}],
...     invalidparameter=["text", "audio"])
Traceback (most recent call last):
  File "<pyshell#14>", line 1, in <module>
    r = c.chat.completions.create(model="xx",
  File "C:\Program Files\Python311\Lib\site-packages\openai\_utils\_utils.py", line 274, in wrapper
    return func(*args, **kwargs)
TypeError: Completions.create() got an unexpected keyword argument 'invalidparameter'

1 Like

This functionality is also available in Java.

How do you just pull out the text from the response?
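When audio output is enabled with `modalities=["text","audio"]`, the assistant message's `content` field is typically `None` and the text comes back as the audio transcript instead. A sketch of a helper that handles both cases (attribute names assume the Python SDK's response objects):

```python
def response_text(completion) -> str:
    # With audio output enabled, message.content is usually None and the
    # spoken text is returned as message.audio.transcript instead.
    message = completion.choices[0].message
    if getattr(message, "audio", None) is not None:
        return message.audio.transcript
    return message.content
```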

A post was split to a new topic: Issues with gpt-4o-audio-preview when using tools/functions