Is it possible to get both audio and text output from gpt-4o-audio-preview?
For example, I want it to generate a question as audio output and its type as text output. How can this be achieved?
The model responds however it chooses: either as plain text, or as voice with a transcript available.
The "choosing" depends on whether you continue a voice-only conversation or revert to text inputs for the assistant and user replies in the chat.
If, internally, the audio modality is one kind of "language" the AI can write as tokens, and text-encoded tokens are another kind of output you can receive, then the model could in principle generate "mixed media" (much as it would when generating images while talking about them, a capability that has not been released).
However, you cannot train the voice AI in-context with assistant messages, and you cannot even instruct what will be produced in response to your voice or text input, so voice paired with different text seems an impossibility.
You'll likely need to send the user input, processed by Whisper and/or the AI's transcript, to a separate AI classification call if you wish to receive a "type" based on it.
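A minimal sketch of that second classification step, assuming you already have the transcript text. The category labels and the helper names here are hypothetical illustrations, not part of any official API; only the `chat.completions.create` call itself is the real OpenAI SDK method.

```python
# Hedged sketch: classify a question transcript with a second, text-only
# chat completion. Labels and function names are illustrative assumptions.

CATEGORIES = ["multiple-choice", "open-ended", "true-or-false"]  # hypothetical labels

def build_classifier_messages(transcript):
    """Build the messages payload asking the model to pick one label."""
    return [
        {
            "role": "system",
            "content": (
                "Classify the question into exactly one of: "
                + ", ".join(CATEGORIES)
                + ". Reply with the label only."
            ),
        },
        {"role": "user", "content": transcript},
    ]

def classify_question(client, transcript):
    """client is an openai.OpenAI instance; returns the predicted label."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # any text-capable model works here
        messages=build_classifier_messages(transcript),
    )
    return completion.choices[0].message.content.strip()
```

You would call `classify_question(client, transcript)` with the transcript obtained from the audio response (or from Whisper on the user's input), and receive the "type" as ordinary text.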
Welcome to the dev forum @anteatereater
AFAIK, it's not currently possible to get audio plus an independent text output at the same time. You will, however, get a transcript of the audio response.
Here’s an example:
import base64
from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": "Is a golden retriever a good family dog?"
        }
    ]
)

# The transcript of the spoken answer is returned alongside the audio
print(completion.choices[0].message.audio.transcript)

# The audio itself arrives base64-encoded; decode and save it as a WAV file
wav_bytes = base64.b64decode(completion.choices[0].message.audio.data)
with open("dog.wav", "wb") as f:
    f.write(wav_bytes)