Hey! I’ve been struggling for hours to pass a system prompt (text) and user input (audio) to get a text output, but it keeps failing. I’m wondering what I’m doing wrong.
I’m getting:
{
"error" : {
"param" : "messages.[1].content.[0].input_audio.data",
"message" : "The data provided for 'input_audio' is not of valid mp3 format.",
"code" : "invalid_value",
"type" : "invalid_request_error"
}
}
I only recently found out about the audio-as-input ability, at DevDay actually, and there I was told that if you provide audio as input you get audio as output. So I’m not sure about that.
You can see it here: https://openai.com/index/introducing-the-realtime-api/
And I think it got updated since:
Audio in the Chat Completions API will be released in the coming weeks, as a new model gpt-4o-audio-preview. With gpt-4o-audio-preview, developers can input text or audio into GPT-4o and receive responses in text, audio, or both.
That’s correct. Just remove “audio” from the modalities list as well, and you’ll only get text out.
Here’s some sample Python code I wrote as a PoC:
import base64
from openai import OpenAI

client = OpenAI()

def read_audio_file(filepath: str) -> bytes:
    """Read the raw bytes of an audio file."""
    with open(filepath, "rb") as audio_file:
        return audio_file.read()

# Read and encode the audio data as base64
audio_data = read_audio_file("PATH_TO_AUDIO_INPUT.WAV")
encoded_string = base64.b64encode(audio_data).decode("utf-8")

# Send the encoded audio as a base64 string in `input_audio`;
# modalities=["text"] means we get text back only, no audio
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text"],
    messages=[
        {
            "role": "system",
            "content": "Your job is to transcribe user audio and tell the general tone it is in",
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is in this recording?",
                },
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": encoded_string,  # use the base64 string here
                        "format": "wav",  # must match the actual audio format
                    },
                },
            ],
        },
    ],
)

# Print the response
print(completion.choices[0].message.content)
Yes! I’ve done it now and it’s working, and fast. My only remaining challenge is structured output. Currently it seems like it’s not possible to get it in a single API request, so I’ll do a first request to get data that’s close to the structure I want, then a second call to gpt-4o-mini with structured output and the initial output as input, and hopefully it can transform it quickly.
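Roughly what I have in mind for the second call, using the structured-output parse helper with a placeholder schema (the field names and the first_response_text variable are just examples):

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

# Placeholder schema; swap in whatever structure you actually need
class RecordingAnalysis(BaseModel):
    transcript: str
    tone: str

# `first_response_text` holds the text output of the audio call above
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Extract the transcript and tone from the text."},
        {"role": "user", "content": first_response_text},
    ],
    response_format=RecordingAnalysis,
)
print(completion.choices[0].message.parsed)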
What do you think?
If you supply a primer, the model will follow it. Something like:
{$Your_prompt}
---------------------------------
Please output your response in JSON format only, no other commentary; the JSON should be in this structure:
{
"some field": "some value",
"some other field": "some other value"
}
---------------------------------
Obviously, fill in the JSON structure with whatever you need.
You can also filter the output with a regex to remove ``` backticks, leading whitespace, trailing whitespace, etc.
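For example, a quick cleanup along those lines (just one way to write the pattern):

import re

def strip_fences(text: str) -> str:
    # Remove a leading ```json / ``` fence and a trailing ``` fence, if present
    text = re.sub(r"^\s*```(?:json)?\s*", "", text)
    text = re.sub(r"\s*```\s*$", "", text)
    return text.strip()

cleaned = strip_fences(completion.choices[0].message.content)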
I would look closely at the costs, and the lack of benefit, of doing “listening” with GPT-4o-realtime.
When you are paying $100 per million input tokens, using Whisper to get a transcript and sending that to any model of your choice becomes an easier decision to make.
Then you have no doubt about the modality you will get back, instead of gambling on realtime, where you can’t set it.
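A rough sketch of that transcribe-first approach, assuming whisper-1 on the transcriptions endpoint:

from openai import OpenAI

client = OpenAI()

# Step 1: transcribe the audio with Whisper
with open("PATH_TO_AUDIO_INPUT.WAV", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# Step 2: send the transcript to any text model you like;
# the output modality is never in question
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Summarize the user's message."},
        {"role": "user", "content": transcript.text},
    ],
)
print(completion.choices[0].message.content)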