How can I pass a system prompt and audio user input to get a text output back?

Hey! I’ve been struggling for hours to pass a system prompt (text) and user input (audio) to get a text output, but it keeps failing. I’m wondering what I’m doing wrong.

I’m getting:

{
  "error" : {
    "param" : "messages.[1].content.[0].input_audio.data",
    "message" : "The data provided for 'input_audio' is not of valid mp3 format.",
    "code" : "invalid_value",
    "type" : "invalid_request_error"
  }
}

I’m using simple code to convert it to base64.

Here is my request:

guard let url = URL(string: "https://api.openai.com/v1/chat/completions") else {
    throw AnalyzeServiceError.invalidURL
}

// Create the request
var request = URLRequest(url: url)
request.httpMethod = "POST"

// Add headers
request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
request.setValue("application/json", forHTTPHeaderField: "Content-Type")

guard var base64Voice = convertAudioToBase64(fileURL: url) else {
    throw AnalyzeServiceError.noAudioData
}
base64Voice = base64Voice.trimmingCharacters(in: .whitespacesAndNewlines)

// Create the request body
let requestBody: [String: Any] = [
    "model": "gpt-4o-audio-preview",
    "modalities": ["text", "audio"],
    "audio": ["voice": "alloy", "format": "mp3"],
    "messages": [
        ["role": "system", "content": "You are a sound expert. Analyze user input audio."],
        [
            "role": "user",
            "content": [
                [
                    "type": "input_audio",
                    "input_audio": [
                        "data": base64Voice,
                        "format": "mp3"
                    ]
                ]
            ]
        ]
    ],
    "temperature": 0.0
]

func convertAudioToBase64(fileURL: URL) -> String? {
    do {
        // Load the MP3 audio file data
        let audioData = try Data(contentsOf: fileURL)

        // Convert to base64
        let base64String = audioData.base64EncodedString()

        return base64String
    } catch {
        print("Error loading audio file: \(error.localizedDescription)")
        return nil
    }
}

Any clue why it is failing?

Perhaps I’m reading the code wrong, but you seem to be sending the model its own output as input before that output has been created.

The url points to the completions endpoint and not a valid mp3 file; did you mean to do that?

2 Likes

Hi @admin215

If you want just to get a text output back, then you should remove this line:

"audio": ["voice": "alloy", "format": "mp3"],

Additionally, the error says that the input audio is not in valid mp3 format. If I were you, I’d be looking into fixing that.
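One quick way to rule out a bad payload is to decode the base64 locally and check the file’s magic bytes before sending it. Here’s a minimal sketch in Python (the function name is mine; the byte signatures are the standard MP3 ones, an ID3v2 tag or an MPEG frame sync):

```python
import base64

def looks_like_mp3(b64_payload: str) -> bool:
    """Decode a base64 payload and check for MP3 magic bytes."""
    try:
        raw = base64.b64decode(b64_payload, validate=True)
    except Exception:
        return False  # not even valid base64
    if len(raw) < 3:
        return False
    # ID3v2 tag at the start, or a raw MPEG frame sync (11 set bits)
    if raw[:3] == b"ID3":
        return True
    return raw[0] == 0xFF and (raw[1] & 0xE0) == 0xE0

# A JSON response body encodes to valid base64 but is not an MP3:
print(looks_like_mp3(base64.b64encode(b'{"error": 1}').decode()))        # False
print(looks_like_mp3(base64.b64encode(b"ID3\x04\x00rest-of-file").decode()))  # True
```

If this returns False for your payload, the problem is in whatever produced the bytes, not in the API request itself.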

2 Likes

Hey @Foxalabs

I basically tried to replicate what I saw in the docs:
https://platform.openai.com/docs/guides/audio?lang=curl&audio-generation-quickstart-example=audio-in
I assumed I could get text back (I saw something about it in the OpenAI blog post).

When I remove this line I get:

{
  "error" : {
    "param" : "modalities[1]",
    "message" : "`audio` modality requires an `audio` output configuration.",
    "code" : "missing_audio",
    "type" : "invalid_request_error"
  }
}

Only recently found out about the audio-as-input ability, at DevDay actually, and there I was told that if you provide audio as input you get audio as output. So, I’m not sure on that.

1 Like

You can see it here:
https://openai.com/index/introducing-the-realtime-api/
And I think it got updated since:
Audio in the Chat Completions API will be released in the coming weeks, as a new model, gpt-4o-audio-preview. With gpt-4o-audio-preview, developers can input text or audio into GPT-4o and receive responses in text, audio, or both.

Right, but you seem to be trying to use the model’s own output as input; that’s a circular reference that can’t work.

guard var base64Voice = convertAudioToBase64(fileURL: url)

and

guard let url = URL(string: "https://api.openai.com/v1/chat/completions")

The output from that endpoint might indeed be mp3, but it won’t be there prior to making the API call… so the input will be invalid.

3 Likes

Wow @Foxalabs! I can’t believe I missed it. You are correct, it’s now working.
I would never have caught it haha, thank you so much.

Thank you so much!!

1 Like

@Foxalabs , to verify, this won’t support structured outputs, right? I’m trying and it’s failing.

That’s correct. Just remove the “audio” entry from the modalities list as well, and you’ll only get the text out.

Here’s some sample Python code I wrote as a PoC:

import base64
from openai import OpenAI

client = OpenAI()

def read_audio_file(filepath: str):
    with open(filepath, "rb") as audio_file:
        return audio_file.read()

# Read and encode the audio data as base64
audio_data = read_audio_file("PATH_TO_AUDIO_INPUT.WAV")
encoded_string = base64.b64encode(audio_data).decode("utf-8")

# Send the encoded audio as base64 string in the `input_audio`
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text"],
    messages=[
        {
            "role": "system",
            "content": "Your job is to transcribe user audio and tell the general tone it is in",
        },
        {
            "role": "user",
            "content": [
                { 
                    "type": "text",
                    "text": "What is in this recording?"
                },
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": encoded_string,  # Use the base64 string here
                        "format": "wav",
                    },
                }
            ],
        },
    ],
)

# Print the response
print(completion.choices[0].message.content)
2 Likes

Yes! I’ve done it now and it’s working, and fast. My only remaining challenge is structured output. Currently it seems it’s not possible to get it in a single API request, so I’ll make a first request to get data close to the structure I want, then a second call to gpt-4o-mini with structured output, passing the initial output as input, and hopefully it can transform it quickly.
What do you think?

Yes, structured outputs aren’t currently supported on the audio-preview model. The best it can do right now is tool calls.

1 Like

If you supply a primer then it will follow that, something like

{$Your_prompt}

---------------------------------

Please output your response in JSON format only, no other commentary; the JSON should be in this structure:

{
    "some field": "some value", 
    "some other field": "some other value"
}

---------------------------------

Obviously fill in the JSON structure with whatever you need.

You can also filter the output with a regex to remove ``` backticks, leading whitespace, trailing whitespace, etc.
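A small sketch of that cleanup step in Python, assuming the model sometimes wraps its JSON in a markdown fence (the function name and regexes here are illustrative, not from any SDK):

```python
import json
import re

def extract_json(reply: str) -> dict:
    """Strip an optional markdown code fence plus whitespace, then parse JSON."""
    cleaned = reply.strip()
    # Remove a leading ```json / ``` fence and a trailing ``` fence, if present
    cleaned = re.sub(r"^```(?:json)?\s*", "", cleaned)
    cleaned = re.sub(r"\s*```$", "", cleaned)
    return json.loads(cleaned)

reply = """```json
{"some field": "some value", "some other field": "some other value"}
```"""
print(extract_json(reply)["some field"])  # some value
```

Parsing with `json.loads` rather than trusting the raw string also gives you an exception to catch (and retry on) when the model strays from the primer.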

2 Likes

I would look closely at the costs, and the lack of benefit, of doing “listening” with GPT-4o-realtime.

When you are paying $100 per million input tokens, using Whisper to get a transcript and instead sending that to any model of your choice becomes an easier decision to make.

Then you have no doubt about the modality you will get back, instead of gambling on realtime, where you can’t set it.
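A rough back-of-envelope comparison; the only figure from above is the $100/M input tokens, and every other number below is an assumption for illustration, not official pricing:

```python
# Rough cost comparison: audio-token input vs. transcribe-then-text.
# ASSUMPTIONS (illustrative only, not official pricing):
AUDIO_INPUT_PER_M_TOKENS = 100.00  # $/1M audio input tokens, figure quoted above
AUDIO_TOKENS_PER_MINUTE = 1_000    # assumed audio tokens per minute of speech
WHISPER_PER_MINUTE = 0.006         # assumed transcription cost per minute
TEXT_INPUT_PER_M_TOKENS = 0.15     # assumed cheap text model $/1M input tokens
TEXT_TOKENS_PER_MINUTE = 150       # assumed transcript tokens per spoken minute

minutes = 60  # one hour of audio

audio_route = minutes * AUDIO_TOKENS_PER_MINUTE / 1e6 * AUDIO_INPUT_PER_M_TOKENS
text_route = minutes * (WHISPER_PER_MINUTE
                        + TEXT_TOKENS_PER_MINUTE / 1e6 * TEXT_INPUT_PER_M_TOKENS)

print(f"audio-in route:     ${audio_route:.2f}")  # $6.00 under these assumptions
print(f"whisper+text route: ${text_route:.2f}")
```

Even if the assumed rates are off by a wide margin, the gap between the two routes is large enough that the transcribe-first approach usually wins when you only need text analysis.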

The issue is that I need to analyze the actual voice data; it seems like that’s the only way…