Can GPT-4o directly analyze audio without depending on a transcript?

I’d like to use GPT-4o to analyze audio based on my prompt, for example: what’s the duration of the audio, how many speakers are in the audio, is there any background noise in the recording? Does GPT-4o-audio-preview have that ability? Or is it only a two-step process rather than one-shot, where the audio is first turned into a transcript and then the text is analyzed?
If GPT-4o-audio-preview can do this, does GPT-4o-realtime-preview have the same ability?
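To make the one-shot idea concrete, this is roughly what I have in mind with the Chat Completions API and gpt-4o-audio-preview (the file name and questions are just placeholders; whether the model can actually answer them reliably is exactly what I’m asking):

```python
import base64
from openai import OpenAI

client = OpenAI()

# Placeholder clip; any wav/mp3 file would do
with open("sample.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text"],  # I only need a text answer, no audio reply
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "How long is this audio, how many speakers are there, "
                            "and is there any background noise?",
                },
                {
                    "type": "input_audio",
                    "input_audio": {"data": audio_b64, "format": "wav"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

That would be one-shot; the alternative I want to avoid is transcribing first and then analyzing only the text.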


I can only speak for the Realtime API, as I haven’t used GPT-4o-audio.

The Realtime API is NOT able to tell you how long an audio clip is, how many speakers there are, or whether there is background noise.
There is no text involved with the model; it’s true audio-to-audio.

Hope that helps! :hugs:

Thanks. I also tried the realtime model and API. My experience is that I can send both audio and text, and the text responses can surface information from the audio, but the model can’t directly analyze the audio itself. It seems like a two-step process rather than a one-shot solution.
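For context, this is a simplified version of what I tried over the Realtime API websocket (assuming raw PCM16 mono at 24 kHz and the default event shapes; the clip path and question are placeholders):

```python
import asyncio, base64, json, os
import websockets  # header kwarg is extra_headers in older versions, additional_headers in newer ones

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

async def ask_about_audio(pcm16_bytes: bytes) -> None:
    async with websockets.connect(URL, extra_headers=HEADERS) as ws:
        # Turn off server VAD so the input buffer can be committed manually
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"turn_detection": None},
        }))

        # Push the raw PCM16 (24 kHz mono) audio into the input buffer
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm16_bytes).decode("utf-8"),
        }))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))

        # Ask a text question about the audio and request a text-only reply
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {
                "modalities": ["text"],
                "instructions": "How many speakers are in the audio I just sent? "
                                "Is there any background noise?",
            },
        }))

        async for message in ws:
            event = json.loads(message)
            if event["type"] == "response.text.delta":
                print(event["delta"], end="", flush=True)
            elif event["type"] == "response.done":
                break

# asyncio.run(ask_about_audio(open("clip.pcm", "rb").read()))
```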

I’m running GPT on internal resources that, as of now, only have the realtime-preview model, not the audio-preview model. Most likely, the realtime-preview model can’t do what I want. I’m not sure whether the audio-preview model has the capability.
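If it does come down to two steps, this is roughly the fallback I have in mind: read the duration from the file itself, transcribe with whisper-1, then ask a text model about the transcript. The file path and prompt are placeholders, and speaker count or background noise can only be guessed from the words, which is exactly the limitation I’d like to avoid:

```python
import wave
from openai import OpenAI

client = OpenAI()
path = "sample.wav"  # placeholder

# Duration is easier to get from the file itself than from any model
with wave.open(path, "rb") as w:
    duration_s = w.getnframes() / w.getframerate()

# Step 1: audio -> transcript
with open(path, "rb") as f:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

# Step 2: transcript -> text analysis (speaker count / background noise can
# only be inferred from the words, not from the sound itself)
answer = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": f"The audio is {duration_s:.1f} seconds long. Based on this "
                   f"transcript, how many speakers do there seem to be?\n\n{transcript.text}",
    }],
)
print(answer.choices[0].message.content)
```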
