Can GPT-4o analyze audio like it does with pictures?

I know that GPT-4o can analyze images: when you give it an image, it can describe what it “sees”, turn that image into a color palette, suggest a new color that will match the image, etc. I’m wondering if it can do the same for audio, e.g. recognise music genres, provide music feedback, etc. I tried to use it this way and it didn’t seem to work.


Hi, GPT-4o is “multi-modal”, so yes, it understands images, audio, and video in theory.

That’s what the “o” is for, “omni.”

But as for it working, I dunno. 🤷 I think “audio” might still be turned into a transcript, so it’s not “true” audio analysis, as in understanding a waveform.

The demos posted on OpenAI’s website and socials when the model was revealed appear to actually use audio as input, seeing as the model can recognise tone and differentiate speakers based on voice alone, which wouldn’t be possible with classic transcription methods (like Whisper).

The extent of its capabilities isn’t fully known yet, since the audio input/output modalities still haven’t been released to the public. I’m absolutely looking forward to testing out its musical abilities, although I wouldn’t be surprised if they end up being kinda bad. GPT models have been really bad with anything related to music before; it’s probably not a priority for OpenAI.