Realtime API Audio Analysis Capabilities

Hi there,

I’m curious about how the real-time API handles audio analysis. For use cases like sentiment analysis based on vocal emotion, does the model have access to the entire conversation’s audio history, or does it only process the most recent user message?

For example, let’s say after a few minutes of back-and-forth conversation, we ask the model to evaluate the overall sentiment of the interaction—will it use only the last audio message, or is it able to analyze the full conversation audio context?

Thanks in advance for your help!

Conversation state is maintained for the 30-minute life of a session. It grows with every user input and assistant output, with the audio encoded into tokens for the model's understanding.
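
For context, here is a minimal sketch of the kind of client events that build up that state over a WebSocket session. The event names follow the Realtime API docs as I understand them; everything else (chunking, helper names) is illustrative only:

```python
# Sketch only: the shape of client events that grow a Realtime session's
# conversation state. Event names follow the Realtime API docs as I
# understand them; helper names and audio handling are illustrative.
import base64
import json

def audio_append_event(pcm16_chunk: bytes) -> str:
    # Each appended chunk is buffered server-side until committed.
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm16_chunk).decode("ascii"),
    })

def commit_turn_event() -> str:
    # Committing the buffer turns it into a user message item in the
    # session's conversation state, where it remains for the session's life.
    return json.dumps({"type": "input_audio_buffer.commit"})

def response_event() -> str:
    # Asks the model to respond using everything accumulated so far.
    return json.dumps({"type": "response.create"})
```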

So the AI can provide understanding-based analysis of a conversation to a degree, based on the language itself. However, it doesn't have training on inferring "emotions" from vocal delivery beyond concrete, audible cues, such as counting your laughs or noticing a cough. In the early "hey chat" demos, about half a year before release, you can see OpenAI staffers making deliberate noises so the model would acknowledge them when asked in a follow-up question: "yes, you coughed…".

Audio also consumes the AI's attention in tokens at a much higher rate than text, so I would expect poor performance on a "review everything I said" type of task.
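
As a rough back-of-envelope illustration only (the per-second figure below is a hypothetical placeholder, not a published number; check the current docs and pricing for the real rate of your model):

```python
# Back-of-envelope sketch of why a long audio history gets expensive.
# The 10 tokens/second figure is an assumed placeholder, not a published
# spec; substitute the real rate for your model.
AUDIO_TOKENS_PER_SECOND = 10          # assumed, for illustration
conversation_minutes = 5

audio_tokens = conversation_minutes * 60 * AUDIO_TOKENS_PER_SECOND
# A text transcript of the same five minutes might be ~750 words,
# i.e. very roughly 1,000 text tokens.
approx_text_tokens = 1_000

print(f"~{audio_tokens} audio tokens vs ~{approx_text_tokens} text tokens")
```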

You can certainly try it out; however, you will likely find the AI fabricating the kind of answer such a question seems to expect, if not denying the ability outright (unless you give system messages to override the stated inability).
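
If you do experiment, here is a sketch of the kind of instruction override I mean, followed by a text question asking for overall sentiment. Again, the event shapes follow the Realtime API docs as I understand them; the instruction wording is just an example:

```python
# Sketch: steer the session so the model attempts the analysis instead of
# denying it, then ask for overall sentiment as a text question.
# Event shapes follow the Realtime API docs as I understand them; the
# instruction and question wording are illustrative only.
import json

steer_session = json.dumps({
    "type": "session.update",
    "session": {
        "instructions": (
            "You can hear the user's audio. When asked, describe the "
            "overall tone and sentiment of the conversation so far, and "
            "say clearly when you are unsure."
        )
    },
})

ask_for_sentiment = json.dumps({
    "type": "conversation.item.create",
    "item": {
        "type": "message",
        "role": "user",
        "content": [{
            "type": "input_text",
            "text": (
                "Looking back over our whole conversation, how would you "
                "describe my overall sentiment and tone?"
            ),
        }],
    },
})

request_response = json.dumps({"type": "response.create"})
```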