[Realtime API] Agent responding to microphone input that did not become part of transcription

I have noticed curious cases where the agent was clearly responding to background dialog (e.g. a podcast interview playing nearby), but that dialog never made it into the transcription update returned by the server.

It is as if the model has access to the unfiltered microphone input, while the transcription returned to the client passes through an additional filter of some kind. The model is also most definitely reacting to emotion and loudness in the voice, neither of which shows up in the transcript.

Has anybody else noticed something similar? In my opinion, the transcription returned to the client should be more detailed and include all of this.
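For context, this is roughly how I pull the user-audio transcripts out of the server events on the websocket (a minimal sketch; the event type and `transcript` field follow the Realtime API docs for `conversation.item.input_audio_transcription.completed`, assuming the session was configured with `input_audio_transcription` enabled):

```python
import json

def extract_user_transcript(raw_event: str):
    """Return the transcript string when the event is a completed
    input-audio transcription; return None for any other event type."""
    event = json.loads(raw_event)
    if event.get("type") == "conversation.item.input_audio_transcription.completed":
        return event.get("transcript")
    return None

# Example events as they arrive on the websocket (shapes per the docs):
done = json.dumps({
    "type": "conversation.item.input_audio_transcription.completed",
    "item_id": "item_123",
    "transcript": "hello there",
})
other = json.dumps({"type": "response.audio.delta", "delta": "..."})

print(extract_user_transcript(done))   # "hello there"
print(extract_user_transcript(other))  # None
```

With this in place I log every transcript the server sends, and the background dialog the agent reacted to simply is not in any of them.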