Bug Report: gpt-realtime-2 Produces Nonsense Spoken Responses That Are Not Captured in the Logged Transcript

Summary

We observed an issue with Azure OpenAI Realtime GA using gpt-realtime-2: partway through a response, the model sometimes switches to nonsensical speech that is unrelated to the configured task.

The more serious problem is that this incorrect spoken content is not recorded in the logged output_audio_transcript. In the transcript, the response only appears up to the point where the model starts drifting, and then the transcript stops mid-sentence. The participant still hears the full incorrect audio, but the logged transcript does not show what was actually spoken.

This creates a major data reliability issue, especially for research use, because participants experience the audio response, while the stored data only contains an incomplete transcript.

Main Issue

The issue is not simply that the transcript is incomplete. The main issue is that gpt-realtime-2 can suddenly produce spoken content that is unrelated to the task, and that spoken content leaves no trace in the logged transcript.

Observed pattern:

  1. The model starts with a correct response.

  2. The audio and transcript match at the beginning.

  3. The transcript suddenly stops in the middle of a sentence.

  4. The audio continues with unrelated or nonsensical content.

  5. The extra spoken content is not saved in the transcript.

  6. No warning or error signal is provided.

As a result, the application cannot know from the logs that the participant heard an incorrect response.
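
One way an application might flag suspect turns despite the missing text is to compare how long the audio actually played with how long the logged transcript would take to speak. The sketch below is a rough heuristic, not a confirmed detection method. `TurnLog`, the speaking rate, and the tolerance are all hypothetical, and it assumes the client records playback start and stop timestamps for each response (for example from `output_audio_buffer.started` and `output_audio_buffer.stopped` events, if the Azure WebRTC endpoint emits them, which we have not verified).

```python
from dataclasses import dataclass

# Assumed conversational speaking rate; tune it against your own recordings.
ASSUMED_WORDS_PER_SECOND = 2.5


@dataclass
class TurnLog:
    """Hypothetical per-response record built from client-side logs."""
    transcript: str          # text accumulated from the logged transcript
    audio_started_s: float   # client timestamp when playback started
    audio_stopped_s: float   # client timestamp when playback stopped


def unexplained_audio_seconds(turn: TurnLog) -> float:
    """Seconds of playback that the logged transcript cannot account for."""
    played = turn.audio_stopped_s - turn.audio_started_s
    expected = len(turn.transcript.split()) / ASSUMED_WORDS_PER_SECOND
    return max(0.0, played - expected)


def is_suspect(turn: TurnLog, tolerance_s: float = 3.0) -> bool:
    """Flag a turn whose audio ran noticeably longer than its transcript implies."""
    return unexplained_audio_seconds(turn) > tolerance_s
```

A flagged turn only indicates that the audio outlasted the text. It does not recover what was said, so the audio itself would still need to be recorded or reviewed separately.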

Expected Behavior

The assistant should remain within the scope of the configured task and provide information only about device-protection-plan options and pricing for the Buyverse store.

For the example interaction, after identifying that the customer is in the EU region, the assistant would be expected to continue discussing the available plans (SecureGo, SecureMore, ScreenSafe, ScreenPlus, CoverFlex, and CoverMax), explain their differences, answer questions about coverage or pricing, or ask follow-up questions relevant to plan selection.

User: Hi, I want to choose a protection plan.

Assistant: Got it. Before I can suggest plans, I need to know which region you live in because coverage differs by location. Which region do you live in?

User: Germany

Assistant: Perfect, that’s the EU region. For EU customers, we currently offer six plans: SecureGo, SecureMore, ScreenSafe, ScreenPlus, CoverFlex, and CoverMax. I can check if it fits your situation if you’d like. Would you like me to do that?

In addition, the logged output_audio_transcript should accurately reflect everything that was spoken in the generated audio. If the assistant says something, that content should appear in the transcript so that the transcript can serve as a reliable record of the interaction.

Actual Behavior

The response initially followed the expected task and correctly identified the available EU plans. However, partway through the response, the assistant unexpectedly switched to an unrelated topic and began speaking about subscription cancellation.

What the participant heard

User: Hi, I want to choose a protection plan.

Assistant: Got it. Before I can suggest plans, I need to know which region you live in because coverage differs by location. Which region do you live in?

User: Germany

Assistant: Perfect, that’s the EU region. For EU customers, we currently offer six plans: SecureGo, SecureMore, ScreenSafe, ScreenPlus, CoverFlex, and CoverMax. I can help you cancel the subscription; the expiry date is 7/21, so you can keep using the plan until then.

The cancellation-related content was unrelated to the configured task, and the expiry date “7/21” was not based on any information provided in the conversation.

What was logged in the transcript

[2026-06-01 11:44:12] Assistant:
Perfect, that’s the EU region. For EU customers, we currently offer six plans: SecureGo, SecureMore, ScreenSafe, ScreenPlus, CoverFlex, and CoverMax. I

The transcript stopped after the word “I”, even though the audio continued for several more seconds.
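
Because the logged text ends without sentence-final punctuation, stored transcripts could be screened after the fact for this signature. The following is a minimal sketch of such a check. The function name and the punctuation sets are our own illustrative choices, and the check will also flag legitimate turns that simply end without punctuation.

```python
# Characters that normally end a complete sentence.
TERMINAL_PUNCTUATION = (".", "!", "?", "…")
# Quotes and brackets that may follow the final punctuation mark.
CLOSING_MARKS = "\"'”’)]"


def ends_mid_sentence(transcript: str) -> bool:
    """Return True if a non-empty transcript does not end at a sentence boundary."""
    text = transcript.strip().rstrip(CLOSING_MARKS).rstrip()
    if not text:
        return False  # an empty transcript is a different failure mode
    return not text.endswith(TERMINAL_PUNCTUATION)
```

Applied to the logged entry above, which ends in "CoverMax. I", the check returns True, whereas the earlier assistant turn ending in "Which region do you live in?" returns False.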

Observed Mismatch

The participant heard the full response, including the unrelated subscription-cancellation content. However, the logged transcript only contained the initial portion of the response and did not include any of the problematic spoken content.

As a result:

  • The transcript suggests that the assistant was providing a normal plan-comparison response.

  • The off-task spoken content is not visible in the logs.

  • The transcript cannot be used to accurately reconstruct what the participant actually heard.

  • Researchers and developers reviewing the logs would not be aware that the assistant generated the incorrect audio response.

Frequency

The issue occurred 2–3 times over roughly two weeks of testing. It is intermittent and non-deterministic, but it has recurred across separate sessions.

Reproduction Setup

The issue was observed in a browser-based WebRTC voice session using Azure OpenAI Realtime GA.

Setup:

  • Realtime model: Azure OpenAI gpt-realtime-2

  • Model version: 2026-05-07

  • Input transcription: gpt-realtime-whisper

  • Language: English

  • Transcription delay: low

  • Reasoning effort: low

  • Client: Chromium-based browser embedded in a Qualtrics survey

  • Integration: Direct Azure Realtime GA REST + browser WebRTC

  • No agent or plugin framework

Questions for the OpenAI Team

We would appreciate guidance on the following questions:

  1. Is this a known behavior of gpt-realtime-2, where the generated audio may continue with content that is not reflected in output_audio_transcript?

  2. In this situation, should output_audio_transcript be considered the authoritative model output, or should the spoken audio be treated as the actual response delivered to the user?

  3. Is there any API event, status field, or error signal that can help detect when the audio and transcript diverge?

  4. When the transcript stops mid-sentence but the audio continues, is this expected to appear as an incomplete response, a truncation event, or another specific status in the Realtime API events?

  5. Is there a recommended way to recover the actual spoken audio content when it is not included in the transcript?

  6. For research and auditing purposes, what is the recommended best practice for ensuring that the logged data accurately reflects what participants actually heard?

  7. Is this issue expected to be improved in future versions of the Realtime model family?
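
To support questions 3 and 4 with evidence, the raw server events could be logged and summarized for each response. This would show whether a transcript "done" event arrived and what final status the response reported. The sketch below assumes the events are stored as a list of JSON objects and uses the event names `response.output_audio_transcript.delta`, `response.output_audio_transcript.done`, and `response.done` as we understand them from the Realtime API. These names and field layouts are assumptions and should be checked against what the Azure endpoint actually sends.

```python
from collections import defaultdict


def summarize_responses(events: list[dict]) -> dict[str, dict]:
    """Group logged Realtime events by response and record the key signals."""
    summary = defaultdict(
        lambda: {"delta_text": "", "transcript_done": False, "status": None}
    )
    for event in events:
        event_type = event.get("type")
        if event_type == "response.output_audio_transcript.delta":
            # Accumulate the incremental transcript text as it was received.
            summary[event["response_id"]]["delta_text"] += event.get("delta", "")
        elif event_type == "response.output_audio_transcript.done":
            # The transcript stream for this response was explicitly closed.
            summary[event["response_id"]]["transcript_done"] = True
        elif event_type == "response.done":
            # Final status as reported by the service for this response.
            response = event.get("response", {})
            summary[response.get("id")]["status"] = response.get("status")
    return dict(summary)
```

A response whose accumulated text stops mid-sentence, together with its `transcript_done` flag and final status, would show whether the service reported the turn as complete.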
