Transcribing a live call (streaming audio from a Twilio call): OpenAI transcription starts to degrade after 1-4 minutes

Hello. We are experiencing significant performance degradation when using the OpenAI real-time transcription API for call transcriptions.

The issue consistently occurs after 1 to 4 minutes of continuous streaming:

  • Transcriptions start returning with noticeable delays relative to the real-time audio.
  • The latency between spoken sentences and transcription output increases gradually over time.
  • In some cases, the system returns transcriptions for entire segments with a large delay (several seconds after the actual speech).
  • The transcription quality also seems to degrade, with incomplete or contextually incorrect transcriptions appearing after extended usage.
  • This happens with our input audio format (g711_ulaw, 8000 Hz sample rate) and with turn_detection correctly configured (type semantic_vad).
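
For reference, here is a minimal sketch of a session configuration matching the settings named above. Only g711_ulaw and semantic_vad come from this report. The event name transcription_session.update and the field layout follow the beta Realtime transcription session format as we understand it and may differ by API version. The model name is a placeholder assumption, and the helper function name is hypothetical.

```python
import json


def build_transcription_session_update(model: str = "gpt-4o-transcribe") -> str:
    """Build the JSON text of a session update event for a transcription session.

    Assumptions: the event type and field names follow the beta Realtime
    transcription format; `model` is a placeholder, not taken from the report.
    """
    event = {
        "type": "transcription_session.update",
        "session": {
            # Twilio media streams deliver 8 kHz G.711 u-law audio.
            "input_audio_format": "g711_ulaw",
            "input_audio_transcription": {"model": model},
            # Semantic VAD decides where a spoken turn ends.
            "turn_detection": {"type": "semantic_vad"},
        },
    }
    return json.dumps(event)
```

The returned string would be sent as a single text frame over the already-open WebSocket, before any audio is appended.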

We’ve confirmed that our WebSocket streaming pipeline maintains real-time audio delivery without backlog, and network conditions remain stable throughout the sessions.
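
One way to check the "no backlog" claim is to compare the amount of audio forwarded so far with the wall-clock time elapsed. The sketch below is illustrative only: the class name BacklogMonitor and its methods are hypothetical, and it assumes 8 kHz G.711 u-law audio (1 byte per sample, so 8000 bytes per second).

```python
from dataclasses import dataclass
from typing import Optional

ULAW_BYTES_PER_SECOND = 8000  # G.711 u-law: 8000 samples/s, 1 byte per sample


@dataclass
class BacklogMonitor:
    """Tracks how far forwarded audio lags behind wall-clock time (hypothetical helper)."""

    started_at: Optional[float] = None
    bytes_sent: int = 0

    def record(self, num_bytes: int, now: float) -> float:
        """Record one forwarded chunk; return the backlog in seconds.

        A value near 0 means audio is forwarded in real time; a growing
        positive value means the sender is falling behind.
        """
        if self.started_at is None:
            # The first chunk already contains audio captured before `now`.
            self.started_at = now - num_bytes / ULAW_BYTES_PER_SECOND
        self.bytes_sent += num_bytes
        audio_seconds = self.bytes_sent / ULAW_BYTES_PER_SECOND
        wall_seconds = now - self.started_at
        return wall_seconds - audio_seconds
```

In a live pipeline, record would be called with time.monotonic() each time a chunk is forwarded (for example, each 160-byte, 20 ms frame), and the returned value logged alongside the session age.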

Expected Behavior:

  • Real-time transcription output should remain responsive and consistent throughout the entire session, without noticeable latency increase or degradation in accuracy after several minutes of continuous streaming.
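
To quantify the latency growth, the time between the end of a speech turn and the arrival of its transcript can be tracked per item. The sketch below assumes the server events input_audio_buffer.speech_stopped and conversation.item.input_audio_transcription.completed, each carrying an item_id, as we understand the Realtime API event format. The class name and the is_growing heuristic are hypothetical. Because the speech_stopped event may itself arrive late, the measured value is a lower bound on the real delay.

```python
from typing import Dict, List, Optional


class TranscriptionLatencyTracker:
    """Measures seconds from end-of-speech event to transcript event (hypothetical helper)."""

    def __init__(self) -> None:
        self._stopped_at: Dict[str, float] = {}
        self.latencies: List[float] = []

    def handle_event(self, event: dict, received_at: float) -> Optional[float]:
        """Feed one parsed server event; return the latency when a transcript completes."""
        event_type = event.get("type")
        item_id = event.get("item_id")
        if not item_id:
            return None
        if event_type == "input_audio_buffer.speech_stopped":
            self._stopped_at[item_id] = received_at
        elif event_type == "conversation.item.input_audio_transcription.completed":
            stopped = self._stopped_at.pop(item_id, None)
            if stopped is not None:
                latency = received_at - stopped
                self.latencies.append(latency)
                return latency
        return None

    def is_growing(self, window: int = 3) -> bool:
        """True if the mean of the last `window` latencies exceeds the mean of the first `window`."""
        if len(self.latencies) < 2 * window:
            return False
        first = sum(self.latencies[:window]) / window
        last = sum(self.latencies[-window:]) / window
        return last > first
```

Logging these latencies against the session age should show whether the delay climbs steadily from the start or jumps after a certain point in the 1 to 4 minute range.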

Please advise if there are any session duration limitations, internal buffering mechanisms, or recommended practices (such as session renewal strategies) that could mitigate this issue.