Bug Report: Inconsistent speech_started / speech_ended Behavior with Semantic VAD (Realtime API)

Summary:
When using semantic VAD with the Realtime API, multiple consecutive speech_started events are emitted, but only the last one in the burst is followed by a corresponding speech_ended event. This results in merged transcriptions under the final item, even though the earlier speech_started items never receive a speech_ended signal.


Observed Behavior

  • Multiple speech_started events occur in rapid succession.

  • Each speech_started event has a unique item ID.

  • Only the final speech_started event receives a matching speech_ended event.

  • The transcription for the final item includes text that originated from all prior unended segments.

  • The speech_start_time of that final item does not match the true start of the first utterance, even though it contains the earlier speech content.

A detailed sample is attached in semantic_vad_bug.csv.
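The pairing problem described above can be detected on the client with a small bookkeeping helper. This is a minimal sketch, not the SDK's own mechanism. It assumes that the speech_started / speech_ended events in this report correspond to the server events `input_audio_buffer.speech_started` and `input_audio_buffer.speech_stopped`, and that each event carries an `item_id` field. The function name `find_unpaired_items` and the item IDs in the example are hypothetical.

```python
from typing import Iterable

# Assumed event type names for the post's speech_started / speech_ended.
STARTED = "input_audio_buffer.speech_started"
STOPPED = "input_audio_buffer.speech_stopped"


def find_unpaired_items(events: Iterable[dict]) -> list[str]:
    """Return item IDs that got a start event but no matching stop event."""
    open_items: dict[str, None] = {}  # insertion-ordered set of open item IDs
    for event in events:
        item_id = event.get("item_id")
        if event.get("type") == STARTED:
            open_items[item_id] = None
        elif event.get("type") == STOPPED:
            open_items.pop(item_id, None)
    return list(open_items)


# Hypothetical burst mirroring the report: three starts, only the last one stops.
events = [
    {"type": STARTED, "item_id": "item_A"},
    {"type": STARTED, "item_id": "item_B"},
    {"type": STARTED, "item_id": "item_C"},
    {"type": STOPPED, "item_id": "item_C"},
]
unpaired = find_unpaired_items(events)
print(unpaired)  # ['item_A', 'item_B']
```

Feeding the live event stream through a helper like this makes it easy to log which item IDs are left open after each burst.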


Expected Behavior

  • Each speech_started event should be paired with a corresponding speech_ended event.

  • Transcriptions should be correctly segmented per item ID and timestamp.

  • The speech_start_time should reflect the actual start of the speech segment.


Notes

  • This issue only occurs when using semantic VAD.

  • With server VAD, the pairing of speech_started and speech_ended events is consistent and accurate.


Questions

  1. Is this behavior expected due to semantic grouping logic in semantic VAD?

  2. If not, is a different handling pattern required for semantic VAD to keep speech_started / speech_ended events in sync?


Environment

  • Python: 3.11

  • OpenAI SDK: openai = "^1.107.3"

  • Runtime: asyncio

  • API: Realtime API with semantic VAD enabled

    • eagerness="auto"
    • interrupt_response=True
    • create_response=True
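For anyone trying to reproduce this, the settings listed here can be expressed as a turn-detection object. This is a sketch under assumptions: the three values are copied from the list above, but the flat `session.turn_detection` nesting inside a `session.update` event is assumed and may differ between API versions.

```python
import json

# Values copied from the Environment list in this report.
turn_detection = {
    "type": "semantic_vad",
    "eagerness": "auto",
    "interrupt_response": True,
    "create_response": True,
}

# Assumption: flat `session.turn_detection` nesting; some API versions
# may expect this object elsewhere in the session config.
session_update = {
    "type": "session.update",
    "session": {"turn_detection": turn_detection},
}

payload = json.dumps(session_update)
print(payload)
```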
      

Attachments

semantic_vad_bug.csv

Rows labeled [deducted] (i.e., deduced) have no real ended_at value: it was manually inferred from the started_at timestamp of the next row.

"started_at","ended_at","item_content","item_id"
"2025-10-21 16:48:23.764045","2025-10-21 16:48:24.659627","[deducted]","item_CT9p2IYEWU0v1lp4ESMVp"
"2025-10-21 16:48:24.659627","2025-10-21 16:48:32.02996","[deducted]","item_CT9p3rlxmZ330QISk0A4O"
"2025-10-21 16:48:32.02996","2025-10-21 16:48:37.596974","Absolutely, thank you for having me. I'm curious about the main goals of this interview and what kind of insights you're hoping to gain. Also, is there anything specific you'd like me to focus on or any particular areas you're most interested in?","item_CT9pA4OGCdcbjle1UgRqw"
"2025-10-21 16:49:06.868876","2025-10-21 16:49:12.085874","[deducted]","item_CT9pj2D8kFKsZdTeeuG6J"
"2025-10-21 16:49:12.085874","2025-10-21 16:49:27.093721","Absolutely, the last time I used a travel app was during a trip to Tokyo. I relied heavily on a navigation app to find public transportation routes and local attractions. I also used a restaurant recommendation app to find places to eat based on my preferences. Overall, it was super helpful in making the trip smooth and enjoyable.","item_CT9powlFFiuniaJV36nyX"
"2025-10-21 16:49:42.453607","2025-10-21 16:50:05.400391","What really stood out was the personalized recommendations. The apps seemed to understand my preferences, like my dietary restrictions and the type of cuisine I enjoy. Also, the real-time updates on public transportation and the ease of booking tickets directly through the app made everything seamless. So, overall, it was the convenience and the intuitive interface that really made the difference.","item_CT9qJLQIXDf93j0ZjPKVS"
"2025-10-21 16:50:22.682255","2025-10-21 16:50:23.341628","[deducted]","item_CT9qxEshZ09AdSBNdh117"
"2025-10-21 16:50:23.341628","2025-10-21 16:50:24.699916","[deducted]","item_CT9qyxCFUX64yLmt6Zomi"
"2025-10-21 16:50:24.699916","2025-10-21 16:50:28.342011","[deducted]","item_CT9qzYssPFAWA6bK90B2T"
"2025-10-21 16:50:28.342011","2025-10-21 16:50:31.315109","[deducted]","item_CT9r3Dur6VqUNQKSn3XM2"
"2025-10-21 16:50:31.315109","2025-10-21 16:50:41.973712","[deducted]","item_CT9r6jgFDjaecn1Zbs63N"
"2025-10-21 16:50:41.973712","2025-10-21 16:50:43.335495","personalization even further, perhaps by learning from past trips and adapting to evolving preferences. It could also offer more proactive suggestions, like alerting me to local events or hidden gems that match my interests. Also, having seamless integration with other services like accommodations","item_CT9rG2UAeeicgEol5xwhb"
"2025-10-21 16:50:42.905257","2025-10-21 16:50:48.059242","Real-time language translation would make it indispensable and truly holistic.","item_CT9rHptB2hxp53ZC9Mt6p"
"2025-10-21 16:51:03.344939","2025-10-21 16:51:04.914048","[deducted]","item_CT9rcfs0MQaai73d4SyjV"
"2025-10-21 16:51:04.914048","2025-10-21 16:51:07.283006","[deducted]","item_CT9rdMXhPgxMyuWY9B5Sl"
"2025-10-21 16:51:07.283006","2025-10-21 16:51:30.073521","[deducted]","item_CT9rgH1AY9Il0pa26UpRP"
"2025-10-21 16:51:30.073521","2025-10-21 16:51:35.987947","[deducted]","item_CT9s2o4jzZpXKFKEQrjDD"
"2025-10-21 16:51:35.987947","2025-10-21 16:51:37.512158","Absolutely. For instance, during my trip to Tokyo, there was one evening when I was looking for a dinner spot, and I had a few places in mind, but it would have been amazing if an AI companion could have suggested a nearby event, like a local festival or a pop-up market that was happening at that exact time. It would have made the experience more immersive. Another example would be if it could integrate with transportation services to suggest alternative routes in real-time if there were delays. Those kinds of practical touches would definitely enhance the trip.","item_CT9s8LduGQDszI3R2V7Sr"
"2025-10-21 16:51:51.826432","2025-10-21 16:51:53.115486","Ka pai.","item_CT9sO5ErIxw5wLRzUlusB"
"2025-10-21 16:52:08.146165","2025-10-21 16:52:12.21159","No, I really have to go. Bye bye, have a great day.","item_CT9seRVp486hIVoyS29ca"
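To see how far the reported start of a final item drifts from the start of its burst, the rows above can be grouped programmatically. This is a minimal sketch, assuming that every run of consecutive [deducted] rows belongs to the next row that has a real transcription. The function name `group_bursts` and the output field names are hypothetical. The sample embeds four rows copied from the attachment.

```python
import csv
import io
from datetime import datetime

# %f accepts 1 to 6 fractional digits, so "32.02996" parses correctly.
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"

SAMPLE = '''\
"started_at","ended_at","item_content","item_id"
"2025-10-21 16:48:23.764045","2025-10-21 16:48:24.659627","[deducted]","item_CT9p2IYEWU0v1lp4ESMVp"
"2025-10-21 16:48:24.659627","2025-10-21 16:48:32.02996","[deducted]","item_CT9p3rlxmZ330QISk0A4O"
"2025-10-21 16:48:32.02996","2025-10-21 16:48:37.596974","Absolutely, thank you for having me. I'm curious about the main goals of this interview and what kind of insights you're hoping to gain. Also, is there anything specific you'd like me to focus on or any particular areas you're most interested in?","item_CT9pA4OGCdcbjle1UgRqw"
"2025-10-21 16:51:51.826432","2025-10-21 16:51:53.115486","Ka pai.","item_CT9sO5ErIxw5wLRzUlusB"
'''


def group_bursts(rows):
    """Attach each run of [deducted] rows to the next transcribed row."""
    bursts, pending = [], []
    for row in rows:
        if row["item_content"] == "[deducted]":
            pending.append(row)  # started, but never explicitly ended
            continue
        first = pending[0] if pending else row
        first_start = datetime.strptime(first["started_at"], TS_FORMAT)
        final_start = datetime.strptime(row["started_at"], TS_FORMAT)
        bursts.append({
            "final_item_id": row["item_id"],
            "unended_items": len(pending),
            "first_started_at": first_start,
            # Gap between the burst's first start and the final item's start.
            "start_offset_s": (final_start - first_start).total_seconds(),
        })
        pending = []
    return bursts


bursts = group_bursts(csv.DictReader(io.StringIO(SAMPLE)))
for burst in bursts:
    print(burst["final_item_id"], burst["unended_items"], burst["start_offset_s"])
```

For the first burst in this sample, the final item starts roughly 8.27 seconds after the first speech_started of the burst, with two unended items before it.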


Experiencing the same issue here!