Summary:
When using semantic VAD with the Realtime API, multiple consecutive speech_started events are emitted, but only the last one in the burst is followed by a corresponding speech_ended event. This results in merged transcriptions under the final item, even though the earlier speech_started items never receive a speech_ended signal.
Observed Behavior
- Multiple speech_started events occur in rapid succession.
- Each speech_started event has a unique item ID.
- Only the final speech_started event receives a matching speech_ended event.
- The transcription for the final item includes text that originated from all prior unended segments.
- The speech_start_time of that final item does not match the true start of the first utterance, even though it contains the earlier speech content.
A detailed sample is attached in semantic_vad_bug.csv.
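One way to confirm this pattern from a captured event log is to track which item IDs were opened by a speech_started event and never closed. The sketch below is not tied to the SDK. It assumes the events have already been converted to plain dicts with type and item_id keys, and it matches on the event-name suffix so that either the speech_ended shorthand used in this post or the Realtime API name input_audio_buffer.speech_stopped counts as an end. The function name find_unended_items and the sample item IDs are hypothetical.

```python
from typing import Iterable


def find_unended_items(events: Iterable[dict]) -> list[str]:
    """Return item IDs that got a speech_started event but no matching end event."""
    open_items: dict[str, None] = {}  # a dict preserves insertion order
    for event in events:
        event_type = event.get("type", "")
        item_id = event.get("item_id")
        if item_id is None:
            continue  # not a VAD event tied to an item
        if event_type.endswith("speech_started"):
            open_items[item_id] = None
        elif event_type.endswith(("speech_stopped", "speech_ended")):
            open_items.pop(item_id, None)
    return list(open_items)


# Hypothetical log shaped like the burst described above:
# three starts in a row, only the last one is closed.
sample_events = [
    {"type": "input_audio_buffer.speech_started", "item_id": "item_A"},
    {"type": "input_audio_buffer.speech_started", "item_id": "item_B"},
    {"type": "input_audio_buffer.speech_started", "item_id": "item_C"},
    {"type": "input_audio_buffer.speech_stopped", "item_id": "item_C"},
]
print(find_unended_items(sample_events))  # ['item_A', 'item_B']
```

Running this over a full session log gives the list of items that would need an inferred end time.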
Expected Behavior
- Each speech_started event should be paired with a corresponding speech_ended event.
- Transcriptions should be correctly segmented per item ID and timestamp.
- The speech_start_time should reflect the actual start of the speech segment.
Notes
- This issue only occurs when using semantic VAD.
- With server VAD, the pairing of speech_started and speech_ended events is consistent and accurate.
Questions
- Is this behavior expected due to semantic grouping logic in semantic VAD?
- Or is there a different handling pattern required for semantic VAD to synchronize speech_started / speech_ended events properly?
Environment
- Python: 3.11
- OpenAI SDK: openai = "^1.107.3"
- Runtime: asyncio
- API: Realtime API with semantic VAD enabled
- eagerness="auto", interrupt_response=True, create_response=True
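For reference, the settings listed above would roughly correspond to a session.update payload like the one sketched here. This assumes the turn_detection shape sits directly under session; the exact nesting depends on the API version in use (newer versions may place it under the audio input configuration), so treat the structure as an assumption. The helper name build_session_update is hypothetical.

```python
import json


def build_session_update() -> dict:
    """Build a session.update event enabling semantic VAD with the reported settings."""
    return {
        "type": "session.update",
        "session": {
            # Assumed location of the VAD config; may differ by API version.
            "turn_detection": {
                "type": "semantic_vad",
                "eagerness": "auto",
                "create_response": True,
                "interrupt_response": True,
            }
        },
    }


print(json.dumps(build_session_update(), indent=2))
```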
Attachments
semantic_vad_bug.csv
In rows whose item_content is [deducted], the ended_at value was manually inferred from the started_at timestamp of the following row.
"started_at","ended_at","item_content","item_id"
"2025-10-21 16:48:23.764045","2025-10-21 16:48:24.659627","[deducted]","item_CT9p2IYEWU0v1lp4ESMVp"
"2025-10-21 16:48:24.659627","2025-10-21 16:48:32.02996","[deducted]","item_CT9p3rlxmZ330QISk0A4O"
"2025-10-21 16:48:32.02996","2025-10-21 16:48:37.596974","Absolutely, thank you for having me. I'm curious about the main goals of this interview and what kind of insights you're hoping to gain. Also, is there anything specific you'd like me to focus on or any particular areas you're most interested in?","item_CT9pA4OGCdcbjle1UgRqw"
"2025-10-21 16:49:06.868876","2025-10-21 16:49:12.085874","[deducted]","item_CT9pj2D8kFKsZdTeeuG6J"
"2025-10-21 16:49:12.085874","2025-10-21 16:49:27.093721","Absolutely, the last time I used a travel app was during a trip to Tokyo. I relied heavily on a navigation app to find public transportation routes and local attractions. I also used a restaurant recommendation app to find places to eat based on my preferences. Overall, it was super helpful in making the trip smooth and enjoyable.","item_CT9powlFFiuniaJV36nyX"
"2025-10-21 16:49:42.453607","2025-10-21 16:50:05.400391","What really stood out was the personalized recommendations. The apps seemed to understand my preferences, like my dietary restrictions and the type of cuisine I enjoy. Also, the real-time updates on public transportation and the ease of booking tickets directly through the app made everything seamless. So, overall, it was the convenience and the intuitive interface that really made the difference.","item_CT9qJLQIXDf93j0ZjPKVS"
"2025-10-21 16:50:22.682255","2025-10-21 16:50:23.341628","[deducted]","item_CT9qxEshZ09AdSBNdh117"
"2025-10-21 16:50:23.341628","2025-10-21 16:50:24.699916","[deducted]","item_CT9qyxCFUX64yLmt6Zomi"
"2025-10-21 16:50:24.699916","2025-10-21 16:50:28.342011","[deducted]","item_CT9qzYssPFAWA6bK90B2T"
"2025-10-21 16:50:28.342011","2025-10-21 16:50:31.315109","[deducted]","item_CT9r3Dur6VqUNQKSn3XM2"
"2025-10-21 16:50:31.315109","2025-10-21 16:50:41.973712","[deducted]","item_CT9r6jgFDjaecn1Zbs63N"
"2025-10-21 16:50:41.973712","2025-10-21 16:50:43.335495","personalization even further, perhaps by learning from past trips and adapting to evolving preferences. It could also offer more proactive suggestions, like alerting me to local events or hidden gems that match my interests. Also, having seamless integration with other services like accommodations","item_CT9rG2UAeeicgEol5xwhb"
"2025-10-21 16:50:42.905257","2025-10-21 16:50:48.059242","Real-time language translation would make it indispensable and truly holistic.","item_CT9rHptB2hxp53ZC9Mt6p"
"2025-10-21 16:51:03.344939","2025-10-21 16:51:04.914048","[deducted]","item_CT9rcfs0MQaai73d4SyjV"
"2025-10-21 16:51:04.914048","2025-10-21 16:51:07.283006","[deducted]","item_CT9rdMXhPgxMyuWY9B5Sl"
"2025-10-21 16:51:07.283006","2025-10-21 16:51:30.073521","[deducted]","item_CT9rgH1AY9Il0pa26UpRP"
"2025-10-21 16:51:30.073521","2025-10-21 16:51:35.987947","[deducted]","item_CT9s2o4jzZpXKFKEQrjDD"
"2025-10-21 16:51:35.987947","2025-10-21 16:51:37.512158","Absolutely. For instance, during my trip to Tokyo, there was one evening when I was looking for a dinner spot, and I had a few places in mind, but it would have been amazing if an AI companion could have suggested a nearby event, like a local festival or a pop-up market that was happening at that exact time. It would have made the experience more immersive. Another example would be if it could integrate with transportation services to suggest alternative routes in real-time if there were delays. Those kinds of practical touches would definitely enhance the trip.","item_CT9s8LduGQDszI3R2V7Sr"
"2025-10-21 16:51:51.826432","2025-10-21 16:51:53.115486","Ka pai.","item_CT9sO5ErIxw5wLRzUlusB"
"2025-10-21 16:52:08.146165","2025-10-21 16:52:12.21159","No, I really have to go. Bye bye, have a great day.","item_CT9seRVp486hIVoyS29ca"

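The gap between the reported start of a transcribed item and the true start of its burst can be computed directly from the attachment. The sketch below assumes that every run of [deducted] rows belongs to the next row that carries a transcription, which matches the behavior described in this report but is not confirmed by the API. The sample embeds five rows copied from the CSV (not contiguous in the original), and the function name summarize_bursts is hypothetical.

```python
import csv
import io
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"  # %f accepts one to six fractional digits

SAMPLE_CSV = '''\
"started_at","ended_at","item_content","item_id"
"2025-10-21 16:48:23.764045","2025-10-21 16:48:24.659627","[deducted]","item_CT9p2IYEWU0v1lp4ESMVp"
"2025-10-21 16:48:24.659627","2025-10-21 16:48:32.02996","[deducted]","item_CT9p3rlxmZ330QISk0A4O"
"2025-10-21 16:48:32.02996","2025-10-21 16:48:37.596974","Absolutely, thank you for having me. I'm curious about the main goals of this interview and what kind of insights you're hoping to gain. Also, is there anything specific you'd like me to focus on or any particular areas you're most interested in?","item_CT9pA4OGCdcbjle1UgRqw"
"2025-10-21 16:51:51.826432","2025-10-21 16:51:53.115486","Ka pai.","item_CT9sO5ErIxw5wLRzUlusB"
"2025-10-21 16:52:08.146165","2025-10-21 16:52:12.21159","No, I really have to go. Bye bye, have a great day.","item_CT9seRVp486hIVoyS29ca"
'''


def summarize_bursts(csv_text: str) -> list[dict]:
    """For each transcribed row, count preceding unended rows and the start offset."""
    bursts = []
    pending = []  # start times of [deducted] rows waiting for a transcribed row
    for row in csv.DictReader(io.StringIO(csv_text)):
        started = datetime.strptime(row["started_at"], TS_FORMAT)
        if row["item_content"] == "[deducted]":
            pending.append(started)
            continue
        first_start = pending[0] if pending else started
        bursts.append({
            "item_id": row["item_id"],
            "unended_items": len(pending),
            # Seconds between the burst's first speech_started and the reported start.
            "start_offset_s": (started - first_start).total_seconds(),
        })
        pending = []
    return bursts


for burst in summarize_bursts(SAMPLE_CSV):
    print(burst)
```

For the first burst in the sample this reports two unended items and an offset of about 8.27 seconds between the first speech_started event and the start time attached to the transcribed item.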