We are testing GPT-4o-transcribe with websocket realtime transcription. Recognition is fine, but the conversation.item.input_audio_transcription.delta events are all received only at the end of the turn, as a batch of events delivered all at once.
Is it possible to configure the transcription so that we receive the deltas on an ongoing basis?
Could turn detection, or some other configuration, cause this “queuing” of messages?
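For reference, our setup is roughly the sketch below: a transcription_session.update targeting gpt-4o-transcribe with server-side turn detection. This is a minimal sketch, not our exact configuration, and the field values shown are illustrative.

```python
import json

# Rough sketch of the transcription session configuration in question;
# values are illustrative, not our exact settings.
session_update = {
    "type": "transcription_session.update",
    "session": {
        "input_audio_format": "pcm16",
        "input_audio_transcription": {
            "model": "gpt-4o-transcribe",
            "language": "en",
        },
        # Server-side turn detection: the commit is generated automatically
        # when the VAD detects the end of speech.
        "turn_detection": {
            "type": "server_vad",
            "silence_duration_ms": 500,
        },
    },
}

# ws is an already-open websocket to the realtime transcription endpoint,
# e.g. wss://api.openai.com/v1/realtime?intent=transcription
# await ws.send(json.dumps(session_update))
```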
Unfortunately, delta messages are not of much use with transcription. The endpoint starts the transcription only when it receives the commit (either manual, or auto-generated in the case of server VAD or semantic VAD). When it receives the commit, it transcribes the audio buffer in its entirety, so you get the completed message as well as all of the delta messages at the same time.
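You can check this yourself by timestamping the incoming events: with gpt-4o-transcribe, the deltas and the completed event for a turn show up within milliseconds of each other, right after the commit. A minimal sketch, assuming an already-open websocket `ws` that yields JSON event frames; the event type and field names are the documented ones.

```python
import json
import time

async def log_transcription_events(ws):
    # Print the arrival time of each transcription event to confirm that the
    # deltas and the completed transcript arrive together, after the commit.
    async for raw in ws:
        event = json.loads(raw)
        etype = event.get("type", "")
        if etype == "conversation.item.input_audio_transcription.delta":
            print(f"{time.monotonic():.3f} delta: {event.get('delta', '')!r}")
        elif etype == "conversation.item.input_audio_transcription.completed":
            print(f"{time.monotonic():.3f} completed: {event.get('transcript', '')!r}")
```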
Hey Mathi, thanks for the feedback. If the transcription is done all at once during the commit, what is the purpose of the delta messages?
We are currently using “delta” messages from AssemblyAI and Azure Speech to Text to indicate to the user when speech is being detected. Is there a way to receive some realtime data that we can use to trigger an animation whenever speech is detected?
The way delta messages currently work, I don’t think they are all that useful for transcription. They’re likely keeping the same API interaction pattern as the other realtime endpoints (text, audio, etc.), where delta messages probably do arrive earlier.
Facing the same issue. Everything described above by Mathi is correct. A somewhat hacky solution is described below.
Setup:
OpenAI realtime with Azure (v1 endpoint), targeting gpt-4o-transcribe.
No server_vad enabled, as I have my own VAD (SileroV5). As mentioned, I must send “commit” messages myself.
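For context, with server_vad disabled the client has to drive the input buffer itself. Roughly, that looks like the sketch below; it uses the standard input_audio_buffer.append / input_audio_buffer.commit client events, and the helper names are my own.

```python
import base64
import json

async def disable_server_vad(ws):
    # Turn off server-side turn detection; commits must then come from the client.
    await ws.send(json.dumps({
        "type": "transcription_session.update",
        "session": {
            "input_audio_transcription": {"model": "gpt-4o-transcribe"},
            "turn_detection": None,
        },
    }))

async def append_audio(ws, pcm16_bytes: bytes):
    # Append a chunk of raw PCM16 audio (e.g. from my SileroV5 VAD pipeline).
    await ws.send(json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm16_bytes).decode("ascii"),
    }))

async def commit_audio(ws):
    # Transcription of the buffered audio only starts once this is sent.
    await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
```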
Problem:
Deltas are currently irrelevant. GPT-4o-transcribe, even with the Realtime endpoint, does not start transcription before the commit message is sent. Therefore, speech with no long pauses (that is, speech where my VAD does not detect a turn change) results in transcription query times exceeding 5 s (bad for real-time use cases).
I believe the same applies when using the server_vad option: the server’s VAD does not recognize a turn change, and thus no automatic commit is sent before the audio buffer becomes large. [Correct me if I’m wrong on this one, I have not tested this exact case.]
Depending on the use case, this scenario of a user rambling for long periods of time may be infrequent, so it might not be a problem for most developers.
Hacky solution:
Send intermediate commits. If a user has been rambling for more than X seconds (no breaks in speech), send the commit message, restart the transcription, and repeat (see the sketch at the end of this post).
Moreover, the prompt can be changed during these stops and starts to help guide the transcription (with the previous transcriptions). Obviously, this will affect word error rates, but that is the compromise for the increased speed.
Finally, selecting X = 45 seconds seems to keep the processing time below 2 s.
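For illustration, the intermediate-commit logic is roughly the sketch below. The class and helper names are my own, and the prompt carry-over is the compromise described above: it forces a commit once uninterrupted speech exceeds X seconds and seeds the next segment’s prompt with what has already been transcribed.

```python
import json
import time

MAX_SEGMENT_SECONDS = 45  # the "X" above; observed to keep latency below ~2 s

class IntermediateCommitter:
    """Force a commit when uninterrupted speech runs longer than X seconds."""

    def __init__(self, ws):
        self.ws = ws
        self.segment_start = None
        self.previous_transcript = ""

    async def on_speech_chunk(self):
        # Called for every audio chunk while the local VAD reports ongoing
        # speech, after the chunk has been appended to the input buffer.
        now = time.monotonic()
        if self.segment_start is None:
            self.segment_start = now
        if now - self.segment_start > MAX_SEGMENT_SECONDS:
            await self._force_commit()

    async def on_transcript(self, transcript: str):
        # Called from the transcription.completed handler; kept to seed the prompt.
        self.previous_transcript += " " + transcript

    async def _force_commit(self):
        # Commit mid-utterance so transcription starts without waiting for a pause.
        await self.ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
        # Guide the next segment with what has already been transcribed
        # (trades some accuracy for latency, as noted above).
        await self.ws.send(json.dumps({
            "type": "transcription_session.update",
            "session": {
                "input_audio_transcription": {
                    "model": "gpt-4o-transcribe",
                    "prompt": self.previous_transcript[-800:],
                },
            },
        }))
        self.segment_start = None
```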