As mentioned here, you can leave out the "enabled" key, which resolved it for some users. For me, however, this didn't work, but maybe you will have more luck. I really need the transcription as well.
Same problem: "transcript" is empty (tested with WebRTC on 14/02/2025).
I have put this in the session creation:
"input_audio_transcription": {
  "model": "whisper-1",
  "language": "fr"
},
In the "conversation.item.created" message, I have:
Click on the demo there, check "Transcribe User Audio", and talk; you'll see events come back with transcriptions.
A couple things that I have noticed along the way:
Make sure to include the input_audio_transcription field in the session request you use to get the ephemeral token. If you do not set it there, you have to send a separate session.update client event afterwards to enable transcriptions. That follow-up session.update event does work; I use it in the example, in fact (a rough sketch of both approaches is below, after this list).
Be careful about background noise. Sometimes the Realtime API will respond to speech, but the transcriptions are wrong or blank because the Whisper-1 model used for transcriptions doesn't interpret the speech the same way that the Realtime API model does.
I've only really tried this in English, although the transcriptions are really just from Whisper, so anything that works with Whisper should work for transcriptions with the Realtime API too.
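For what it's worth, here is a rough sketch of both approaches. The sessions endpoint, request body, and client_secret shape reflect my understanding of the current API and may need adjusting; dataChannel stands for the WebRTC data channel you open for Realtime events.

  // Option 1: enable transcription when minting the ephemeral token
  // (server side, inside an async function).
  const resp = await fetch("https://api.openai.com/v1/realtime/sessions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-4o-realtime-preview-2024-12-17",
      input_audio_transcription: { model: "whisper-1" },
    }),
  });
  const session = await resp.json();
  // session.client_secret.value is the ephemeral token handed to the browser.

  // Option 2: enable it after the fact with a session.update client event
  // sent over the data channel once it opens.
  dataChannel.addEventListener("open", () => {
    dataChannel.send(JSON.stringify({
      type: "session.update",
      session: {
        input_audio_transcription: { model: "whisper-1" },
      },
    }));
  });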
Anyone having issues with the Assistant (AI) transcript?
I'm using the Realtime API with modalities: ["text", "audio"] and sending a session.update immediately after the data channel opens to confirm the modalities (a sketch of this wiring is below, after my questions).
The session is created successfully, with both text and audio modalities confirmed in the payload, but during the session:
I only receive audio events (response.audio.done, etc.).
I never receive any response.text.delta, response.text.done or response.output_item.added events containing assistant text.
This happens even when the AI says full sentences, not just tiny utterances.
No response.content_part.added or text delta events either.
I've checked everything on my end: the connection is healthy and stays open.
session.update is acknowledged successfully.
Model used: gpt-4o-realtime-preview-2024-12-17. Prompts are simple and clean. This happens consistently across dozens of sessions.
questions:
Is this a known issue?
Are there any specific conditions under which the Realtime API would suppress text output entirely while streaming audio? For example, does function calling block assistant transcripts from coming in?
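For reference, a minimal sketch of the setup described above: the session.update sent once the data channel opens, and a listener for the assistant text events that never arrive. The event names are the ones mentioned in this post; the handler wiring itself is assumed, not the actual code.

  dataChannel.addEventListener("open", () => {
    // Confirm both modalities right after the channel opens.
    dataChannel.send(JSON.stringify({
      type: "session.update",
      session: { modalities: ["text", "audio"] },
    }));
  });

  dataChannel.addEventListener("message", (e) => {
    const event = JSON.parse(e.data);
    switch (event.type) {
      case "response.audio.done":
        // Audio events like this arrive as expected.
        break;
      case "response.text.delta":
      case "response.text.done":
      case "response.output_item.added":
      case "response.content_part.added":
        // These are the assistant-text events that never show up in the sessions described above.
        console.log("assistant text event:", event);
        break;
    }
  });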
Hey Scott, I went through your repo and it looks like you are just doing a session.update to get user input audio transcriptions. However, I still am not able to get it to work. My conversation items have a null transcript and I never receive "conversation.item.input_audio_transcription.completed" messages from the server. Any suggestions?
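For reference, a listener along these lines is what should pick that event up (a sketch, not the exact code; the transcript field name is my assumption from the event shape):

  dataChannel.addEventListener("message", (e) => {
    const event = JSON.parse(e.data);
    // Should fire once Whisper finishes transcribing a user audio item;
    // when transcription isn't enabled in the session, this never arrives
    // and conversation.item.created carries a null transcript instead.
    if (event.type === "conversation.item.input_audio_transcription.completed") {
      console.log("user transcript:", event.transcript);
    }
  });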