A little update: after experimenting with codecs, it now looks like I can submit g711_alaw audio even though session.updated still reports pcm16. So session.update seems to work as expected (the codec is changed); only the format echoed back in session.updated is not refreshed and still shows the original value.
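For anyone who wants to check this on their own connection, here is a minimal sketch that builds the session.update payload and compares the requested format against what session.updated echoes back. The helper names (build_session_update, format_mismatch) are made up for illustration, and it assumes the session.updated event carries the format under session.input_audio_format, as the events in this thread suggest.

```python
import json


def build_session_update(audio_format: str) -> str:
    """Serialize a session.update event that requests the given input codec."""
    return json.dumps(
        {"type": "session.update", "session": {"input_audio_format": audio_format}}
    )


def format_mismatch(requested: str, session_updated_event: dict) -> bool:
    """Return True when session.updated reports a different format than requested.

    Assumption: the echoed format lives at event["session"]["input_audio_format"].
    """
    reported = session_updated_event.get("session", {}).get("input_audio_format")
    return reported != requested


# Example: the situation described above (requested g711_alaw, echo says pcm16).
echo = {"type": "session.updated", "session": {"input_audio_format": "pcm16"}}
if format_mismatch("g711_alaw", echo):
    print("session.updated still reports:", echo["session"]["input_audio_format"])
```

A mismatch here only tells you what the server echoed; per the observation above, the codec may still have been switched.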
Yeah, same here. Audio input transcriptions still aren’t working. I’m using an integration with Twilio, and I wonder if that’s part of the problem? I’d be interested to know what the use cases are for the people it is working for.
Today some magic happened and I got a transcription.
I’m using Java to connect to the Realtime API.
I also have a web client that uses wav_recorder.js from the OpenAI example. Before that I had a different recorder implementation. I think the key is to record with a sampleRate of 24000.
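To make the sample-rate point concrete, here is a rough sketch of packing recorder output into the pcm16 payload. It assumes the recorder hands you mono float samples in [-1.0, 1.0] already captured at 24000 Hz, and that audio is sent as base64 in the "audio" field of an input_audio_buffer.append event. The function names are hypothetical, not taken from wav_recorder.js.

```python
import base64
import json
import struct

SAMPLE_RATE = 24000  # assumed capture rate for pcm16 input, per the post above


def floats_to_pcm16(samples):
    """Convert float samples in [-1.0, 1.0] to 16-bit little-endian PCM bytes."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(s * 32767) for s in clipped]
    return struct.pack("<%dh" % len(ints), *ints)


def build_append_event(samples) -> str:
    """Wrap PCM16 audio in an input_audio_buffer.append event (base64 payload)."""
    audio_b64 = base64.b64encode(floats_to_pcm16(samples)).decode("ascii")
    return json.dumps({"type": "input_audio_buffer.append", "audio": audio_b64})


def duration_seconds(pcm_bytes: bytes) -> float:
    """Duration of mono PCM16 audio at SAMPLE_RATE (2 bytes per sample)."""
    return len(pcm_bytes) / (2 * SAMPLE_RATE)
```

If the recorder runs at a different rate (44100 or 48000 are common browser defaults), the audio would need resampling first; this sketch does not do that.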
Here is my session.update. I’m not sure, but I think “turn_detection”:null is also important if you are not using server VAD.
{"type":"session.update","session":{"modalities":["text","audio"],"input_audio_format":"pcm16","instructions":"Make transcription from my speech","turn_detection":null,"input_audio_transcription":{"model":"whisper-1"}}}
This is my response.create
{"response":{"instructions":"Make transcription from my speech","modalities":["text"]},"type":"response.create"}
And then I got this
{"type":"conversation.item.input_audio_transcription.completed","event_id":"event_ALCCySqEmJFHCYcetC5Ct","item_id":"item_ALCCwjSTw6cvGdZmpyQhC","content_index":0,"transcript":"Hello, how are you?\n"}
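Putting the pieces together, this is roughly the order of client events for a manual turn (no server VAD) and how the transcript can be pulled out of the reply. It is a sketch under assumptions: with turn_detection set to null the client commits the buffer itself via input_audio_buffer.commit before sending response.create, and the audio chunks are already base64-encoded. The function names are made up for illustration.

```python
import json


def manual_turn_events(audio_chunks_b64):
    """Build the JSON messages for one manual turn: append(s), commit, response.create."""
    events = [
        {"type": "input_audio_buffer.append", "audio": chunk}
        for chunk in audio_chunks_b64
    ]
    # Without server VAD nothing ends the turn for us, so commit explicitly.
    events.append({"type": "input_audio_buffer.commit"})
    events.append({"type": "response.create", "response": {"modalities": ["text"]}})
    return [json.dumps(e) for e in events]


def extract_transcript(raw_message: str):
    """Return the transcript from a transcription.completed event, else None."""
    event = json.loads(raw_message)
    if event.get("type") == "conversation.item.input_audio_transcription.completed":
        return event.get("transcript", "").strip()
    return None
```

Each string from manual_turn_events would be sent as one WebSocket text message, and extract_transcript would be called on every incoming message.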
Well, I am also struggling with audio input transcriptions. I don’t use Twilio or any third-party integration. I use server VAD, and I don’t get an “Enabled” field in the session.updated JSON response. So, as someone said, I can confirm the documentation does not seem to be up to date (also for max_response_output_tokens, by the way).
Occasionally I do get a “conversation.item.input_audio_transcription.completed” event, but the transcript is totally wrong!