I managed to get the Realtime API working as a conversation pretty easily, but I cannot get the `conversation.item.input_audio_transcription.completed` event to fire. I don't get any events back while the user is talking other than `input_audio_buffer.speech_started`. Is this because the audio from `getUserMedia` is Opus? Do I need to convert it to PCM16 for the Whisper transcription to work? No errors are surfaced anywhere, so it's difficult to ascertain.
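If re-encoding is needed, I assume it would look something like this untested sketch: take Float32 samples (which I'd presumably grab from an AudioWorklet at 24 kHz), scale them to signed 16-bit little-endian PCM, then base64-encode and send an `input_audio_buffer.append` event. The `ws` here is a hypothetical WebSocket connection to the Realtime API, not code I currently have:

```js
// Untested sketch: convert Float32 samples to PCM16 (little-endian).
function floatTo16BitPCM(float32Samples) {
  const buffer = new ArrayBuffer(float32Samples.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < float32Samples.length; i++) {
    // Clamp to [-1, 1] and scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true); // true = little-endian
  }
  return buffer;
}

// Base64-encode and append to the input audio buffer.
// `ws` is a hypothetical WebSocket to the Realtime API.
function appendAudio(ws, float32Samples) {
  const pcm = floatTo16BitPCM(float32Samples);
  // Fine for small chunks; a chunked encoder would be needed for large buffers.
  const base64 = btoa(String.fromCharCode(...new Uint8Array(pcm)));
  ws.send(JSON.stringify({ type: 'input_audio_buffer.append', audio: base64 }));
}
```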
This is my session-creation code, which runs on the Node server that my React client uses to instantiate the session:
```js
await fetch('https://api.openai.com/v1/realtime/sessions', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'gpt-4o-realtime-preview-2024-12-17',
    instructions: 'You are a friendly chap called bob',
    modalities: ['audio', 'text'],
    input_audio_format: 'pcm16', // the possible options here imply it should be re-encoded
    input_audio_transcription: {
      model: 'whisper-1',
    },
    turn_detection: {
      type: 'server_vad',
      threshold: 0.5,
      prefix_padding_ms: 300,
      silence_duration_ms: 1000,
    },
    temperature: 0.8,
    voice: 'verse',
  }),
});
```
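For completeness, this is roughly how I'm listening for server events on the client; `dc` is the WebRTC data channel from my setup, and the event names and `transcript` field are taken from the Realtime API reference:

```js
// Listen for server events on the WebRTC data channel.
dc.addEventListener('message', (e) => {
  const event = JSON.parse(e.data);
  if (event.type === 'conversation.item.input_audio_transcription.completed') {
    // This is the event that never fires for me.
    console.log('User said:', event.transcript);
  } else if (event.type === 'input_audio_buffer.speech_started') {
    // This is the only event I currently receive while the user talks.
    console.log('Speech started');
  }
});
```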