I’m having trouble with the transcription feature while using OpenAI’s Realtime API with WebRTC for audio-to-audio communication. Although the session is configured to enable input audio transcription, the transcripts I receive are consistently null.
Configuration Details:
- Model: gpt-4o-mini-realtime-preview
- Session Initialization Parameters:
  {
    "model": "gpt-4o-mini-realtime-preview",
    "instructions": "Your prompt here",
    "modalities": ["audio", "text"],
    "input_audio_transcription": { "model": "whisper-1" },
    "voice": "alloy",
    "input_audio_format": "pcm16",
    "output_audio_format": "pcm16",
    "turn_detection": {
      "type": "server_vad",
      "threshold": 0.5,
      "prefix_padding_ms": 300,
      "silence_duration_ms": 200
    },
    "temperature": 0.8,
    "max_response_output_tokens": 10000
  }
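For anyone trying to reproduce this, here is a minimal sketch of how the transcription setting could be wrapped in a session.update client event before it is sent over the WebRTC data channel. The helper name `build_session_update` is hypothetical, the event shape (a `type` plus a `session` object) is my understanding of the Realtime API client events rather than something confirmed in this post, and the actual send call is omitted because it depends on the WebRTC library in use.

```python
import json


def build_session_update(transcription_model: str = "whisper-1") -> str:
    """Return a JSON string for a session.update event (assumed event shape)."""
    event = {
        "type": "session.update",
        "session": {
            "modalities": ["audio", "text"],
            # The setting under investigation: input audio transcription.
            "input_audio_transcription": {"model": transcription_model},
            "turn_detection": {
                "type": "server_vad",
                "threshold": 0.5,
                "prefix_padding_ms": 300,
                "silence_duration_ms": 200,
            },
        },
    }
    return json.dumps(event)


payload = build_session_update()
# Sanity check before sending: the transcription block survives serialization.
assert json.loads(payload)["session"]["input_audio_transcription"] == {"model": "whisper-1"}
```

Logging the serialized payload right before it is sent makes it easier to rule out a configuration that was silently dropped on the client side.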
Observed Behavior:
- Upon sending audio input, the conversation.item.created event is triggered with the following payload:
  {
    "type": "conversation.item.created",
    "event_id": "event_Aic5ksNwMPcAhZI5CbHDA",
    "previous_item_id": "item_Aic5bbkDBGqtOSoHFd9Hw",
    "item": {
      "id": "item_Aic5jI1it6HnAiKo6nSZ6",
      "object": "realtime.item",
      "type": "message",
      "status": "completed",
      "role": "user",
      "content": [ { "type": "input_audio", "transcript": null } ]
    }
  }
- The transcript field remains null, indicating that the transcription did not occur as expected.
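To make the observation above easy to check in logs, here is a small sketch that reports which input_audio content parts of a conversation.item.created event carry a null transcript. The function name `null_transcript_parts` is hypothetical, and the payload shape is taken directly from the event shown above.

```python
from typing import Any


def null_transcript_parts(event: dict[str, Any]) -> list[int]:
    """Return indexes of input_audio content parts whose transcript is None."""
    if event.get("type") != "conversation.item.created":
        return []
    content = event.get("item", {}).get("content", [])
    return [
        index
        for index, part in enumerate(content)
        if part.get("type") == "input_audio" and part.get("transcript") is None
    ]


# Trimmed copy of the payload from this post (JSON null becomes Python None).
sample_event = {
    "type": "conversation.item.created",
    "item": {
        "id": "item_Aic5jI1it6HnAiKo6nSZ6",
        "type": "message",
        "role": "user",
        "content": [{"type": "input_audio", "transcript": None}],
    },
}

print(null_transcript_parts(sample_event))  # prints [0]
```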
Troubleshooting Steps Taken:
- Audio Input Verification:
  - Confirmed that the audio input is in pcm16 format and adheres to the API’s specifications.
  - Tested the audio input with other transcription services to ensure its clarity and quality.
- Session Configuration Review:
  - Ensured that the input_audio_transcription parameter is correctly set to {"model": "whisper-1"} during session initialization.
- Event Monitoring:
  - Set up listeners for events such as conversation.item.input_audio_transcription.completed and conversation.item.input_audio_transcription.failed.
  - No transcription.failed events were received, and the transcription.completed events contain null transcripts.
- Rate Limits Check:
  - Monitored rate limit updates to ensure that the API usage is within allowed thresholds.
  - Sample log entry:
    {
      "type": "rate_limits.updated",
      "event_id": "event_Aic4uJyq7sQQHdZY1QBbP",
      "rate_limits": [
        { "name": "requests", "limit": 5000, "remaining": 4999, "reset_seconds": 0.012 },
        { "name": "tokens", "limit": 400000, "remaining": 394947, "reset_seconds": 0.757 }
      ]
    }
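The event-monitoring and rate-limit checks above can be sketched as a single handler for messages arriving on the data channel. This is only an illustration: the class name `EventLog`, the 5% threshold, and the assumption that the transcription events carry `item_id`, `transcript`, and `error` fields are mine, not confirmed in this post.

```python
import json
from dataclasses import dataclass, field

COMPLETED = "conversation.item.input_audio_transcription.completed"
FAILED = "conversation.item.input_audio_transcription.failed"


@dataclass
class EventLog:
    """Collects transcription results and rate-limit warnings per event."""

    transcripts: dict = field(default_factory=dict)  # item_id -> transcript (may be None)
    failures: dict = field(default_factory=dict)     # item_id -> error payload
    low_limits: list = field(default_factory=list)   # names of nearly exhausted limits

    def handle(self, raw: str) -> None:
        """Parse one raw JSON message and record anything relevant."""
        event = json.loads(raw)
        kind = event.get("type")
        if kind == COMPLETED:
            # Assumed field names: item_id and transcript.
            self.transcripts[event.get("item_id")] = event.get("transcript")
        elif kind == FAILED:
            # Assumed field names: item_id and error.
            self.failures[event.get("item_id")] = event.get("error")
        elif kind == "rate_limits.updated":
            for limit in event.get("rate_limits", []):
                # Flag a limit when less than 5% of it remains (arbitrary threshold).
                if limit["remaining"] < 0.05 * limit["limit"]:
                    self.low_limits.append(limit["name"])


log = EventLog()
# The rate_limits.updated entry from this post: neither limit is close to exhausted.
log.handle(json.dumps({
    "type": "rate_limits.updated",
    "rate_limits": [
        {"name": "requests", "limit": 5000, "remaining": 4999, "reset_seconds": 0.012},
        {"name": "tokens", "limit": 400000, "remaining": 394947, "reset_seconds": 0.757},
    ],
}))
print(log.low_limits)  # prints []
```

Keying the results by item id makes it possible to line up each transcription event with the conversation.item.created event for the same item when comparing logs.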
Additional Information:
- No errors or failure events are reported; the transcripts are simply null.
Request for Assistance:
I would appreciate any guidance or insights into resolving this transcription issue. Specifically:
- Are there additional configurations required to enable transcription in the Realtime API when using WebRTC?
- Are there known limitations or issues with the current Realtime API that could be causing this behavior?
Thank you for your support.