I’m having trouble with the transcription feature when using OpenAI’s Realtime API with WebRTC for audio-to-audio communication. Although the session is configured to enable audio transcription, the transcripts received are consistently null.
Configuration Details:
- Model: `gpt-4o-mini-realtime-preview`
- Session Initialization Parameters:
```json
{
  "model": "gpt-4o-mini-realtime-preview",
  "instructions": "Your prompt here",
  "modalities": ["audio", "text"],
  "input_audio_transcription": {
    "model": "whisper-1"
  },
  "voice": "alloy",
  "input_audio_format": "pcm16",
  "output_audio_format": "pcm16",
  "turn_detection": {
    "type": "server_vad",
    "threshold": 0.5,
    "prefix_padding_ms": 300,
    "silence_duration_ms": 200
  },
  "temperature": 0.8,
  "max_response_output_tokens": 10000
}
```
Observed Behavior:
- Upon sending audio input, the `conversation.item.created` event is triggered with the following payload:

```json
{
  "type": "conversation.item.created",
  "event_id": "event_Aic5ksNwMPcAhZI5CbHDA",
  "previous_item_id": "item_Aic5bbkDBGqtOSoHFd9Hw",
  "item": {
    "id": "item_Aic5jI1it6HnAiKo6nSZ6",
    "object": "realtime.item",
    "type": "message",
    "status": "completed",
    "role": "user",
    "content": [
      {
        "type": "input_audio",
        "transcript": null
      }
    ]
  }
}
```
- The `transcript` field remains `null`, indicating that the transcription did not occur as expected.
Troubleshooting Steps Taken:
- Audio Input Verification:
  - Confirmed that the audio input is in `pcm16` format and adheres to the API’s specifications.
  - Tested the audio input with other transcription services to ensure its clarity and quality.
- Session Configuration Review:
  - Ensured that the `input_audio_transcription` parameter is correctly set to `{"model": "whisper-1"}` during session initialization.
- Event Monitoring:
  - Set up listeners for events such as `conversation.item.input_audio_transcription.completed` and `conversation.item.input_audio_transcription.failed` (see the listener sketch after this list).
  - No `transcription.failed` events were received, and the `transcription.completed` events contain `null` transcripts.
- Rate Limits Check:
  - Monitored rate limit updates to ensure that API usage is within allowed thresholds.
  - Sample log entry:
```json
{
  "type": "rate_limits.updated",
  "event_id": "event_Aic4uJyq7sQQHdZY1QBbP",
  "rate_limits": [
    {
      "name": "requests",
      "limit": 5000,
      "remaining": 4999,
      "reset_seconds": 0.012
    },
    {
      "name": "tokens",
      "limit": 400000,
      "remaining": 394947,
      "reset_seconds": 0.757
    }
  ]
}
```
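For reference, here is a minimal sketch of the event monitoring described above, assuming a browser WebRTC setup where `dc` is the `RTCDataChannel` carrying Realtime API events (the variable name and handler structure are illustrative):

```ts
// Sketch: log transcription-related server events arriving on the data channel.
dc.addEventListener("message", (e: MessageEvent) => {
  const event = JSON.parse(e.data);

  switch (event.type) {
    case "conversation.item.input_audio_transcription.completed":
      // When transcription works, `transcript` is a string on this event.
      console.log("Transcript:", event.transcript);
      break;
    case "conversation.item.input_audio_transcription.failed":
      console.error("Transcription failed:", event.error);
      break;
    case "conversation.item.created":
      // `transcript` on input_audio content is typically still null here,
      // because transcription runs asynchronously after item creation.
      console.log("Item created:", JSON.stringify(event.item?.content));
      break;
  }
});
```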
Additional Information:
- No errors or failure events are reported; the transcripts are simply `null`.
Request for Assistance:
I would appreciate any guidance or insights into resolving this transcription issue. Specifically:
- Are there additional configurations required to enable transcription in the Realtime API when using WebRTC?
- Are there known limitations or issues with the current Realtime API that could be causing this behavior?
Thank you for your support.
I couldn’t get it working either. I ended up using Gladia for transcription, which was sufficient for my use case.
I also noticed that you don’t even get the same properties back after the create session request:
If I do `POST /v1/realtime/sessions` with this body:

```json
{
  "model": "gpt-4o-realtime-preview-2024-12-17",
  "input_audio_transcription": {
    "model": "whisper-1"
  }
}
```
then I do not get the same `input_audio_transcription` back in the response:

```json
{
  ...
  "input_audio_format": "pcm16",
  "output_audio_format": "pcm16",
  "input_audio_transcription": null,
  "tool_choice": "auto",
  "temperature": 0.8,
  ...
}
```
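As a quick way to reproduce this observation, here is a sketch (assuming `OPENAI_API_KEY` is available in the environment and a Node.js 18+ runtime with global `fetch`):

```ts
// Sketch: create a Realtime session and inspect the echoed transcription field.
const res = await fetch("https://api.openai.com/v1/realtime/sessions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "gpt-4o-realtime-preview-2024-12-17",
    input_audio_transcription: { model: "whisper-1" },
  }),
});

const session = await res.json();
// Per the observation above, this logs null despite being set in the request:
console.log(session.input_audio_transcription);
```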
Solution Found!
After some additional debugging and testing, I was able to resolve the issue with the transcription returning `null` in OpenAI’s Realtime API while using WebRTC for audio communication.
The Problem Recap:
Despite setting up the session with the `whisper-1` model for transcription in the initialization parameters, the transcripts consistently returned as `null`.
The Solution:
The issue was resolved by explicitly sending a `session.update` message after the WebRTC data channel opens (see the sketch after the payload below). This step ensures that the transcription model is correctly enabled during the session.
Updated Code Example:
```js
{
  'event_id': 'event_123',
  'type': 'session.update',
  'session': {
    'modalities': ['text', 'audio'],
    'instructions': 'prompt',
    'input_audio_format': 'pcm16',
    'output_audio_format': 'pcm16',
    // Explicitly enabling Whisper transcription model
    'input_audio_transcription': {
      'model': 'whisper-1',
    },
    'turn_detection': {
      'type': 'server_vad',
      'threshold': 0.5,
      'prefix_padding_ms': 300,
      'silence_duration_ms': 1000,
      'create_response': true,
    },
    'temperature': 0.8,
    // Correct max token setting
    'max_response_output_tokens': 10000,
  },
}
```
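A minimal sketch of the wiring, assuming `dc` is the `RTCDataChannel` created on the peer connection (the variable names are illustrative):

```ts
// Sketch: send session.update as soon as the data channel opens.
// `sessionUpdate` is the payload shown above.
dc.addEventListener("open", () => {
  dc.send(JSON.stringify(sessionUpdate));
});
```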
Key Notes:
- The `session.update` call is triggered after the data channel opens to explicitly enable transcription.
- The `input_audio_transcription` field is explicitly set with the `whisper-1` model to activate transcription.
- Fixed `max_response_output_tokens` to `10000` instead of `'inf'` to align with API standards.
Final Thoughts:
This approach resolved the issue, and transcripts started coming through as expected. I hope this helps anyone facing similar challenges!
Let me know if you need more details or clarification.
Thanks for the idea @paras_borad, but the documentation for the OpenAI Realtime session states:

> Maximum number of output tokens for a single assistant response, inclusive of tool calls. Provide an integer between 1 and 4096 to limit output tokens.

Also, I have tried `session.update` with the `event_id` I got from the `session.created` event, and I received an error. Here is my request body:
```json
{
  "model": "gpt-4o-mini-realtime-preview",
  "type": "session.update",
  "session": {
    "modalities": ["text", "audio"],
    "instructions": "prompt",
    "input_audio_format": "pcm16",
    "output_audio_format": "pcm16",
    "input_audio_transcription": {
      "model": "whisper-1"
    },
    "turn_detection": {
      "type": "server_vad",
      "threshold": 0.5,
      "prefix_padding_ms": 300,
      "silence_duration_ms": 1000,
      "create_response": true
    },
    "temperature": 0.8
  }
}
```

and this is the error response:

```json
{
  "error": {
    "message": "Unknown parameter: 'type'.",
    "type": "invalid_request_error",
    "param": "type",
    "code": "unknown_parameter"
  }
}
```
Can you please help me with this? Where did I make a mistake?
Hey!
The issue here is that the `model` parameter should not be included in the `session.update` request.
Just remove this part:
`"model": "gpt-4o-mini-realtime-preview",`
and try again.
For more details, you can check the documentation here.
Let me know if it works now!
Hey there, thank you for the response. I have made the changes you mentioned; this is the response:
```json
{
  "error": {
    "message": "Missing required parameter: 'model'.",
    "type": "invalid_request_error",
    "param": "model",
    "code": "missing_required_parameter"
  }
}
```
I have also tried without event_id; it gives me the same error. Can you please verify this?
Thanks. My initial `turn_detection` config was not being applied either. Running `session.update` with the expected config after the data channel opens solved it.
Hi @maybegrt,
Could you please give me an overview of how you implement `session.update` once the data channel opens? When I tried to do it using OpenAI’s `/realtime/sessions` API, it kept throwing an error. Additionally, it doesn’t seem to recognize `"type": "session.update"` when I attempt to update the session.
Your help would mean a lot. Thank you!
It seems that you are trying to use the `session.update` API call, but based on your provided example, you are attempting to create a session rather than update an existing one.
Important Notes:
- Session Update (`session.update`) only works when a session is already open and active.
- You cannot use the POST request shown in your example to update a session; `session.update` only works within an existing data channel.
- If your goal is to create a new session, you should use the session create API instead.
To create a new session, follow the Session Create API documentation.
To update an existing session, check the Session Update API documentation.
Key Fix:
If you need to create a session, ensure you pass the required `model` parameter explicitly in the body, as shown in the Session Create API example.
Example for Session Create:
`POST https://api.openai.com/v1/realtime/sessions`

```json
{
  "model": "gpt-4o-realtime-preview-2024-12-17",
  "instructions": "Your instructions here",
  "modalities": ["audio", "text"],
  "temperature": 0.8
}
```
If you already have a session open and want to update it, send the `session.update` event over the open data channel rather than calling another HTTP endpoint, as in the sketch below.
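To make the distinction concrete, here is a rough sketch of both steps (assuming the WebRTC setup from earlier in the thread; `dc` and the payloads are illustrative):

```ts
// Step 1: create the session over HTTP -- `model` is required here.
const res = await fetch("https://api.openai.com/v1/realtime/sessions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ model: "gpt-4o-realtime-preview-2024-12-17" }),
});
const session = await res.json();

// Step 2: update the session over the already-open data channel.
// session.update is a client event, not an HTTP endpoint, so there is
// no `model` parameter and no POST request involved.
dc.send(
  JSON.stringify({
    type: "session.update",
    session: {
      input_audio_transcription: { model: "whisper-1" },
    },
  })
);
```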
Let me know if you need further clarification!
Thank you @paras_borad for such a detailed response; I have finally solved the issue. Thank you for your time.