Issues with Transcription in Realtime Model Using WebRTC

I’m having trouble with the transcription feature while using OpenAI’s Realtime API over WebRTC for audio-to-audio communication. Despite configuring the session to enable input audio transcription, the transcripts received are consistently null.

Configuration Details:

  • Model: gpt-4o-mini-realtime-preview
  • Session Initialization Parameters:
    {
      "model": "gpt-4o-mini-realtime-preview",
      "instructions": "Your prompt here",
      "modalities": ["audio", "text"],
      "input_audio_transcription": {
        "model": "whisper-1"
      },
      "voice": "alloy",
      "input_audio_format": "pcm16",
      "output_audio_format": "pcm16",
      "turn_detection": {
        "type": "server_vad",
        "threshold": 0.5,
        "prefix_padding_ms": 300,
        "silence_duration_ms": 200
      },
      "temperature": 0.8,
      "max_response_output_tokens": 10000
    }
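
For reference, the session itself is created with an HTTP POST before the WebRTC handshake. Below is a minimal TypeScript sketch of that step, assuming the documented ephemeral-token flow (sessionConfig stands in for the JSON above):

// A sketch of the create step, assuming Node 18+ and OPENAI_API_KEY in the
// environment. sessionConfig stands for the initialization JSON shown above.
declare const sessionConfig: Record<string, unknown>;

const resp = await fetch("https://api.openai.com/v1/realtime/sessions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify(sessionConfig),
});
const session = await resp.json();
// session.client_secret.value is the ephemeral key the browser then uses
// to authenticate the WebRTC SDP exchange.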
    

Observed Behavior:

  • Upon sending audio input, the conversation.item.created event is triggered with the following payload:
    {
      "type": "conversation.item.created",
      "event_id": "event_Aic5ksNwMPcAhZI5CbHDA",
      "previous_item_id": "item_Aic5bbkDBGqtOSoHFd9Hw",
      "item": {
        "id": "item_Aic5jI1it6HnAiKo6nSZ6",
        "object": "realtime.item",
        "type": "message",
        "status": "completed",
        "role": "user",
        "content": [
          {
            "type": "input_audio",
            "transcript": null
          }
        ]
      }
    }
    
  • The transcript field remains null, indicating that the transcription did not occur as expected.

Troubleshooting Steps Taken:

  1. Audio Input Verification:

    • Confirmed that the audio input is in pcm16 format and adheres to the API’s specifications.
    • Tested the audio input with other transcription services to ensure its clarity and quality.
  2. Session Configuration Review:

    • Ensured that the input_audio_transcription parameter is correctly set to {"model": "whisper-1"} during session initialization.
  3. Event Monitoring:

    • Set up listeners for events such as conversation.item.input_audio_transcription.completed and conversation.item.input_audio_transcription.failed (see the sketch after this list).
    • No transcription.failed events were received, and the transcription.completed events contain null transcripts.
  4. Rate Limits Check:

    • Monitored rate limit updates to ensure that the API usage is within allowed thresholds.
    • Sample log entry:
      {
        "type": "rate_limits.updated",
        "event_id": "event_Aic4uJyq7sQQHdZY1QBbP",
        "rate_limits": [
          {
            "name": "requests",
            "limit": 5000,
            "remaining": 4999,
            "reset_seconds": 0.012
          },
          {
            "name": "tokens",
            "limit": 400000,
            "remaining": 394947,
            "reset_seconds": 0.757
          }
        ]
      }
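
For completeness, the event monitoring mentioned in step 3 was wired up roughly as follows (a sketch; dc is the WebRTC data channel that carries the Realtime server events):

// Sketch of the listeners from step 3. Realtime server events arrive as
// JSON messages on the WebRTC data channel, declared here as "dc".
declare const dc: RTCDataChannel;

dc.addEventListener("message", (e: MessageEvent) => {
  const event = JSON.parse(e.data);
  switch (event.type) {
    case "conversation.item.input_audio_transcription.completed":
      // Expected to carry the user's transcript; arrives null in my sessions.
      console.log("transcript:", event.transcript);
      break;
    case "conversation.item.input_audio_transcription.failed":
      // Never observed in the sessions described above.
      console.error("transcription failed:", event.error);
      break;
  }
});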
      

Additional Information:

  • No errors or failure events are reported; the transcripts are simply null.

Request for Assistance:

I would appreciate any guidance or insights into resolving this transcription issue. Specifically:

  • Are there additional configurations required to enable transcription in the Realtime API when using WebRTC?
  • Are there known limitations or issues with the current Realtime API that could be causing this behavior?

Thank you for your support.

I couldn’t get it working either. I ended up using Gladia for transcription, which was sufficient for my use case.

I also noticed that you don’t even get the same properties back after the create-session request. If I do POST /v1/realtime/sessions with this body:

{
    "model": "gpt-4o-realtime-preview-2024-12-17",
    "input_audio_transcription": {
        "model": "whisper-1"
    }
}

then I do not get the same input_audio_transcription back in the response:

{
   ...
    "input_audio_format": "pcm16",
    "output_audio_format": "pcm16",
    "input_audio_transcription": null,
    "tool_choice": "auto",
    "temperature": 0.8,
   ...
}

Solution Found! :tada:

After some additional debugging and testing, I was able to resolve the issue with the transcription returning null in OpenAI’s Realtime API while using WebRTC for audio communication.

The Problem Recap:

Despite setting up the session with the whisper-1 model for transcription in the initialization parameters, the transcripts consistently returned as null.

The Solution:

The issue was resolved by explicitly sending a session.update message after the WebRTC data channel opens. This step ensures that the transcription model is correctly enabled during the session.

Updated Code Example:

{
  "event_id": "event_123",
  "type": "session.update",
  "session": {
    "modalities": ["text", "audio"],
    "instructions": "prompt",

    "input_audio_format": "pcm16",
    "output_audio_format": "pcm16",

    // Explicitly enabling the Whisper transcription model
    "input_audio_transcription": {
      "model": "whisper-1"
    },

    "turn_detection": {
      "type": "server_vad",
      "threshold": 0.5,
      "prefix_padding_ms": 300,
      "silence_duration_ms": 1000,
      "create_response": true
    },

    "temperature": 0.8,

    // Integer max token setting (instead of "inf")
    "max_response_output_tokens": 10000
  }
}
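
For anyone wiring this up, the update goes out over the data channel the moment it opens, roughly like this (a sketch; "oai-events" follows OpenAI's WebRTC example, and the payload is abbreviated to the fields that matter here):

// Sketch: create the data channel and send session.update the moment it
// opens. "oai-events" is the channel name from OpenAI's WebRTC example;
// the payload is the session.update object shown above.
const pc = new RTCPeerConnection();
const dc = pc.createDataChannel("oai-events");

dc.addEventListener("open", () => {
  dc.send(
    JSON.stringify({
      type: "session.update",
      session: {
        modalities: ["text", "audio"],
        input_audio_transcription: { model: "whisper-1" },
        // ...remaining fields from the example above
      },
    })
  );
});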

Key Notes:

  1. The session.update call is triggered after the data channel opens to explicitly enable transcription.
  2. The input_audio_transcription field is explicitly set with the whisper-1 model to activate transcription.
  3. Fixed max_response_output_tokens to 10000 instead of 'inf' to align with API standards.

Final Thoughts:

This approach resolved the issue, and transcripts started coming through as expected. I hope this helps anyone facing similar challenges! :rocket:

Let me know if you need more details or clarification. :blush:

4 Likes

Thanks for the idea @paras_borad, but the documentation for the OpenAI Realtime session states:

Maximum number of output tokens for a single assistant response, inclusive of tool calls. Provide an integer between 1 and 4096 to limit output tokens

I have also tried session.update with the event_id I got from the session.created event, but I got an error. Here is my request body:

{
  "model": "gpt-4o-mini-realtime-preview",
  "type": "session.update",
  "session": {
    "modalities": ["text", "audio"],
    "instructions": "prompt",
    "input_audio_format": "pcm16",
    "output_audio_format": "pcm16",
    "input_audio_transcription": {
      "model": "whisper-1"
    },
    "turn_detection": {
      "type": "server_vad",
      "threshold": 0.5,
      "prefix_padding_ms": 300,
      "silence_duration_ms": 1000,
      "create_response": true
    },
    "temperature": 0.8
  }
}

The error response:

{
  "error": {
    "message": "Unknown parameter: 'type'.",
    "type": "invalid_request_error",
    "param": "type",
    "code": "unknown_parameter"
  }
}

Can you please help me with this? Where did I make a mistake?

Hey!

The issue here is that the model parameter should not be included in the session.update request.

Just remove this part:

"model": "gpt-4o-mini-realtime-preview",

and try again.

For more details, you can check the documentation here.

Let me know if it works now! :blush:

1 Like

Hey there, thank you for the response. I have made the changes you mentioned, and this is the response I get:

{
  "error": {
    "message": "Missing required parameter: 'model'.",
    "type": "invalid_request_error",
    "param": "model",
    "code": "missing_required_parameter"
  }
}

I have also tried without event_id; it gives me the same error. Can you please verify this?

Thanks. My initial turn_detection config was not being applied either. Sending session.update with the expected config after the data channel opens solved it.

Hi @maybegrt ,

Could you please give me an overview of how you implement session.update once the data channel opens? When I tried to do it using OpenAI’s /realtime/sessions API, it kept throwing an error. Additionally, it doesn’t seem to recognize "type": "session.update" when I attempt to update the session.

Your help would mean a lot. Thank you!

It seems that you are trying to use the session.update call, but based on your example you are actually creating a session rather than updating an existing one.

Important Notes:

  1. Session Update API (session.update) only works when a session is already open and active.
  2. You cannot use the POST request shown in your example to update a session; session.update only works over an already-open data channel.
  3. If your goal is to create a new session, then you should use the session create API instead.

Session Create API Documentation

To create a new session, you should follow the documentation here:
:point_right: Session Create API Documentation

Session Update API Documentation

To update an existing session, check the documentation here:
:point_right: Session Update API Documentation


Key Fix:

If you need to create a session, ensure you pass the required ‘model’ parameter explicitly in the body, as shown in the Session Create API example.

Example for Session Create:

POST https://api.openai.com/v1/realtime/sessions

{
  "model": "gpt-4o-realtime-preview-2024-12-17",
  "instructions": "Your instructions here",
  "modalities": ["audio", "text"],
  "temperature": 0.8
}

If you already have a session open and want to update it, then send a session.update event over the open data channel, as described in the Session Update documentation.
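
Put side by side, the two paths look roughly like this (a sketch; apiKey and dc are assumed from the setup described above):

// apiKey and dc are assumed from the setup described earlier in the thread.
declare const apiKey: string;
declare const dc: RTCDataChannel;

// Create: an HTTPS POST that requires the model parameter.
await fetch("https://api.openai.com/v1/realtime/sessions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiKey}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ model: "gpt-4o-realtime-preview-2024-12-17" }),
});

// Update: a session.update event sent over the already-open data channel.
// No HTTP request, and no model parameter.
dc.send(
  JSON.stringify({
    type: "session.update",
    session: { input_audio_transcription: { model: "whisper-1" } },
  })
);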

Let me know if you need further clarification! :blush:

1 Like

Thank you @paras_borad for such a detailed response; I have finally solved the issue. Thank you for your time.

2 Likes