Issues with Transcription in Realtime Model Using WebRTC

I’m experiencing difficulties with the transcription feature while using OpenAI’s Realtime API with WebRTC for audio-to-audio communication. Although the session is configured to enable audio transcription, the transcripts received are consistently null.

Configuration Details:

  • Model: gpt-4o-mini-realtime-preview
  • Session Initialization Parameters:
    {
      "model": "gpt-4o-mini-realtime-preview",
      "instructions": "Your prompt here",
      "modalities": ["audio", "text"],
      "input_audio_transcription": {
        "model": "whisper-1"
      },
      "voice": "alloy",
      "input_audio_format": "pcm16",
      "output_audio_format": "pcm16",
      "turn_detection": {
        "type": "server_vad",
        "threshold": 0.5,
        "prefix_padding_ms": 300,
        "silence_duration_ms": 200
      },
      "temperature": 0.8,
      "max_response_output_tokens": 10000
    }
    

Observed Behavior:

  • Upon sending audio input, the conversation.item.created event is triggered with the following payload:
    {
      "type": "conversation.item.created",
      "event_id": "event_Aic5ksNwMPcAhZI5CbHDA",
      "previous_item_id": "item_Aic5bbkDBGqtOSoHFd9Hw",
      "item": {
        "id": "item_Aic5jI1it6HnAiKo6nSZ6",
        "object": "realtime.item",
        "type": "message",
        "status": "completed",
        "role": "user",
        "content": [
          {
            "type": "input_audio",
            "transcript": null
          }
        ]
      }
    }
    
  • The transcript field remains null, indicating that the transcription did not occur as expected.

Troubleshooting Steps Taken:

  1. Audio Input Verification:

    • Confirmed that the audio input is in pcm16 format and adheres to the API’s specifications.
    • Tested the audio input with other transcription services to ensure its clarity and quality.
  2. Session Configuration Review:

    • Ensured that the input_audio_transcription parameter is correctly set to {"model": "whisper-1"} during session initialization.
  3. Event Monitoring:

    • Set up listeners for events such as conversation.item.input_audio_transcription.completed and conversation.item.input_audio_transcription.failed (see the listener sketch after this list).
    • No transcription.failed events were received, and the transcription.completed events contain null transcripts.
  4. Rate Limits Check:

    • Monitored rate limit updates to ensure that the API usage is within allowed thresholds.
    • Sample log entry:
      {
        "type": "rate_limits.updated",
        "event_id": "event_Aic4uJyq7sQQHdZY1QBbP",
        "rate_limits": [
          {
            "name": "requests",
            "limit": 5000,
            "remaining": 4999,
            "reset_seconds": 0.012
          },
          {
            "name": "tokens",
            "limit": 400000,
            "remaining": 394947,
            "reset_seconds": 0.757
          }
        ]
      }
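
For reference, the listeners in step 3 look roughly like this (a minimal sketch, assuming the session’s events arrive over a WebRTC data channel stored in dc):

// Sketch only: log the transcription lifecycle events as they arrive.
dc.addEventListener("message", (e) => {
  const event = JSON.parse(e.data);
  if (event.type === "conversation.item.input_audio_transcription.completed") {
    console.log("Transcript:", event.transcript);
  } else if (event.type === "conversation.item.input_audio_transcription.failed") {
    console.error("Transcription failed:", event.error);
  }
});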
      

Additional Information:

  • No errors or failure events are reported; the transcripts are simply null.

Request for Assistance:

I would appreciate any guidance or insights into resolving this transcription issue. Specifically:

  • Are there additional configurations required to enable transcription in the Realtime API when using WebRTC?
  • Are there known limitations or issues with the current Realtime API that could be causing this behavior?

Thank you for your support.


I couldn’t get it working either. Ended up using Gladia for transcript, which was sufficient for my use case.

I also noticed that you don’t even get the same properties back after the create-session request. If I do POST /v1/realtime/sessions with this body:

{
    "model": "gpt-4o-realtime-preview-2024-12-17",
    "input_audio_transcription": {
        "model": "whisper-1"
    }
}

then I do not get the same input_audio_transcription back in the response:

{
   ...
    "input_audio_format": "pcm16",
    "output_audio_format": "pcm16",
    "input_audio_transcription": null,
    "tool_choice": "auto",
    "temperature": 0.8,
   ...
}

Solution Found! :tada:

After some additional debugging and testing, I was able to resolve the issue with the transcription returning null in OpenAI’s Realtime API while using WebRTC for audio communication.

The Problem Recap:

Despite setting up the session with the whisper-1 model for transcription in the initialization parameters, the transcripts consistently returned as null.

The Solution:

The issue was resolved by explicitly sending a session.update message after the WebRTC data channel opens. This step ensures that the transcription model is correctly enabled during the session.

Updated Code Example:

{
  'event_id': 'event_123',
  'type': 'session.update',
  'session': {
    'modalities': ['text', 'audio'],
    'instructions': 'prompt',

    'input_audio_format': 'pcm16',
    'output_audio_format': 'pcm16',

    // Explicitly enabling Whisper transcription model
    'input_audio_transcription': {
      'model': 'whisper-1',
    },

    'turn_detection': {
      'type': 'server_vad',
      'threshold': 0.5,
      'prefix_padding_ms': 300,
      'silence_duration_ms': 1000,
      'create_response': true,
    },

    'temperature': 0.8,

    // Correct max token setting
    'max_response_output_tokens': 10000,
  },
}
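
Here is roughly how that message can be sent once the channel opens (a minimal sketch; assumes pc is your RTCPeerConnection, with the data channel name from OpenAI’s WebRTC example):

// Sketch only: enable Whisper transcription as soon as the channel opens.
const dc = pc.createDataChannel("oai-events");

dc.addEventListener("open", () => {
  dc.send(JSON.stringify({
    type: "session.update",
    session: {
      input_audio_transcription: { model: "whisper-1" },
      // ...plus the remaining session fields from the payload above
    },
  }));
});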

Key Notes:

  1. The session.update call is triggered after the data channel opens to explicitly enable transcription.
  2. The input_audio_transcription field is explicitly set with the whisper-1 model to activate transcription.
  3. Fixed max_response_output_tokens to 10000 instead of 'inf' to align with API standards.

Final Thoughts:

This approach resolved the issue, and transcripts started coming through as expected. I hope this helps anyone facing similar challenges! :rocket:

Let me know if you need more details or clarification. :blush:


Thanks for the idea @paras_borad, but the documentation for the OpenAI Realtime session states that:

Maximum number of output tokens for a single assistant response, inclusive of tool calls. Provide an integer between 1 and 4096 to limit output tokens

I have also tried session.update with the event_id that I got from the session.created event, and I got an error. Here is my request body, followed by the error response:

{
  "model": "gpt-4o-mini-realtime-preview",
  "type": "session.update",
  "session": {
    "modalities": ["text", "audio"],
    "instructions": "prompt",
    "input_audio_format": "pcm16",
    "output_audio_format": "pcm16",
    "input_audio_transcription": {
      "model": "whisper-1"
    },
    "turn_detection": {
      "type": "server_vad",
      "threshold": 0.5,
      "prefix_padding_ms": 300,
      "silence_duration_ms": 1000,
      "create_response": true
    },
    "temperature": 0.8
  }
}

{
  "error": {
    "message": "Unknown parameter: 'type'.",
    "type": "invalid_request_error",
    "param": "type",
    "code": "unknown_parameter"
  }
}

Can you please help me with this? Where did I make a mistake?

Hey!

The issue here is that the model parameter should not be included in the session.update request.

Just remove this part:

"model": "gpt-4o-mini-realtime-preview",

and try again.
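
The trimmed event would then look something like this (a sketch of your body, minus model):

{
  "type": "session.update",
  "session": {
    "modalities": ["text", "audio"],
    "input_audio_transcription": {
      "model": "whisper-1"
    }
  }
}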

For more details, you can check the documentation here.

Let me know if it works now! :blush:


Hey there, thank you for the response. I have made the changes as you mentioned, and this is the response:

{
  "error": {
    "message": "Missing required parameter: 'model'.",
    "type": "invalid_request_error",
    "param": "model",
    "code": "missing_required_parameter"
  }
}

I have also tried without event_id; it gives me the same error. Can you please verify this?

Thanks. My initial turn_detection config was not being applied either. Running session.update with the expected config after the data channel opens solved it.


Hi @maybegrt,

Could you please give me an overview of how you implement session.update once the data channel opens? When I tried to do it using OpenAI’s /realtime/sessions API, it kept throwing an error. Additionally, it doesn’t seem to recognize "type": "session.update" when I attempt to update the session.

Your help would mean a lot. Thank you!

It seems that you are trying to use the session.update API call, but based on your provided example, you are attempting to create a session rather than update an existing one.

Important Notes:

  1. Session Update API (session.update) only works when a session is already open and active.
  2. You cannot use the POST request shown in your example to update a session; session.update only works over an existing data channel.
  3. If your goal is to create a new session, then you should use the session create API instead.

Session Create API Documentation

To create a new session, you should follow the documentation here:
:point_right: Session Create API Documentation

Session Update API Documentation

To update an existing session, check the documentation here:
:point_right: Session Update API Documentation


Key Fix:

If you need to create a session, ensure you pass the required ‘model’ parameter explicitly in the body, as shown in the Session Create API example.

Example for Session Create:

POST https://api.openai.com/v1/realtime/sessions

{
  "model": "gpt-4o-realtime-preview-2024-12-17",
  "instructions": "Your instructions here",
  "modalities": ["audio", "text"],
  "temperature": 0.8
}
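
For illustration, that call might look like this from a Node server (a sketch; assumes your API key is in the OPENAI_API_KEY environment variable):

// Sketch only: mint a Realtime session server-side; the response includes
// the ephemeral client_secret that the browser uses for WebRTC.
const resp = await fetch("https://api.openai.com/v1/realtime/sessions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "gpt-4o-realtime-preview-2024-12-17",
    instructions: "Your instructions here",
    modalities: ["audio", "text"],
  }),
});
const session = await resp.json();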

If you already have a session open and want to update it, send the session.update event over the open data channel, as noted above.

Let me know if you need further clarification! :blush:


Thank you @paras_borad for such a detailed response. I have finally solved the issue. Thank you for your time.


YESSS … it’s solved … Thanks

@paras_borad are you having to convert the user audio stream to pcm16 for this to work?

Hi.

So I’m trying to build a frontend and backend application with HTML, CSS, and JavaScript on the frontend and Flask (Python) on the backend.

Here’s my backend code:



import os
import httpx
from flask import request, Flask, render_template, jsonify

# Initialize the Flask application.
app = Flask("voice_app")

# Serve the HTML page at the root route.
@app.route("/")
def index():
    try:
        return render_template("sample.html")
    except Exception as e:
        return "index.html not found", 404

# The /session endpoint
@app.route("/session", methods=["GET"])
def session_endpoint():
    openai_api_key = os.environ.get("OPENAI_API_KEY")
    if not openai_api_key:
        return jsonify({"error": "OPENAI_API_KEY not set"}), 500

    # Make a synchronous POST request to the OpenAI realtime sessions endpoint
    with httpx.Client() as client:
        r = client.post(
            "https://api.openai.com/v1/realtime/sessions",
            headers={
                "Authorization": f"Bearer {openai_api_key}",
                "Content-Type": "application/json",
            },
            json={
                "model": "gpt-4o-realtime-preview-2024-12-17",
                "voice": "verse",
                "instructions": "You are a English Tutor Ria"
            },
        )
        data = r.json()
        print(data)
        return jsonify(data)
    
@app.route("/update_session", methods=["POST"])
def update_session_endpoint():
    # Get the event_id from the request
    request_data = request.get_json()
    event_id = request_data.get("event_id")
    
    if not event_id:
        return jsonify({"error": "event_id is required"}), 400
    
    openai_api_key = os.environ.get("OPENAI_API_KEY")
    if not openai_api_key:
        return jsonify({"error": "OPENAI_API_KEY not set"}), 500
    
    # Make a synchronous POST request to the OpenAI realtime sessions endpoint
    with httpx.Client() as client:
        try:
            r = client.post(
                "https://api.openai.com/v1/realtime/sessions",
                headers={
                    "Authorization": f"Bearer {openai_api_key}",
                    "Content-Type": "application/json",
                },
                json={
                    "type": "session.update",
                    "session": {
                        "instructions": "your a math tutor alex",
                        "turn_detection": {
                            "type": "server_vad",
                            "threshold": 0.5,
                            "prefix_padding_ms": 300,
                            "silence_duration_ms": 500
                        },
                        "voice": "alloy",
                        "temperature": 1,
                        "max_response_output_tokens": 4096,
                        "modalities": ["text", "audio"],
                        "input_audio_format": "pcm16",
                        "output_audio_format": "pcm16",
                        "input_audio_transcription": {
                            "model": "whisper-1"
                        },
                        "tool_choice": "auto",
                        "tools": []
                    }
                }
            )
            r.raise_for_status()  # Raise an exception for HTTP errors
            data = r.json()
            print("Session update response:", data)
            return jsonify({"success": True, "data": data})
        except httpx.HTTPStatusError as e:
            print(f"HTTP error occurred: {e}")
            return jsonify({"error": f"HTTP error: {e.response.status_code}", "details": e.response.text}), e.response.status_code
        except httpx.RequestError as e:
            print(f"Request error occurred: {e}")
            return jsonify({"error": f"Request error: {str(e)}"}), 500
        except Exception as e:
            print(f"Unexpected error: {e}")
            return jsonify({"error": f"Unexpected error: {str(e)}"}), 500    


if __name__ == "__main__":
    # Run the Flask app on port 8116
    app.run(host="0.0.0.0", port=8116, debug=True)

So from the frontend, I’m initially hitting the session endpoint and getting the ephemeral token, and in the frontend I’m using a WebRTC connection to pass the SDP. Then, when I receive session.created, I take the event_id from the response JSON and pass it back here to send another request to update the session.
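
Roughly, that flow looks like this (a sketch based on OpenAI’s WebRTC flow, not my exact code; model name as in the backend above):

// Sketch only: mint an ephemeral token via the Flask /session endpoint,
// then run the SDP exchange against the Realtime endpoint.
const { client_secret } = await (await fetch("/session")).json();

const pc = new RTCPeerConnection();
const dc = pc.createDataChannel("oai-events");

const offer = await pc.createOffer();
await pc.setLocalDescription(offer);

const sdpResponse = await fetch(
  "https://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-12-17",
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${client_secret.value}`,
      "Content-Type": "application/sdp",
    },
    body: offer.sdp,
  }
);
await pc.setRemoteDescription({ type: "answer", sdp: await sdpResponse.text() });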

But I’m getting an error. Can you check and tell me what is going wrong here?