Error: "Unknown parameter: 'session'" when using OpenAI Realtime API

Hello,
I am currently developing a real-time speech-to-text application using OpenAI’s Realtime API. However, I keep encountering the following error when sending audio data to the API:

{
  type: 'error',
  event_id: 'event_AWc6bRvTpA5q4zzyFfByl',
  error: {
    type: 'invalid_request_error',
    code: 'unknown_parameter',
    message: "Unknown parameter: 'session'.",
    param: 'session',
    event_id: null
  }
}

I cannot figure out the root cause of this issue and would greatly appreciate your guidance.


What I am trying to achieve:

  1. Send real-time audio data from an Android app to OpenAI’s Realtime API for transcription.

  2. Use a Node.js server to act as a proxy, relaying the audio data from the Android app to the API.


Development environment:

  • Server: Node.js v14 with WebSocket server (Port: 3000)
  • Client: Android app (Java)
  • OpenAI Model: gpt-4o-realtime-preview-2024-10-01
  • Node.js dependencies: ws, mic, dotenv

Key parts of my server code:

  1. Session initialization (to start the transcription session):
const sessionRequest = {
  type: 'session.update',
  modalities: ['audio', 'text'],
  instructions: 'Transcribe audio in real-time.',
  input_audio_format: 'pcm16',
  input_audio_transcription: { model: 'whisper-1' },
  turn_detection: {
    type: 'server_vad',
    threshold: 0.5,
    prefix_padding_ms: 300,
    silence_duration_ms: 500,
  },
};

ws.send(JSON.stringify(sessionRequest));

  2. Sending audio data to OpenAI:
if (sessionId) {
  ws.send(
    JSON.stringify({
      type: 'input_audio_buffer.append',
      session: sessionId,
      audio: message.toString('base64'),
      encoding: 'pcm16',
    })
  );

  ws.send(
    JSON.stringify({
      type: 'input_audio_buffer.commit',
      session: sessionId,
    })
  );
}

Android app code:

The Android app captures audio in real-time and sends it to the local WebSocket server (port 3000). Below is the relevant part of the code:

public void sendAudioData(byte[] audioData) {
    if (webSocket != null) {
        String base64Audio = Base64.encodeToString(audioData, Base64.NO_WRAP);
        webSocket.send(base64Audio);
        Log.d("RealTimeTranslationService", "Audio data sent: " + base64Audio.substring(0, 100)); // Log the first 100 characters of the data
    } else {
        Log.e("RealTimeTranslationService", "WebSocket is not connected.");
    }
}

Problem description:

  1. The OpenAI API returns the error Unknown parameter: 'session', indicating that the session parameter is not recognized.

  2. My understanding from the documentation was that the session parameter is required, yet the API rejects it as unknown.


Questions:

  1. How can I resolve the “unknown parameter: ‘session’” error?
  2. Is there an issue with the data being sent from the Android app or the server to the OpenAI API?
  3. Is my request structure to the OpenAI API correct, or are there updates to the Realtime API documentation I may have missed?

Any guidance or insights would be greatly appreciated. Thank you for your help!

Neither input_audio_buffer.append nor input_audio_buffer.commit takes a session parameter. Check the API reference.
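
For reference, a corrected append call would look roughly like this; the session is identified by the WebSocket connection itself, so no ID is passed (and input_audio_buffer.commit carries only its type field):

ws.send(
  JSON.stringify({
    type: 'input_audio_buffer.append',
    // base64-encoded PCM16; the audio format is configured via session.update, not per event
    audio: message.toString('base64'),
  })
);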

Also, if you are using server-side VAD, there is no point in sending commit events at all, and your code appears to send a commit after every audio buffer, which doesn't make sense in any scenario.

Each commit creates a conversation item, and committing such a small audio buffer is pointless because no meaningful semantics can be captured from such a short chunk.

The input_audio_buffer.commit event only needs to be sent when you are not using server-side VAD but your own VAD instead. With it, you can create a conversation item in the session right after you have detected that the user has stopped talking, and then send a response.create to trigger a response.

In your case, though, you should avoid using it.
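
For reference only, that manual-VAD flow would look roughly like this (userStoppedTalking is a hypothetical placeholder for your own end-of-speech detection):

// keep appending audio chunks as they arrive
ws.send(
  JSON.stringify({
    type: 'input_audio_buffer.append',
    audio: chunk.toString('base64'),
  })
);

// once your own VAD decides the turn is over,
// commit the buffer (creating the conversation item) and request a response
if (userStoppedTalking) {
  ws.send(JSON.stringify({ type: 'input_audio_buffer.commit' }));
  ws.send(JSON.stringify({ type: 'response.create' }));
}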

Regarding your Android app code, I doubt it works correctly, because you are sending bare base64-encoded audio chunks over the WebSocket, whereas you need to send a properly formed event JSON, just like in your JS code above.


Additional, tangential point: you should not use this API for real-time transcription only.

It’s built around a multimodal model, and transcription itself is not an intended functionality. The model is not tuned for that task, and the fact that it can receive audio input does not change this. In simple terms, the model can think about your audio, but it is not built to turn audio into text.

But the bigger factor is that this API is not cost-effective for this use case compared with the alternatives.
