Retrieving user response from Realtime Voice WebRTC

I'm developing a chatbot with real-time voice-to-voice using the new WebRTC support.
When using voice-to-voice, which parameter returns your transcribed text after your speaking turn ends? The docs say the transcription model defaults to whisper-1, but when I look at the transcript in conversation.item.created or conversation.item.input_audio_transcription.completed, it comes back null.

https://platform.openai.com/docs/api-reference/realtime-client-events/conversation/item/create

or

https://platform.openai.com/docs/api-reference/realtime-server-events/conversation/item/input_audio_transcription/completed
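
For reference, the completed server event is documented to carry the user transcript at the top level; it looks roughly like this (example values made up):

    {
        "event_id": "event_2122",
        "type": "conversation.item.input_audio_transcription.completed",
        "item_id": "item_003",
        "content_index": 0,
        "transcript": "hello how are you"
    }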

In my logs you can see the transcript parameter is null for conversation.item.created:

    {
        "type": "conversation.item.created",
        "event_id": "event_AgD4grkMnnQAX5meG0uSX",
        "previous_item_id": null,
        "item": {
            "id": "item_AgD4gCkEkQZuilVoJrVs2",
            "object": "realtime.item",
            "type": "message",
            "status": "completed",
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "transcript": null
                }
            ]
        }
    }

But in the docs:

    {
        "event_id": "event_1920",
        "type": "conversation.item.created",
        "previous_item_id": "msg_002",
        "item": {
            "id": "msg_003",
            "object": "realtime.item",
            "type": "message",
            "status": "completed",
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "transcript": "hello how are you",
                    "audio": "base64encodedaudio=="
                }
            ]
        }
    }

The transcript parameter is filled.

Does anyone have any idea why this value is coming back null?

Hi,

Have you updated your session to turn on audio input transcription? By default it is off:

      "https://api.openai.com/v1/realtime/sessions",
      {
        method: "POST",
        headers: {
          Authorization: `Bearer ${process.env.OPEN_AI_API_KEY}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify({
          model: "gpt-4o-realtime-preview-2024-12-17",
          voice: "alloy",
          modalities: ["audio", "text"],
          instructions: instructions,
          input_audio_transcription: {
            model: "whisper-1",
          },
          temperature: 1.1,
        }),
      }
    );
    const openAISessionData = await response.json();
    console.log("Session data received:", openAISessionData);
    console.log("OpenAI session created successfully");

    // Return ephemeral session info
    return res.status(200).json({
      success: true,
      sessionData: {
        id: openAISessionData.id,
        token: openAISessionData.client_secret.value,
        model: openAISessionData.model,
        object: openAISessionData.object,
        expires_at: openAISessionData.client_secret.expires_at,
        modalities: openAISessionData.modalities,
        url: "https://api.openai.com/v1/realtime",
        input_audio_transcription: openAISessionData.input_audio_transcription,
        turn_detection: openAISessionData.turn_detection,
        temperature: openAISessionData.temperature,
      },
    });

This is what I have, unless I'm just supposed to set input_audio_transcription to true.

I honestly don't know, lol. I've just read the docs to try and help you out, but I haven't implemented the Realtime API myself yet.

What happens if you try to set it to true in a session update message after you’ve created the session?
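
Something like this sent over the data channel, I'd guess (just a sketch based on the docs; dc stands for whatever your open RTCDataChannel variable is):

    // Sketch: enable input audio transcription after the session already exists
    dc.send(
      JSON.stringify({
        type: "session.update",
        session: {
          input_audio_transcription: { model: "whisper-1" },
        },
      }),
    );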

It's being set to true, I think, since I have that parameter added, but I'm never getting the correct event back.

For more context, here is the front-end code:


  // Update the handleWebRTCResponse function
  const handleWebRTCResponse = async (serverEvent) => {
    if (serverEvent.type === "response.done") {
      // Extract text from the transcript in the audio content
      const responseText =
        serverEvent.response.output[0]?.content[0]?.transcript;
      if (responseText) {
        try {
          // Update local messages state first
          setMessages((prev) => [
            ...prev,
            {
              role: "assistant",
              content: responseText,
              isWebRTC: true,
              timestamp: Date.now(),
            },
          ]);

          // Then try to save to Firebase
          const response = await callAPI("saveWebRTCConversation", {
            sessionId: uniqueId,
            role: "assistant",
            content: responseText,
            timestamp: Date.now(),
          });

          if (!response.success) {
            console.error(
              "Failed to save message to Firebase:",
              response.error,
            );
            // Message is still in local state, but failed to save to Firebase
          }
        } catch (error) {
          console.error("Error saving assistant WebRTC message:", error);
          // Message is still in local state, but failed to save to Firebase
        }
      }
    } else if (
      serverEvent.type ===
      "conversation.item.input_audio_transcription.completed"
    ) {
      const transcript = serverEvent.transcript;
      if (transcript) {
        setMessages((prev) => [
          ...prev,
          {
            role: "user",
            content: transcript,
            isWebRTC: true,
            timestamp: Date.now(),
          },
        ]);
        const response = await callAPI("saveWebRTCConversation", {
          sessionId: uniqueId,
          role: "user",
          content: transcript,
          timestamp: Date.now(),
        });
        if (!response.success) {
          console.error("Failed to save user transcription:", response.error);
        }
      }
    }
  };

  // Also handle the output_item.done event
  const handleEvent = async (e) => {
    try {
      const serverEvent = JSON.parse(e.data);

      switch (serverEvent.type) {
        case "response.done":
          console.log("=== ASSISTANT MESSAGE ===");
          await handleWebRTCResponse(serverEvent);
          break;
        case "conversation.item.input_audio_transcription.completed":
          console.log("=== USER MESSAGE ===");
          await handleWebRTCResponse(serverEvent);
          break;
        default:
      }
    } catch (error) {
      console.error("Error in handleEvent:", error);
    }
  };

  // Add logging to the data channel setup
  useEffect(() => {
    if (dataChannel) {
      dataChannel.addEventListener("message", handleEvent);
      return () => {
        dataChannel.removeEventListener("message", handleEvent);
      };
    }
  }, [dataChannel]);

We’re never receiving any logs for conversation.item.input_audio_transcription.completed

In handleWebRTCResponse, it's accurately getting the text from response.done, but never from the conversation.item.input_audio_transcription.completed case.
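
To be clear, even a catch-all listener like this rough sketch never shows any transcription event arriving:

    // Debugging sketch: log every server event type that comes over the data channel
    dataChannel.addEventListener("message", (e) => {
      const evt = JSON.parse(e.data);
      console.log("server event:", evt.type);
    });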

I have the same issue. I'm setting the session param like this: input_audio_transcription: { model: 'whisper-1' },

but when I get the session object back, it comes back as "input_audio_transcription": null.

This is my code:

    const openAIResponse = await fetch('https://api.openai.com/v1/realtime/sessions', {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`, // ENV variable in Convex
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        model: 'gpt-4o-mini-realtime-preview-2024-12-17',
        voice: 'alloy',
        input_audio_transcription: { model: 'whisper-1' },
        instructions: instructions,
      }),
    });


Yep, that follows the documentation, as mine does. We're not getting the response we should be.

OpenAI Forum leaders, can you please figure this out?

Hi, since you're using the WebRTC version, did you run into the problem of the AI looping and talking to itself? I just sent over a "Hi", and it keeps greeting me with "Hey", "Hello there", and so on. It seems like it's hearing its own audio and responding. Do you know how to fix this issue?

That has happened to me once or twice, but not consistently. I chalked it up to it being too sensitive, but maybe you're right and it's hearing itself.

With me, it’s happening every time. I’m unable to have a simple conversation since it’s constantly talking to itself. When I put the speakers on mute, and read my logs, I can see the AI only responds to my questions, and not to itself. I made a post.

If you see something obviously wrong, please let me know!

What device are you using? I looked at your post and don't see anything weird. The only thing I can think of is that, between your device settings and the device itself, you're getting feedback.
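
One thing that might be worth trying is forcing echo cancellation when you grab the mic, something like this (just a sketch, I haven't verified it fixes the looping):

    // Ask the browser for echo cancellation / noise suppression on the mic track,
    // so the model is less likely to pick up its own audio from the speakers
    const stream = await navigator.mediaDevices.getUserMedia({
      audio: {
        echoCancellation: true,
        noiseSuppression: true,
      },
    });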

Have you verified if the transcription failed? The failure arrives in another event:

conversation.item.input_audio_transcription.failed
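
For example, another case in your handleEvent switch would surface it (a fragment, not standalone code; the event should carry an error object explaining why transcription failed):

    // Sketch: surface transcription failures alongside the other cases
    case "conversation.item.input_audio_transcription.failed":
      console.error("Transcription failed:", serverEvent.error);
      break;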

You should try it yourself; it simply doesn't work. You never get anything past input_audio_buffer.committed. We never see any transcription-failed event when I log all the events coming over WebRTC.

It’s very much a bug in their API and if you search the forum people have been complaining about this since earlier this year

No, it's not. Someone mentioned in another post that you actually have to send the input audio transcription config in a "session.update" after creating the session. It just doesn't work if you do it at session creation.


This is correct. It only works if you update the session. I do it as soon as I get the session.created event:

    const handleSessionCreated = useCallback((event) => {
        console.log('Session created:', event.session);
        const updateEvent = {
            type: 'session.update',
            event_id: sessionId.current,
            session: {
                input_audio_transcription: { model: 'whisper-1' },
            },
        };
        emitEvent(updateEvent);
    }, []);