How to use transcription only with the Realtime TypeScript SDK

Hi!

I’ve been struggling with the newly released Realtime API. I want realtime transcription only (without the agent talking back), but I’ve had no success yet.

I followed the TypeScript SDK documentation here

With that, I’ve been able to successfully set up an agent that I can speak to through the microphone and hear respond through the speakers. With a few modifications, I could also log the transcription of what I said.

Here is the code:

import { RealtimeAgent, RealtimeSession } from "@openai/agents-realtime";

const agent = new RealtimeAgent({
  name: "Assistant",
  instructions: "You are a helpful assistant.",
});

const session = new RealtimeSession(agent, {
  model: "gpt-realtime",
  config: {
    inputAudioTranscription: {
      model: 'gpt-4o-transcribe',
      prompt: "Expect words related to programming, development, and technology.",
      language: 'es'
    }
  },
});


export async function connect() {
    try {

      await session.connect({
        // To get this ephemeral key string, you can run the following command or implement the equivalent on the server side:
        // curl -s -X POST https://api.openai.com/v1/realtime/client_secrets -H "Authorization: Bearer $OPENAI_API_KEY" -H "Content-Type: application/json" -d '{"session": {"type": "realtime", "model": "gpt-realtime"}}' | jq .value
        apiKey: 'ek_1234',
      });

      console.log('You are connected!');
      console.log("Transport:", session.transport);

      session.transport.on('error', (event) => {
        console.log('Transport error', event);
      });

      session.transport.on('session.created', (event) => {
        console.log('Session created', event);
      });

      session.transport.on('session.updated', (event) => {
        console.log('Session updated', event);
      });

      session.transport.on('conversation.item.input_audio_transcription.completed', (event) => {
        console.log('Audio transcription completed', event);
      });

      session.transport.on('conversation.item.input_audio_transcription.failed', (event) => {
        console.log('Audio transcription failed', event);
      });

    } catch (e) {
      console.error(e);
    }
}
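For reference, the server-side equivalent of the curl command in that comment might look like this in TypeScript. This is only a sketch: the helper names are mine, Node 18+ is assumed for the built-in fetch, and the endpoint and response field are taken from the comment above.

```typescript
// Sketch of the server-side equivalent of the curl command in the comment.
// Helper names are hypothetical; Node 18+ is assumed for the built-in fetch.

// Build the JSON body for POST /v1/realtime/client_secrets.
function buildClientSecretBody(model: string): string {
  return JSON.stringify({ session: { type: "realtime", model } });
}

// Exchange the long-lived API key (kept on the server) for a short-lived
// ephemeral key ("ek_...") that is safe to send to the browser.
async function mintEphemeralKey(apiKey: string): Promise<string> {
  const res = await fetch("https://api.openai.com/v1/realtime/client_secrets", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: buildClientSecretBody("gpt-realtime"),
  });
  const data = await res.json();
  return data.value;
}
```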

When I run the application I get these logs in the console:

And then the agent talks back.

But I just want the transcription; I don’t want the audio response. After many attempts, this is as far as I could get:

  1. Get an ephemeral key for the type “transcription” instead of “realtime”. So the body sent to https://api.openai.com/v1/realtime/client_secrets is {"session": {"type": "transcription"}}

When running the code again I get this error in the console:

Passing a realtime session update event to a transcription session is not allowed.

These update events are sent automatically. I never trigger any event.

I’ve tried setting the output to text only, and changing the instructions to say that I don’t want any audio response, but it all seems to be ignored. And this would probably be much less efficient than only transcribing anyway.

Probably I’m using the SDK wrong. Maybe I should go deeper and take more control over the WebRTC protocol? Any hints?

Thanks in advance


I’ve figured it out using the WebRTC connection directly and skipping the SDK.

This is what I did:

First generate the ephemeral key like this:

POST https://api.openai.com/v1/realtime/client_secrets
{
    "session": {
        "type": "transcription",
        "audio": {
            "input": {
                "transcription": {
                    "language": "es",
                    "model": "gpt-4o-transcribe",
                    "prompt": "Expect words related to programming, development, and technology."
                },
                "noise_reduction": {
                    "type": "near_field"
                }
            }
        }
    }
}

It’s very important to specify the audio.input.transcription parameters. Otherwise the transcription events won’t be sent by the server. The language and prompt parameters really help to get a better transcription.

This will return the ephemeral key in the value field of the response body.
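Minting that key server-side can be sketched in TypeScript like this. The body is the same JSON as above; the helper names are mine, and Node 18+ is assumed for the built-in fetch.

```typescript
// Sketch: mint a transcription-only ephemeral key on the server.
// The body mirrors the request shown above; helper names are hypothetical.

function buildTranscriptionSecretBody(): string {
  return JSON.stringify({
    session: {
      type: "transcription",
      audio: {
        input: {
          transcription: {
            language: "es",
            model: "gpt-4o-transcribe",
            prompt: "Expect words related to programming, development, and technology.",
          },
          noise_reduction: { type: "near_field" },
        },
      },
    },
  });
}

// POST the body with the long-lived API key and return the `value` field.
async function mintTranscriptionKey(apiKey: string): Promise<string> {
  const res = await fetch("https://api.openai.com/v1/realtime/client_secrets", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: buildTranscriptionSecretBody(),
  });
  const data = await res.json();
  return data.value;
}
```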

Then in the client code I did this:

const EPHEMERAL_KEY = 'ek_1234'; // Replace with the ephemeral key obtained in the previous step

let dc: RTCDataChannel;

export async function connectWebRTC() {
  console.log("Connecting via WebRTC...");

  // Create a peer connection
  const pc = new RTCPeerConnection();

  // Add local audio track for microphone input in the browser
  const ms = await navigator.mediaDevices.getUserMedia({
      audio: true,
  });

  pc.addTrack(ms.getTracks()[0]);

  // Set up data channel for sending and receiving events
  dc = pc.createDataChannel("oai-events");

  // Start the session using the Session Description Protocol (SDP)
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);

  const baseUrl = "https://api.openai.com/v1/realtime/calls";
  const sdpResponse = await fetch(baseUrl, {
      method: "POST",
      body: offer.sdp,
      headers: {
          Authorization: `Bearer ${EPHEMERAL_KEY}`,
          "Content-Type": "application/sdp",
      },
  });

  // Annotate the type so TypeScript narrows "answer" to RTCSdpType
  // instead of inferring a plain string.
  const answer: RTCSessionDescriptionInit = {
      type: "answer",
      sdp: await sdpResponse.text(),
  };

  await pc.setRemoteDescription(answer);

  dc.addEventListener("open", (e) => {
      console.log("Data channel open", e);
  });

  // Listen for server events
  dc.addEventListener("message", (e) => {
      const event = JSON.parse(e.data);

      if (event.type === "conversation.item.input_audio_transcription.completed") {
        console.log("Transcription:", event.transcript);
      }

  });

  dc.addEventListener("error", (e) => {
      console.log(e);
  });

}

Then just call connectWebRTC() to establish the connection.
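The message handler can also be factored into a small pure helper, which makes it easy to unit-test. This is a sketch assuming the event shape used above (a type field, plus a transcript field on completed events); the helper name is mine.

```typescript
// Hypothetical helper: extract the transcript from a raw data-channel
// message, or return null for any other event type.
function extractTranscript(raw: string): string | null {
  const event = JSON.parse(raw);
  if (event.type === "conversation.item.input_audio_transcription.completed") {
    return event.transcript ?? null;
  }
  return null;
}
```

The `message` listener then reduces to `const t = extractTranscript(e.data); if (t) console.log("Transcription:", t);`.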

For simplicity’s sake I just copy-pasted the ephemeral key into the code. In a real application that key should be provided by a server endpoint.

In the console I can see the transcription after I talk. It works pretty well, and it’s fast 🙂

I based the example on this documentation: https://platform.openai.com/docs/guides/realtime-webrtc. But that is for speech-to-speech implementations. I think it would be nice to have one for transcription only.


If you wish to continue using the Agents SDK, try the following.

  1. Create your ephemeral key the same way, i.e. with "type": "transcription".
  2. When creating the RealtimeSession instance, remove the options parameter that accepts the RealtimeSessionOptions instance. In other words, this:

const session = new RealtimeSession(agent, {
  model: "gpt-realtime",
  config: { /* ... */ },
});

now simply becomes:

const session = new RealtimeSession(agent);

Hope this solves your issue.
