How to use transcription only with the Realtime TypeScript SDK

Hi!

I’ve been struggling with the newly released Realtime API. I want realtime transcription only (without the agent talking back), but I haven’t had any success yet.

I followed the documentation of the TypeScript SDK here

With that, I was able to set up an agent that I can speak to through the microphone and whose response I can hear on the speakers. With a few modifications, I could also log the transcription of what I said.

Here is the code:

import { RealtimeAgent, RealtimeSession } from "@openai/agents-realtime";

const agent = new RealtimeAgent({
  name: "Assistant",
  instructions: "You are a helpful assistant.",
});

const session = new RealtimeSession(agent, {
  model: "gpt-realtime",
  config: {
    inputAudioTranscription: {
      model: 'gpt-4o-transcribe',
      prompt: "Expect words related to programming, development, and technology.",
      language: 'es'
    }
  },
});


export async function connect() {
    try {

      await session.connect({
        // To get this ephemeral key string, you can run the following command or implement the equivalent on the server side:
        // curl -s -X POST https://api.openai.com/v1/realtime/client_secrets -H "Authorization: Bearer $OPENAI_API_KEY" -H "Content-Type: application/json" -d '{"session": {"type": "realtime", "model": "gpt-realtime"}}' | jq .value
        apiKey: 'ek_1234',
      });

      console.log('You are connected!');
      console.log("Transport:", session.transport);

      session.transport.on('error', (event) => {
        console.log('Transport error', event);
      });

      session.transport.on('session.created', (event) => {
        console.log('Session created', event);
      });

      session.transport.on('session.updated', (event) => {
        console.log('Session updated', event);
      });

      session.transport.on('conversation.item.input_audio_transcription.completed', (event) => {
        console.log('Audio transcription completed', event);
      });

      session.transport.on('conversation.item.input_audio_transcription.failed', (event) => {
        console.log('Audio transcription failed', event);
      });

    } catch (e) {
      console.error(e);
    }
}

When I run the application, the event logs appear in the console, and then the agent talks back.

But I just want the transcription, I don’t want the audio response. After many attempts this is as far as I could get:

  1. Get an ephemeral key for the type “transcription” instead of “realtime”. So the body sent to https://api.openai.com/v1/realtime/client_secrets is {"session": {"type": "transcription"}}

When running the code again I get this error in the console:

Passing a realtime session update event to a transcription session is not allowed.

These update events are sent automatically. I never trigger any event.

I’ve tried setting the output to text only and changing the instructions to say that I do not want any audio response, but it all seems to be ignored. This would probably also be much less efficient than transcription alone.

I’m probably using the SDK wrong. Maybe I should go deeper and take more control over the WebRTC connection? Any hints?

Thanks in advance

I’ve figured it out using the WebRTC connection directly and skipping the SDK.

This is what I did:

First generate the ephemeral key like this:

POST https://api.openai.com/v1/realtime/client_secrets
{
    "session": {
        "type": "transcription",
        "audio": {
            "input": {
                "transcription": {
                    "language": "es",
                    "model": "gpt-4o-transcribe",
                    "prompt": "Expect words related to programming, development, and technology."
                },
                "noise_reduction": {
                    "type": "near_field"
                }
            }
        }
    }
}

It’s very important to specify the audio.input.transcription parameters. Otherwise the transcription events won’t be sent by the server. The language and prompt parameters really help to get a better transcription.

This will return the ephemeral key in the value field of the response body.
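
In case it helps, here is a rough sketch of how the request above could be made from server-side TypeScript. The URL, the request body, and the value field come from this post; the function names, the FetchLike type, and the error handling are my own assumptions and not part of any SDK. The fetch implementation is passed in as a parameter so the sketch does not depend on a particular runtime.

```typescript
// Minimal fetch-like signature (an assumption for this sketch) so the code
// does not depend on DOM or Node typings. The global fetch satisfies it.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Builds the same session configuration as the request body shown above.
function buildTranscriptionSessionBody(language: string, prompt: string) {
  return {
    session: {
      type: "transcription",
      audio: {
        input: {
          transcription: { language, model: "gpt-4o-transcribe", prompt },
          noise_reduction: { type: "near_field" },
        },
      },
    },
  };
}

// Hypothetical helper: meant to run on the server, where the real API key lives.
async function createTranscriptionKey(
  apiKey: string,
  fetchImpl: FetchLike,
  language = "es",
  prompt = "Expect words related to programming, development, and technology.",
): Promise<string> {
  const response = await fetchImpl("https://api.openai.com/v1/realtime/client_secrets", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(buildTranscriptionSessionBody(language, prompt)),
  });
  if (!response.ok) {
    throw new Error(`client_secrets request failed with status ${response.status}`);
  }
  // The ephemeral key is returned in the `value` field of the response body.
  const data = (await response.json()) as { value?: unknown };
  if (typeof data.value !== "string" || data.value.length === 0) {
    throw new Error("Response did not contain a `value` field");
  }
  return data.value;
}
```

On a runtime with a global fetch, this could be called as createTranscriptionKey(apiKey, fetch), with the API key read from the server's environment.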

Then in the client code I did this:

const EPHEMERAL_KEY = 'ek_1234'; // Replace with the ephemeral key obtained in the previous step

let dc: RTCDataChannel;

export async function connectWebRTC() {
  console.log("Connecting via WebRTC...");

  // Create a peer connection
  const pc = new RTCPeerConnection();

  // Add local audio track for microphone input in the browser
  const ms = await navigator.mediaDevices.getUserMedia({
      audio: true,
  });

  pc.addTrack(ms.getTracks()[0]);

  // Set up data channel for sending and receiving events
  dc = pc.createDataChannel("oai-events");

  // Start the session using the Session Description Protocol (SDP)
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);

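  const baseUrl = "https://api.openai.com/v1/realtime/calls";
  const sdpResponse = await fetch(baseUrl, {
      method: "POST",
      body: offer.sdp,
      headers: {
          Authorization: `Bearer ${EPHEMERAL_KEY}`,
          "Content-Type": "application/sdp",
      },
  });

  // Fail early if the SDP exchange was rejected (for example, an expired ephemeral key)
  if (!sdpResponse.ok) {
      throw new Error(`SDP exchange failed with status ${sdpResponse.status}`);
  }

  // Annotate the type so that "answer" is not widened to string
  const answer: RTCSessionDescriptionInit = {
      type: "answer",
      sdp: await sdpResponse.text(),
  };

  await pc.setRemoteDescription(answer);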

  dc.addEventListener("open", (e) => {
      console.log("Data channel open", e);
  });

  // Listen for server events
  dc.addEventListener("message", (e) => {
      const event = JSON.parse(e.data);

      if (event.type === "conversation.item.input_audio_transcription.completed") {
        console.log("Transcription:", event.transcript);
      }

  });

  dc.addEventListener("error", (e) => {
      console.log(e);
  });

}

Then just call connectWebRTC() to establish the connection.

For simplicity’s sake, I just copy-pasted the ephemeral key into the code. In a real application, that key should be provided by a server endpoint.
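
As a sketch of that last point, the client could request the key from your own backend before connecting. The /api/realtime-key path and the { value } response shape are hypothetical; they assume your server simply forwards the value field it got from client_secrets.

```typescript
// Minimal fetch-like signature (an assumption for this sketch); the browser's
// global fetch satisfies it.
type TokenFetch = (
  url: string,
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Hypothetical helper: asks your own server for a fresh ephemeral key.
async function fetchEphemeralKey(
  fetchImpl: TokenFetch,
  endpoint = "/api/realtime-key", // hypothetical route on your own backend
): Promise<string> {
  const response = await fetchImpl(endpoint);
  if (!response.ok) {
    throw new Error(`Key endpoint failed with status ${response.status}`);
  }
  const data = (await response.json()) as { value?: unknown };
  if (typeof data.value !== "string" || data.value.length === 0) {
    throw new Error("Key endpoint did not return a `value` string");
  }
  return data.value;
}
```

The result of fetchEphemeralKey(fetch) would then replace the hard-coded EPHEMERAL_KEY constant, for example by passing it to connectWebRTC as an argument.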

In the console I can see the transcription after I talk. It works pretty well, and it’s fast :)
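
If you also want partial text while the speaker is still talking, the message handler could be extended along these lines. This is only a sketch: the completed event and its transcript field are what I used above, while the delta event name and the item_id and delta fields are taken from my reading of the Realtime API reference, so treat them as assumptions. TranscriptCollector is a made-up helper name.

```typescript
// Only the fields this sketch reads; real server events carry more.
type ServerEvent = {
  type: string;
  item_id?: string;
  delta?: string;
  transcript?: string;
};

const DELTA = "conversation.item.input_audio_transcription.delta";
const COMPLETED = "conversation.item.input_audio_transcription.completed";

class TranscriptCollector {
  // Partial text per conversation item, built up from delta events.
  private partial = new Map<string, string>();
  // Final transcripts, in the order they were completed.
  readonly completed: string[] = [];

  // Takes the raw JSON string from the data channel. Returns the text to
  // display for this event, or null if the event is not a transcription event.
  handle(raw: string): string | null {
    const event = JSON.parse(raw) as ServerEvent;
    const id = event.item_id ?? "";
    if (event.type === DELTA) {
      const text = (this.partial.get(id) ?? "") + (event.delta ?? "");
      this.partial.set(id, text);
      return text;
    }
    if (event.type === COMPLETED) {
      this.partial.delete(id);
      const text = event.transcript ?? "";
      this.completed.push(text);
      return text;
    }
    return null;
  }
}
```

Inside the "message" listener this would be used as collector.handle(e.data), with the returned text shown in the UI instead of being logged.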

I based the example on this documentation: https://platform.openai.com/docs/guides/realtime-webrtc. However, that guide covers speech-to-speech implementations. I think it would be nice to have one for transcription only.

If you wish to continue using the Agents SDK, try the following.

  1. Create your ephemeral key the same way, i.e. with "type": "transcription".
  2. When creating the RealtimeSession instance, remove the options parameter (the one that accepts a RealtimeSessionOptions object).

The constructor call then simply becomes:

const session = new RealtimeSession(agent);

Hope this solves your issue.