Can I use the OpenAI Realtime API for Speech-to-Text?

I would like to create an app that does realtime (or near realtime) Speech-to-Text.

I tested with Whisper, but the response latency was quite high, and I also had to keep calling the API every few seconds.

So I found the OpenAI Realtime API, which might be a good option; I just don’t know if it offers Speech-to-Text functionality. Does anyone know?


Hi @frommars ! I’ve seen people apply Whisper in realtime, but it usually involves running Whisper locally (it’s an open-weight model!), and for best results it typically means invoking one of the “fast” Whisper variants.


I’m pretty sure the answer is no.

The transcription “done” data-channel messages are only for the model-generated responses. There is no message returned with a transcription of the audio input.

  1. response.audio_transcript.done: Fires after receiving the assistant’s response.
  2. conversation.item.input_audio_transcription.completed: Fires after transcribing user input.
  3. output_audio_buffer.audio_stopped: Fires when the assistant stops speaking.
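To make the list above concrete, here is a minimal sketch of a handler that dispatches on those event types. The `transcript` field name is assumed from the Realtime API event payloads, and `dataChannel` is a WebRTC data channel you have already opened:

```javascript
// Map each Realtime API server event to a readable line.
// Assumption: both transcript events carry a "transcript" field.
function describeRealtimeEvent(event) {
  switch (event.type) {
    case "response.audio_transcript.done":
      return `assistant transcript: ${event.transcript}`;
    case "conversation.item.input_audio_transcription.completed":
      return `user transcript: ${event.transcript}`;
    case "output_audio_buffer.audio_stopped":
      return "assistant stopped speaking";
    default:
      return `unhandled event: ${event.type}`;
  }
}

// Wire it to the data channel:
// dataChannel.addEventListener("message", (e) => {
//   console.log(describeRealtimeEvent(JSON.parse(e.data)));
// });
```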

Prerequisite:
You MUST apply this update to the session:

const updateSession = {
  type: "session.update",
  event_id: "message_004",
  session: {
    input_audio_transcription: {
      model: "whisper-1"
    }
  },
};

dataChannel.addEventListener("open", () => {
  sendClientEvent(updateSession);
});
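As a sanity check that the update was accepted, you can watch for the server’s acknowledgement; assuming the Realtime API answers a session.update with a session.updated event, something like this works:

```javascript
// Returns true when a raw data-channel message is the server's
// acknowledgement of a session.update (a "session.updated" event).
function isSessionUpdateAck(rawMessage) {
  const event = JSON.parse(rawMessage);
  return event.type === "session.updated";
}

// dataChannel.addEventListener("message", (e) => {
//   if (isSessionUpdateAck(e.data)) console.log("input transcription enabled");
// });
```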

I am using the Python SDK. Even though I set the input_audio_transcription parameter, I am not getting the event from the server. The model says it can’t provide the transcription.

Yeah, the OpenAI Realtime API allows you to transcribe your voice to text. Keep in mind that it uses the whisper-1 model behind the scenes for this.

Here you can see an example in Java, but you can extrapolate it to other contexts: