I would like to create an app that does realtime (or near realtime) Speech-to-Text.
I tested with Whisper, but the delay before the response came back was quite large, and I had to keep calling the API every few seconds.
So I found the OpenAI Realtime API, which might be a good option; I just don't know whether it supports Speech-to-Text functionality. Does anyone know?
Hi @frommars! I've seen people apply Whisper in realtime, but it usually involves running Whisper locally (it's an open-weight model!), and for best results it typically involves invoking one of the "fast" Whisper variants.
aza · January 25, 2025, 7:52pm
I’m pretty sure the answer is no.
The transcription "done" data-channel messages are only for model-generated responses. There is no message returned with a transcription of the audio input.
response.audio_transcript.done: Fires when the transcript of the assistant's response is complete.
conversation.item.input_audio_transcription.completed: Fires after the user's audio input has been transcribed.
output_audio_buffer.audio_stopped: Fires when the assistant stops speaking.
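To make the event list above concrete, here is a minimal sketch of a handler that dispatches on these event types as they arrive over the data channel. This assumes the usual WebRTC setup where server events arrive as JSON strings; the handler itself is hypothetical, not part of any SDK:

```javascript
// Sketch: dispatch Realtime API server events received on the data channel.
// `raw` is the JSON string payload of a data-channel message.
function handleServerEvent(raw) {
  const event = JSON.parse(raw);
  switch (event.type) {
    case "response.audio_transcript.done":
      // Transcript of the assistant's spoken response.
      console.log("Assistant said:", event.transcript);
      break;
    case "conversation.item.input_audio_transcription.completed":
      // Transcript of the user's speech; only arrives if
      // input_audio_transcription was enabled via session.update.
      console.log("User said:", event.transcript);
      break;
    case "output_audio_buffer.audio_stopped":
      console.log("Assistant finished speaking.");
      break;
  }
  return event.type;
}
```

Wired up, this would be `dataChannel.addEventListener("message", (e) => handleServerEvent(e.data));`.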
Prerequisite: you MUST apply this update to the session:
const updateSession = {
  type: "session.update",
  event_id: "message_004",
  session: {
    input_audio_transcription: {
      model: "whisper-1",
    },
  },
};

dataChannel.addEventListener("open", () => {
  sendClientEvent(updateSession);
});
I am using the Python SDK; even though I set the input_audio_transcription parameter, I am not getting the event from the server. The model says it can't provide the transcription.
Yeah, the OpenAI Realtime API allows you to transcribe your voice to text. Note that it uses the whisper-1 model behind the scenes for this.
Here you can see an example in Java, but you can extrapolate it to another context:
I have read several posts on this forum about problems with the Realtime API, especially these two points:
Audio cuts off at the end of each AI response.
No audio transcript is received from the user.
I want to share how I overcame these problems. A caveat: my experience is based on Q&A scenarios, with Java on the backend.
Audio cuts off at the end of each AI response
After you finish speaking, you send a response.create request, and then the AI sends audio fragments …
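As a sketch of the step just described, the client can commit the input audio buffer and then request a response. This assumes server-side turn detection is disabled so the client controls turn-taking, and it reuses a `sendClientEvent` helper like the one in the earlier session.update example (the helper itself is an assumption):

```javascript
// Sketch: after the user stops speaking, commit the captured audio
// and ask the model to generate a response.
// `sendClientEvent` is assumed to JSON-encode an event and send it
// over the data channel.
function requestResponse(sendClientEvent) {
  sendClientEvent({ type: "input_audio_buffer.commit" });
  sendClientEvent({ type: "response.create" });
}
```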