First, I followed the OpenAI docs and successfully implemented the gpt-realtime conversation (using WebRTC).
Next, I am trying to implement transcription with the Realtime API.
What I want is: 1. talk into the mic, 2. the text appears in the UI as-is, 3. gpt-4o-transcribe does whatever is in its prompt to post-process.
Following the docs, I keep getting errors back from the API, and it seems the docs and the actual behavior disagree. Quite possibly I could be doing something wrong as well.
Currently, I get a first transcript, but it is not what I said and it is in Korean or Japanese. After that, transcription stops.
sendSessionUpdate()
private sendSessionUpdate(): void {
  if (!this.websocket || this.websocket.readyState !== WebSocket.OPEN) {
    return;
  }
  const configMessage = {
    type: 'transcription_session.update', // ❌ API rejects this
    input_audio_format: 'pcm16',
    input_audio_transcription: {
      model: 'gpt-4o-transcribe',
      prompt: '',
      language: 'en'
    },
    turn_detection: {
      type: 'server_vad',
      threshold: this.config.turn_threshold,
      prefix_padding_ms: this.config.turn_prefix_padding_ms,
      silence_duration_ms: this.config.turn_silence_duration_ms
    },
    input_audio_noise_reduction: {
      type: this.config.noise_reduction
    },
    include: ['item.input_audio_transcription.logprobs']
  };
  console.log('Sending configuration:', configMessage);
  this.websocket.send(JSON.stringify(configMessage));
}
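In case it helps to show what I am guessing at: this is the variant I would try inside sendSessionUpdate() if the answer turns out to be "use session.update with a transcription-type session". The session.type: 'transcription' value, the audio.input nesting, and the format object are assumptions I pieced together from the GA session shape in my token-generation code below; I have not confirmed the API accepts this for intent=transcription:

// Guess: replace configMessage above with a session.update carrying a transcription session.
const altConfigMessage = {
  type: 'session.update',
  session: {
    type: 'transcription', // guess: transcription-only session instead of 'realtime'
    audio: {
      input: {
        format: { type: 'audio/pcm', rate: 24000 }, // guess at the GA audio format object
        transcription: {
          model: 'gpt-4o-transcribe',
          prompt: '',
          language: 'en'
        },
        turn_detection: {
          type: 'server_vad',
          threshold: this.config.turn_threshold,
          prefix_padding_ms: this.config.turn_prefix_padding_ms,
          silence_duration_ms: this.config.turn_silence_duration_ms
        },
        noise_reduction: { type: this.config.noise_reduction }
      }
    }
  }
};
this.websocket.send(JSON.stringify(altConfigMessage));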
websocket:
const subprotocols = [
  "realtime",
  `openai-insecure-api-key.${clientSecret}`
];
this.websocket = new WebSocket("wss://api.openai.com/v1/realtime?intent=transcription", subprotocols);
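For completeness, this is how I am consuming messages on the client. The input_audio_transcription delta/completed event names are what I took from the docs (please correct me if they are wrong for intent=transcription), and appendToUi / finalizeUiSegment are just placeholders for my UI code:

this.websocket.addEventListener('message', (event: MessageEvent) => {
  const msg = JSON.parse(event.data as string);
  switch (msg.type) {
    case 'conversation.item.input_audio_transcription.delta':
      // Streamed partial transcript for the current speech segment.
      this.appendToUi(msg.delta);
      break;
    case 'conversation.item.input_audio_transcription.completed':
      // Final transcript for the segment.
      this.finalizeUiSegment(msg.transcript);
      break;
    case 'error':
      console.error('Realtime API error:', msg.error);
      break;
  }
});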
server-side token gen:
const sessionConfig = {
  session: {
    type: "realtime",
    model: "gpt-realtime", // This works
    instructions: "You are a transcription assistant. Only transcribe the user's speech accurately. Do not generate any responses or additional text. Only output the exact words spoken.",
    output_modalities: ["text"],
    audio: {
      input: {
        transcription: {
          model: "gpt-4o-transcribe" // This works
        }
      }
    }
  }
};
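I then mint the ephemeral client secret roughly like this. The /v1/realtime/client_secrets path and the value field in the response are my reading of the GA docs, so treat them as assumptions:

// Server-side: exchange the standard API key for an ephemeral client secret.
async function createClientSecret(): Promise<string> {
  const res = await fetch("https://api.openai.com/v1/realtime/client_secrets", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json"
    },
    body: JSON.stringify(sessionConfig)
  });
  if (!res.ok) {
    throw new Error(`Token generation failed: ${res.status} ${await res.text()}`);
  }
  const data = await res.json();
  return data.value; // passed to the browser as clientSecret
}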
1. Is transcription_session.update actually supported? The API says no, but the docs say yes.
2. How do we configure transcription-only mode? We want continuous transcription, not a conversation.
3. What is the correct message format? Should we use session.update with different parameters?
4. Is there a different endpoint or approach? Maybe we need a different URL or authentication method?
5. Why does it switch to response generation? After one transcription, it stops listening and starts generating responses.
Any help would be greatly appreciated!

