Realtime transcription issue

I’m trying to transcribe audio using a WebSocket connection. The transcription session is successfully created, but I am not receiving the transcription text. Could you please guide me in resolving this issue?

this.ws = new WebSocket(`wss://api.openai.com/v1/realtime?intent=transcription`, [
    "realtime",
    `openai-insecure-api-key.${token}`,
    "openai-beta.realtime-v1"
]);

this.ws.onopen = () => {
    console.log('Connected to OpenAI realtime API');
    // Send configuration once connected
};

this.ws.onmessage = (event: MessageEvent) => {
    console.log(event);
};

audioWorkletNode.port.onmessage = (event) => {
    if (!this.ws || this.ws.readyState !== WebSocket.OPEN) return;

    const inputData = event.data.audio_data;
    // console.log(inputData)
    if (!inputData || inputData.length === 0) {
        console.warn('Received empty audio data');
        return;
    }

    const currentBuffer = new Int16Array(event.data.audio_data);

    // Merge the new chunk into the pending audio queue
    audioBufferQueue = this.mergeBuffers(
        audioBufferQueue,
        currentBuffer
    );
    const bufferDuration =
        (audioBufferQueue.length / this.transcriptionContext.sampleRate) * 1000;

    // Wait until we have 100ms of audio data
    if (bufferDuration >= 100) {
        const totalSamples = Math.floor(this.transcriptionContext.sampleRate * 0.1);

        // Extract the portion we want to send
        const dataToSend = audioBufferQueue.subarray(0, totalSamples);

        // Encode the Int16Array to base64
        const base64Audio = this.encodeInt16ArrayToBase64(dataToSend);

        // Update our queue to remove the sent data
        audioBufferQueue = audioBufferQueue.subarray(totalSamples);
        // Convert to the format OpenAI expects (16-bit PCM)
        // const audioBuffer = this.floatTo16BitPCM(finalBuffer);
        // const base64Audio = this.arrayBufferToBase64(audioBuffer);

        // Send the audio data to OpenAI
        this.ws.send(JSON.stringify({
            type: 'input_audio_buffer.append',
            audio: base64Audio
        }));
        // this.ws.send(JSON.stringify({
        //     type: 'response.create',
        // }));
    }
};

Here I have attached a screenshot of the log as well. I couldn't update the session to use the gpt-4o-mini-transcribe model. I want to use this feature on a production site. Could you please guide me to resolve this issue?
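
For reference, this is roughly the transcription_session.update event I've been trying to send from the onopen handler (the model name and VAD values here are just what I've been testing with, not a confirmed working configuration):

this.ws.onopen = () => {
    console.log('Connected to OpenAI realtime API');
    // Attempted session configuration (not yet working for me)
    this.ws?.send(JSON.stringify({
        type: "transcription_session.update",
        session: {
            input_audio_transcription: {
                model: "gpt-4o-mini-transcribe",
                language: "en"
            },
            turn_detection: {
                type: "server_vad",
                silence_duration_ms: 800
            }
        }
    }));
};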


I have successfully connected the transcription session and got the transcript. I’m encountering the following issues:

OpenAI Transcription automatically detects the spoken language and returns the transcript accordingly. However, during testing, I noticed that even when speaking in English, it sometimes detects the language incorrectly (for example, as Spanish or French) and returns the transcript in that language. Additionally, in some cases, the transcript is incomplete or gets cut off.

I have used server_vad for voice activity detection. Can anyone guide me on how to resolve this?


Hi Nathiya, here's my functional code, hope it helps!

import os
import json
import base64
import asyncio
import logging
import aiohttp
import websockets
from dotenv import load_dotenv

load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
if not OPENAI_API_KEY:
    raise ValueError("Missing OpenAI API key.")

logging.basicConfig(level=logging.DEBUG, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger(__name__)

final_transcription = ""

async def create_transcription_session():
    """
    Create a transcription session via the REST API to obtain an ephemeral token.
    This endpoint uses the beta header "OpenAI-Beta: assistants=v2".
    """
    url = "https://api.openai.com/v1/realtime/transcription_sessions"
    payload = {
        "input_audio_format": "g711_ulaw",
        "input_audio_transcription": {
            "model": "gpt-4o-transcribe",
            "language": "en",
            "prompt": "Transcribe the incoming audio in real time."
        },
    
        "turn_detection": {"type": "server_vad", "silence_duration_ms": 1000}
    }
    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "Content-Type": "application/json",
        "OpenAI-Beta": "assistants=v2"
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(url, json=payload, headers=headers) as resp:
            if resp.status != 200:
                text = await resp.text()
                raise Exception(f"Failed to create transcription session: {resp.status} {text}")
            data = await resp.json()
            ephemeral_token = data["client_secret"]["value"]
            logger.info("Transcription session created; ephemeral token obtained.")
            return ephemeral_token

async def send_audio(ws, file_path: str, chunk_size: int, speech_stopped_event: asyncio.Event):
    """
    Read the local ulaw file and send it in chunks.
    After finishing, wait for 1 second to see if the server auto-commits.
    If not, send a commit event manually.
    """
    try:
        with open(file_path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                # Base64-encode the audio chunk.
                audio_chunk = base64.b64encode(chunk).decode("utf-8")
                audio_event = {
                    "type": "input_audio_buffer.append",
                    "audio": audio_chunk
                }
                await ws.send(json.dumps(audio_event))
                await asyncio.sleep(0.02)  # simulate real-time streaming
        logger.info("Finished sending audio file.")

        # Wait 1 second to allow any late VAD events before committing.
        try:
            await asyncio.wait_for(speech_stopped_event.wait(), timeout=1.0)
            logger.debug("Speech stopped event received; no manual commit needed.")
        except asyncio.TimeoutError:
            commit_event = {"type": "input_audio_buffer.commit"}
            await ws.send(json.dumps(commit_event))
            logger.info("Manually sent input_audio_buffer.commit event.")
    except FileNotFoundError:
        logger.error(f"Audio file not found: {file_path}")
    except Exception as e:
        logger.error("Error sending audio: %s", e)

async def receive_events(ws, speech_stopped_event: asyncio.Event):
    """
    Listen for events from the realtime endpoint.
    Capture transcription deltas and the final complete transcription.
    Set the speech_stopped_event when a "speech_stopped" event is received.
    """
    global final_transcription
    try:
        async for message in ws:
            try:
                event = json.loads(message)
                event_type = event.get("type")
                if event_type == "input_audio_buffer.speech_stopped":
                    logger.debug("Received event: input_audio_buffer.speech_stopped")
                    speech_stopped_event.set()
                elif event_type == "conversation.item.input_audio_transcription.delta":
                    delta = event.get("delta", "")
                    logger.info("Transcription delta: %s", delta)
                    final_transcription += delta
                elif event_type == "conversation.item.input_audio_transcription.completed":
                    completed_text = event.get("transcript", "")
                    logger.info("Final transcription completed: %s", completed_text)
                    final_transcription = completed_text  # Use the completed transcript
                    break  # Exit after final transcription
                elif event_type == "error":
                    logger.error("Error event: %s", event.get("error"))
                else:
                    logger.debug("Received event: %s", event_type)
            except Exception as ex:
                logger.error("Error processing message: %s", ex)
    except Exception as e:
        logger.error("Error receiving events: %s", e)

async def test_transcription():
    try:
        # Step 1: Create transcription session and get ephemeral token.
        ephemeral_token = await create_transcription_session()

        # Step 2: Connect to the base realtime endpoint.
        websocket_url = "wss://api.openai.com/v1/realtime"
        connection_headers = {
            "Authorization": f"Bearer {ephemeral_token}",
            "OpenAI-Beta": "realtime=v1"
        }
        async with websockets.connect(websocket_url, additional_headers=connection_headers) as ws:
            logger.info("Connected to realtime endpoint.")

            # Step 3: Send transcription session update event with adjusted VAD settings.
            update_event = {
                "type": "transcription_session.update",
                "session": {
                    "input_audio_transcription": {
                        "model": "gpt-4o-transcribe",
                        "language": "en",
                        "prompt": "Transcribe the incoming audio in real time."
                    },
                    # Matching the REST API settings
                    "turn_detection": {"type": "server_vad", "silence_duration_ms": 1000}
                }
            }
            await ws.send(json.dumps(update_event))
            logger.info("Sent transcription session update event.")

            # Create an event to signal if speech stopped is detected.
            speech_stopped_event = asyncio.Event()

            # Step 4: Run sender and receiver concurrently.
            sender_task = asyncio.create_task(send_audio(ws, "static/Welcome.ulaw", 1024, speech_stopped_event))
            receiver_task = asyncio.create_task(receive_events(ws, speech_stopped_event))
            await asyncio.gather(sender_task, receiver_task)

            # Print the final transcription.
            logger.info("Final complete transcription: %s", final_transcription)
            print("Final complete transcription:")
            print(final_transcription)

    except Exception as e:
        logger.error("Error in transcription test: %s", e)

if __name__ == "__main__":
    asyncio.run(test_transcription())

// Create a WebSocket connection
this.ws = new WebSocket(`wss://api.openai.com/v1/realtime?intent=transcription`, [
    "realtime",
    `openai-insecure-api-key.${token}`,
    "openai-beta.realtime-v1"
]);

// Handle the WebSocket connection being established
this.ws.onopen = () => {
    console.log('Connected to the OpenAI realtime API');

    // Read any cached messages from localStorage and send them
    let cachedMessages = JSON.parse(localStorage.getItem('cachedMessages') || '[]');
    cachedMessages.forEach(msg => this.ws.send(msg));

    // Clear the cache after sending
    localStorage.removeItem('cachedMessages');
};

// Handle incoming WebSocket messages
this.ws.onmessage = (event) => {
    console.log(event);
};

// Handle the WebSocket connection being closed
this.ws.onclose = () => {
    console.log('WebSocket connection closed');

    // Record the disconnect time to avoid frequent reconnections
    localStorage.setItem('lastDisconnect', Date.now());
};

// Handle WebSocket errors
this.ws.onerror = (error) => {
    console.error('WebSocket error:', error);
};

// Audio data processing
audioWorkletNode.port.onmessage = (event) => {
    if (!this.ws || this.ws.readyState !== WebSocket.OPEN) return;

    const inputData = event.data.audio_data;
    if (!inputData || inputData.length === 0) {
        console.warn('Received empty audio data');
        return;
    }

    const currentBuffer = new Int16Array(inputData);

    // Merge audio data
    audioBufferQueue = this.mergeBuffers(audioBufferQueue, currentBuffer);
    const bufferDuration = (audioBufferQueue.length / this.transcriptionContext.sampleRate) * 1000;

    // Make sure the audio data is at least 100 milliseconds
    if (bufferDuration >= 100) {
        const totalSamples = Math.floor(this.transcriptionContext.sampleRate * 0.1);

        // Extract audio data to be sent
        const dataToSend = audioBufferQueue.subarray(0, totalSamples);

        // Encode audio data to base64
        const base64Audio = this.encodeInt16ArrayToBase64(dataToSend);

        // Update the queue and remove sent data
        audioBufferQueue = audioBufferQueue.subarray(totalSamples);

        // Send audio data to OpenAI
        this.ws.send(JSON.stringify({
            type: 'input_audio_buffer.append',
            audio: base64Audio
        }));
    }
};

// Use an HTTP request to send data when the WebSocket is disconnected
function sendMessage(data) {
    if (this.ws.readyState === WebSocket.OPEN) {
        this.ws.send(data);
    } else {
        console.warn('WebSocket disconnected, using HTTP to send data');

        // Cache the data, waiting for the WebSocket to recover
        let cachedMessages = JSON.parse(localStorage.getItem('cachedMessages') || '[]');
        cachedMessages.push(data);
        localStorage.setItem('cachedMessages', JSON.stringify(cachedMessages));

        // Use fetch to send the data
        fetch('https://example.com/api/fallback', {
            method: 'POST',
            body: JSON.stringify(data),
            headers: { 'Content-Type': 'application/json' }
        }).then(response => response.json())
            .then(result => console.log('HTTP backup transfer succeeded:', result))
            .catch(error => console.error('HTTP transfer failed:', error));
    }
}

// Use sendBeacon to send the last data when the page is closed
window.addEventListener('beforeunload', () => {
    let cachedMessages = localStorage.getItem('cachedMessages');
    if (cachedMessages) {
        console.log('Sending remaining data before page unload');
        navigator.sendBeacon('https://example.com/api/analytics', cachedMessages);
    }
});

WebSocket Connection Management:

We establish the WebSocket connection as in the original code and listen for the connection's onopen, onmessage, onclose and onerror events.

In the onopen handler, we read any cached messages from localStorage, send them, and clear the cache once the connection is restored.

Audio data transmission:

We use audioWorkletNode.port.onmessage to receive audio data from the worklet. When audio arrives, we wait until the buffered audio is at least 100ms long before sending it.

Once there is enough data, the audio is encoded to Base64 and sent over the WebSocket.

Using HTTP transport when the WebSocket is disconnected:

When the WebSocket is disconnected, we use fetch to send the audio data to a fallback server over HTTP. The data is also cached so it can be re-sent once the WebSocket is restored.

This ensures that data is not lost while the WebSocket connection is down.

Sending the last data when the page is closed:

We use navigator.sendBeacon to send any cached message data when the page is unloaded, so the final data still gets sent.

Key points:
Caching mechanism: unsent data is cached in localStorage so it can be sent once the WebSocket recovers.

Backup mechanism: when the WebSocket is disconnected, HTTP requests serve as the backup transport to keep data delivery reliable.

Guarantees on page unload: sendBeacon ensures that data can still be sent even when the page is closed.

Hi Tony, thanks for your reply. I tested the sample code you provided. It works well, but I'm trying to do real-time transcription. Here is my code.

interface TranscriberCallbacks {
    onInterimTranscript: (text: string) => void;
    onFinalTranscript: (text: string) => void;
    onError: (error: string) => void;
}

export default class OpenAITranscriber {
    private ws: WebSocket | null = null;
    private transcriptionContext: AudioContext | null = null;
    private workletNode: AudioWorkletNode | null = null;
    private isManualStop: boolean = false;
    private sessionTimeout: number;
    private userStream: MediaStream;
    private onInterimTranscript: (text: string) => void;
    private onFinalTranscript: (text: string) => void;
    private onError: (error: string) => void;
    private reconnectAttempts = 0;
    private maxReconnectAttempts = 3;


    constructor(
        sessionTimeout: number,
        userStream: MediaStream,
        callbacks: TranscriberCallbacks
    ) {
        this.sessionTimeout = sessionTimeout;
        this.userStream = userStream;
        this.onInterimTranscript = callbacks.onInterimTranscript;
        this.onFinalTranscript = callbacks.onFinalTranscript;
        this.onError = callbacks.onError;
    }

    private async fetchAccessToken(): Promise<string> {
        try {
            const expiresIn = Math.floor(this.sessionTimeout / 1000); // Convert to seconds
            const response = await fetch('/api/openai-token', {
                method: 'POST',
                headers: {
                    'Content-Type': 'application/json',
                },
                body: JSON.stringify({
                    expiresIn,
                }),
            });

            if (response.ok) {
                const data = await response.json();
                console.log(data);
                return data.client_secret.value;
            }
            else {
                const data = await response.text();
                throw new Error(`${JSON.parse(data).error}`);
            }
        } catch (error) {
            console.error(`Error fetching access token for STT: ${error}`);
            throw error;
        }
    }

    public async start(): Promise<void> {
        if (this.ws) {
            console.warn('Transcription is already in progress');
            return;
        }

        try {
            const token = await this.fetchAccessToken();
            if (!token) throw new Error('Failed to fetch access token');

            // Connect to OpenAI's realtime API
            this.ws = new WebSocket(`wss://api.openai.com/v1/realtime?intent=transcription`, [
                "realtime",
                `openai-insecure-api-key.${token}`,
                "openai-beta.realtime-v1"
            ]);


            this.ws.onopen = () => {
                console.log('Connected to OpenAI realtime API');
                this.reconnectAttempts = 0;
                if (this.ws?.readyState === 1) {
                    this.ws.send(JSON.stringify({
                        type: "transcription_session.update",
                        session: {
                            input_audio_transcription: {
                                model: "gpt-4o-mini-transcribe",
                                language: "en",
                                // prompt: "Transcribe the user's speech and translate it into English accurately while maintaining the original meaning and correct grammar."
                            },
                            turn_detection: {
                                prefix_padding_ms: 600,
                                silence_duration_ms: 800,
                                type: "server_vad",
                                threshold: 0.5
                            }
                        },
                    }));
                }

            };

            this.ws.onmessage = (event: MessageEvent) => {
                const data = JSON.parse(event.data);
                this.handleMessage(data);
            };

            this.ws.onerror = (error: Event) => {
                console.error('WebSocket error:', error);
                this.onError(`Connection error: ${error}`);

                this.stop();
                this.reconnect();
            };

            this.ws.onclose = () => {
                console.log('WebSocket connection closed');
                if (!this.isManualStop) {
                    this.stop();
                    this.reconnect();
                }
            };

            await this.initAudioProcessing();

        } catch (error) {
            console.error('Error starting transcription:', error);
            this.onError(error instanceof Error ? error.message : 'Unknown error');
            this.stop();
        }
    }

    private handleMessage(data: any): void {
        switch (data.type) {
            case 'transcription_session.created':
                if (data?.session)
                    console.log("[transcription_session.created]", "Audio format: ", data.session.input_audio_format, "Expires at: ", data.session.expires_at, "Silence duration in ms: ", data.session.turn_detection.silence_duration_ms);
                break;

            case 'transcription_session.updated':
                if (data?.session)
                    console.log("[transcription_session.updated]", "Audio format: ", data.session.input_audio_format, "Expires at: ", data.session.expires_at, "Model: ", data.session.input_audio_transcription.model, "Silence duration in ms: ", data.session.turn_detection.silence_duration_ms);
                break;

            case 'conversation.item.created':
                if (data?.item)
                    console.log("[conversation.item.created]", "Conversation Status: ", data.item.status);
                break;

            case 'conversation.item.input_audio_transcription.delta':
                if (data?.delta) {
                    console.log("[conversation.item.input_audio_transcription.delta]", "Delta: ", data.delta);
                    // this.onInterimTranscript(data.delta);
                }
                break;

            case 'conversation.item.input_audio_transcription.completed':
                if (data?.transcript) {
                    console.log("[conversation.item.input_audio_transcription.completed]", "Transcript: ", data.transcript);
                    this.onFinalTranscript(data.transcript);
                }
                break;

            case 'error':
                if (data?.error) {
                    console.log("[error]", "Type: ", data.error.type, "Code: ", data.error.code, "Error: ", data.error.message);
                    this.onError(data.error.message);
                }
                break;

            default:
                console.log('Unhandled message type:', data.type);
        }
    }

    private async initAudioProcessing(): Promise<void> {
        try {
            this.transcriptionContext = new AudioContext({
                sampleRate: 24000,
                latencyHint: 'balanced'
            });
            const source = this.transcriptionContext.createMediaStreamSource(this.userStream);

            await this.transcriptionContext.audioWorklet.addModule('audio-processor.js');
            this.workletNode = new AudioWorkletNode(this.transcriptionContext, 'audio-processor');

            source.connect(this.workletNode);
            this.workletNode.connect(this.transcriptionContext.destination);

            let audioBufferQueue = new Int16Array(0);
            // Process audio data
            this.workletNode.port.onmessage = (event) => {

                if (!this.ws || this.ws.readyState !== WebSocket.OPEN) return;

                const currentBuffer = new Int16Array(event.data.audio_data);

                // Use type assertion to assure TypeScript this is compatible
                audioBufferQueue = this.mergeBuffers(
                    audioBufferQueue,
                    currentBuffer
                );
                const bufferDuration =
                    (audioBufferQueue.length / this.transcriptionContext.sampleRate) * 1000;

                // wait until we have 100ms of audio data
                if (bufferDuration >= 100) {
                    const totalSamples = Math.floor(this.transcriptionContext.sampleRate * 0.1);

                    // Extract the portion we want to send
                    const finalBuffer = audioBufferQueue.subarray(0, totalSamples);

                    // Update our queue to remove the sent data
                    audioBufferQueue = audioBufferQueue.subarray(totalSamples);

                    // Encode the Int16Array to base64
                    const base64Audio1 = this.encodeInt16ArrayToBase64(finalBuffer);

                    // Send the audio data to OpenAI
                    this.ws.send(JSON.stringify({
                        type: 'input_audio_buffer.append',
                        audio: base64Audio1
                    }));
                }
            };
        } catch (error) {
            console.error('Error initializing audio processing:', error);
            throw error;
        }
    }

    private mergeBuffers(lhs: Int16Array, rhs: Int16Array): Int16Array {
        const mergedBuffer = new Int16Array(lhs.length + rhs.length);
        mergedBuffer.set(lhs, 0);
        mergedBuffer.set(rhs, lhs.length);
        return mergedBuffer;
    }

    private encodeInt16ArrayToBase64(int16Array: Int16Array): string {
        // Directly use the Int16Array's underlying buffer
        const bytes = new Uint8Array(int16Array.buffer);

        // Chunked processing for large arrays (avoids call stack limits)
        const chunkSize = 0x8000; // 32KB chunks
        let binary = '';

        for (let i = 0; i < bytes.length; i += chunkSize) {
            const chunk = bytes.subarray(i, Math.min(i + chunkSize, bytes.length));
            binary += String.fromCharCode(...chunk); // Spread operator
        }

        return btoa(binary);
    }

    public async stop(): Promise<void> {
        this.isManualStop = true; // Set the manual stop flag

        if (this.ws) {
            this.ws.close();
            this.ws = null;
        }

        if (this.transcriptionContext) {
            await this.transcriptionContext.close();
            this.transcriptionContext = null;
        }

        if (this.workletNode) {
            this.workletNode.port.onmessage = null;
            this.workletNode.disconnect();
            this.workletNode = null;
        }
    }

    private async reconnect() {
        if (this.reconnectAttempts >= this.maxReconnectAttempts) {
            console.log('Max reconnect attempts reached. Stopping further attempts.');
            return; // Stop further reconnection attempts
        }
        console.log(`Attempting to reconnect... (${this.reconnectAttempts + 1}/${this.maxReconnectAttempts})`);
        this.isManualStop = false; // Reset the manual stop flag
        this.reconnectAttempts++; // Increment attempt counter
        try {
            await this.start();
        } catch (error) {
            console.error(`Reconnect attempt ${this.reconnectAttempts} failed:`, error);
        }
    }
}
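
In case it's useful context, the /api/openai-token route the class calls above is only expected to create a transcription session server-side and return its JSON (including client_secret.value). A rough sketch of such a route, assuming Node 18+ with Express and mirroring the Python session-creation example earlier in this thread, could look like this (the expiresIn field sent by the client is ignored here):

import express from 'express';

const app = express();
app.use(express.json());

// Hypothetical token-minting route: creates a transcription session with the
// standard API key and hands the ephemeral client secret back to the browser.
app.post('/api/openai-token', async (_req, res) => {
    const response = await fetch('https://api.openai.com/v1/realtime/transcription_sessions', {
        method: 'POST',
        headers: {
            'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
            'Content-Type': 'application/json',
            'OpenAI-Beta': 'assistants=v2'
        },
        body: JSON.stringify({
            input_audio_transcription: { model: 'gpt-4o-mini-transcribe', language: 'en' },
            turn_detection: { type: 'server_vad', silence_duration_ms: 800 }
        })
    });
    if (!response.ok) {
        res.status(response.status).send(await response.text());
        return;
    }
    const data = await response.json();
    res.json(data); // the client reads data.client_secret.value
});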

I’m developing a customer support application that requires real-time transcription. The code is functioning correctly, and I have successfully converted the microphone stream into base64. However, I’m facing an issue where the language detection is inconsistent. Even when speaking in English, it sometimes detects a different language—such as Spanish or French—and returns the transcript accordingly. Additionally, in some cases, the transcription is incomplete or gets cut off. Once again thanks for your reply.

Hi,

Thanks for your reply. I have shared my code and the issues. Could you please check it?


The correct URL is "wss://api.openai.com/v1/realtime?intent=transcription" (you missed intent=transcription).

I have used the correct URL. Please check my code.

Sorry, I was replying to @TonyStark; you can see it at the top right of my message (the updated UI is very bad).

Optimized WebSocket-Based OpenAI Real-Time Transcription

Overview

This code provides an optimized implementation of a WebSocket-based OpenAI real-time transcription system. The original version had a few issues, including inconsistent language detection, incomplete transcripts, and redundant reconnection logic. This refined version ensures:

  • Stable WebSocket Connection: Properly handles errors and prevents excessive reconnection attempts.
  • Accurate Audio Processing: Efficiently encodes and sends audio data.
  • Error Handling: Avoids redundant reconnection while maintaining stability.

Key Improvements

  1. Fixed Redundant Reconnection Issue
  • Previously, multiple layers of reconnection logic caused infinite reconnection loops.
  • Now, reconnection attempts are capped, preventing unnecessary connections.
  2. Improved Language Detection Stability
  • Instead of relying on automatic detection, the language is explicitly set to English (language: "en").
  • This prevents OpenAI from incorrectly recognizing other languages when the user is speaking English.
  3. Optimized Audio Buffer Handling
  • The previous implementation sometimes sent incomplete audio buffers.
  • Now, audio chunks are sent only when they contain sufficient data (100ms threshold).

Refined Code

interface TranscriberCallbacks {
    onInterimTranscript: (text: string) => void;
    onFinalTranscript: (text: string) => void;
    onError: (error: string) => void;
}

export default class OpenAITranscriber {
    private ws: WebSocket | null = null;
    private transcriptionContext: AudioContext | null = null;
    private workletNode: AudioWorkletNode | null = null;
    private isManualStop: boolean = false;
    private sessionTimeout: number;
    private userStream: MediaStream;
    private onInterimTranscript: (text: string) => void;
    private onFinalTranscript: (text: string) => void;
    private onError: (error: string) => void;
    private reconnectAttempts = 0;
    private maxReconnectAttempts = 3;

    constructor(sessionTimeout: number, userStream: MediaStream, callbacks: TranscriberCallbacks) {
        this.sessionTimeout = sessionTimeout;
        this.userStream = userStream;
        this.onInterimTranscript = callbacks.onInterimTranscript;
        this.onFinalTranscript = callbacks.onFinalTranscript;
        this.onError = callbacks.onError;
    }

    private async fetchAccessToken(): Promise<string> {
        try {
            const response = await fetch('/api/openai-token', {
                method: 'POST',
                headers: { 'Content-Type': 'application/json' },
                body: JSON.stringify({ expiresIn: Math.floor(this.sessionTimeout / 1000) })
            });
            if (!response.ok) throw new Error(await response.text());
            const data = await response.json();
            return data.client_secret.value;
        } catch (error) {
            throw new Error(`Token fetch failed: ${error}`);
        }
    }

    public async start(): Promise<void> {
        if (this.ws) return;

        try {
            const token = await this.fetchAccessToken();
            this.ws = new WebSocket(`wss://api.openai.com/v1/realtime?intent=transcription`, [
                "realtime", `openai-insecure-api-key.${token}`, "openai-beta.realtime-v1"
            ]);

            this.ws.onopen = () => {
                this.reconnectAttempts = 0;
                this.ws?.send(JSON.stringify({
                    type: "transcription_session.update",
                    session: {
                        input_audio_transcription: { model: "gpt-4o-mini-transcribe", language: "en" },
                        turn_detection: { prefix_padding_ms: 600, silence_duration_ms: 800, type: "server_vad", threshold: 0.5 }
                    }
                }));
            };

            this.ws.onmessage = (event) => this.handleMessage(JSON.parse(event.data));
            this.ws.onerror = () => this.handleError("WebSocket error");
            this.ws.onclose = () => { if (!this.isManualStop) this.reconnect(); };

            await this.initAudioProcessing();
        } catch (error) {
            this.onError(error instanceof Error ? error.message : 'Unknown error');
            this.stop();
        }
    }

    private handleMessage(data: any): void {
        if (data.type === 'conversation.item.input_audio_transcription.completed') {
            this.onFinalTranscript(data.transcript);
        } else if (data.type === 'error') {
            this.onError(data.error.message);
        }
    }

    private async initAudioProcessing(): Promise<void> {
        this.transcriptionContext = new AudioContext({ sampleRate: 24000 });
        const source = this.transcriptionContext.createMediaStreamSource(this.userStream);
        await this.transcriptionContext.audioWorklet.addModule('audio-processor.js');
        this.workletNode = new AudioWorkletNode(this.transcriptionContext, 'audio-processor');
        source.connect(this.workletNode);
        this.workletNode.connect(this.transcriptionContext.destination);
        let audioBufferQueue = new Int16Array(0);
        this.workletNode.port.onmessage = (event) => {
            if (!this.ws || this.ws.readyState !== WebSocket.OPEN) return;
            const currentBuffer = new Int16Array(event.data.audio_data);
            audioBufferQueue = this.mergeBuffers(audioBufferQueue, currentBuffer);
            if ((audioBufferQueue.length / this.transcriptionContext.sampleRate) * 1000 >= 100) {
                const totalSamples = Math.floor(this.transcriptionContext.sampleRate * 0.1);
                const finalBuffer = audioBufferQueue.subarray(0, totalSamples);
                audioBufferQueue = audioBufferQueue.subarray(totalSamples);
                this.ws.send(JSON.stringify({ type: 'input_audio_buffer.append', audio: this.encodeInt16ArrayToBase64(finalBuffer) }));
            }
        };
    }

    private mergeBuffers(lhs: Int16Array, rhs: Int16Array): Int16Array {
        const mergedBuffer = new Int16Array(lhs.length + rhs.length);
        mergedBuffer.set(lhs, 0);
        mergedBuffer.set(rhs, lhs.length);
        return mergedBuffer;
    }

    private encodeInt16ArrayToBase64(int16Array: Int16Array): string {
        return btoa(String.fromCharCode(...new Uint8Array(int16Array.buffer)));
    }

    public async stop(): Promise<void> {
        this.isManualStop = true;
        this.ws?.close();
        this.ws = null;
        await this.transcriptionContext?.close();
        this.transcriptionContext = null;
        this.workletNode?.disconnect();
        this.workletNode = null;
    }

    private async reconnect() {
        if (++this.reconnectAttempts > this.maxReconnectAttempts) return;
        this.isManualStop = false;
        await this.start();
    }
}

This version of the transcriber ensures better reliability, optimized performance, and robust error handling. Let us know if you need further improvements!

Thanks for your response. The audio cutoff issue has been resolved, but for input audio transcription, we have explicitly set the language option as "en". However, it still detects various languages and provides the transcript based on the detected language. Does this setting have a different purpose? I initially assumed that setting the language to English would restrict detection to only English, but it doesn’t seem to work that way. Is my understanding correct?

Sure! Here’s your translated message for your foreign friends:


Yes, your analysis is correct. The issues faced by this developer mainly include the following:

  1. Incorrect language detection, even when speaking English

This happens because the speech model may misinterpret tones or background noise, leading to incorrect language classification.

Solution:

Set primary_language: "en" in transcription_session.update to force English as the primary language.

Explicitly specify in the prompt that only English should be transcribed:

{
  "prompt": "Transcribe the user's speech in English only. Do not detect or switch to other languages."
}


  2. Incomplete or cut-off transcriptions

This might be caused by the VAD (Voice Activity Detection) silence_duration_ms being too short, making the system prematurely detect silence before the speech has fully ended.

Solution:

Increase silence_duration_ms from 100ms to 250ms to give the system more time to confirm whether speech has truly ended, reducing false cut-offs.

Adjust the threshold: 0.5 value slightly (e.g., try 0.4 or 0.6) to find the best balance.


  3. Monitoring API responses for incorrect language detection

It is important to check the detected language in the handleMessage method. If the API returns the wrong language, we can send a new transcription_session.update request to force correction.

For example:

case 'conversation.item.input_audio_transcription.delta':
    if (data?.transcript) {
        console.log("Detected transcript:", data.transcript);
        if (data?.language && data.language !== "en") {
            console.warn("Incorrect language detected. Forcing English.");
            this.ws.send(JSON.stringify({
                type: "transcription_session.update",
                session: {
                    input_audio_transcription: {
                        primary_language: "en"
                    }
                }
            }));
        }
    }


Final Thoughts

By implementing these improvements, this application will become more stable and reliable for real-time transcription!
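
Putting points 1 and 2 together, the session update could look roughly like this (a sketch only: primary_language is the field suggested above, while the rest of this thread uses language, so double-check the exact field name against the current docs; the VAD numbers are just starting points):

this.ws.send(JSON.stringify({
    type: "transcription_session.update",
    session: {
        input_audio_transcription: {
            model: "gpt-4o-mini-transcribe",
            language: "en", // or primary_language, as suggested above
            prompt: "Transcribe the user's speech in English only. Do not detect or switch to other languages."
        },
        turn_detection: {
            type: "server_vad",
            threshold: 0.4,            // try values between 0.4 and 0.6
            prefix_padding_ms: 600,
            silence_duration_ms: 800   // longer silence window to avoid premature cut-offs
        }
    }
}));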

Hi, I am a newbie to the realtime API and I can only find the transcription API spec here: https://platform.openai.com/docs/api-reference/audio/createTranscription, but it does NOT show anything related to the realtime API. From https://platform.openai.com/docs/guides/speech-to-text#streaming-the-transcription-of-an-ongoing-audio-recording, I can see some sample payloads, but I still have no idea what the format of the output is. Can anyone point me to some docs or give me a sample output of the realtime API?

The solution for cut-off transcriptions worked! Thanks a lot for your help.


Hi,

Realtime transcription issue - #9 by f10w - Try this solution for realtime speech to text. It works well for me.

Docs:
https://platform.openai.com/docs/guides/realtime
https://platform.openai.com/docs/api-reference/realtime-sessions/create-transcription
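
For a rough idea of the output format: based on the event handling earlier in this thread, the server sends JSON events over the WebSocket along these lines (the values below are illustrative, not actual API output):

// Streaming partial transcript (the "delta" events)
{ "type": "conversation.item.input_audio_transcription.delta", "delta": "Hello wor" }

// Final transcript for the speech segment (the "completed" event)
{ "type": "conversation.item.input_audio_transcription.completed", "transcript": "Hello world." }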


You're welcome. If you have any questions, feel free to come to me; I don't know anyone else in this developer circle, and sometimes I'm quite bored. I wish you smooth work ahead.


Subject: Disable Caching in OpenAI Realtime API for Live Translation

Hello everyone,

I’m currently working on a real-time translation project using OpenAI’s Realtime API. My implementation involves sending audio chunks (5 to 15 seconds) for translation in real time.

While the API generally processes and returns translations correctly, I've noticed a recurring issue:

Sometimes, OpenAI reuses the previous audio translation and transcription as a response to the newly sent chunk, even though the new chunk is entirely different.
This suggests some form of caching is occurring, where the system assumes the new input is the same as the previous one due to similar audio properties.

I tried adding an event_id to the request but it is being ignored.

Question:

How can I disable caching when creating a conversation or requesting a response to ensure every audio chunk is translated independently and not affected by previous responses?

Any insights or potential workarounds would be greatly appreciated. Thanks in advance!

Thank you so much for the reply!

Thank you so much for your kind message. I truly appreciate your support, and I’ll definitely reach out if I have any questions or need help. I hope things become more exciting on your end too! Wishing you smooth and happy days ahead at work as well.

Hi,

I have used the code below for real-time transcription. I didn't face any issues like the ones you mentioned. Check it out here: https://community.openai.com/t/realtime-transcription-issue/1150994/10

// Request the microphone stream with current settings
  const getUserStream = async () => {
    if (userStream) return userStream; // Reuse existing stream

    try {
      console.log('Requesting microphone access with settings');
      userStream = await navigator.mediaDevices.getUserMedia({
        audio: {
          echoCancellation: true,
          noiseSuppression: true,
          autoGainControl: true
        }
      });
      console.log('Microphone access granted');
      return userStream;
    } catch (error) {
      console.error('Error accessing the microphone:', error);
      throw error;
    }
  }
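
For reference, a rough sketch of how this stream could be wired into the OpenAITranscriber class shared earlier in the thread (the timeout value and callback bodies are just illustrative):

const stream = await getUserStream();
const transcriber = new OpenAITranscriber(60_000, stream, {
    onInterimTranscript: (text) => console.log('interim:', text),
    onFinalTranscript: (text) => console.log('final:', text),
    onError: (err) => console.error('transcription error:', err),
});
await transcriber.start();
// ...later, when the user ends the session:
await transcriber.stop();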