Realtime transcription issue

I’m trying to transcribe audio using a WebSocket connection. The transcription session is successfully created, but I am not receiving the transcription text. Could you please guide me in resolving this issue?

this.ws = new WebSocket(`wss://api.openai.com/v1/realtime?intent=transcription`, [
    "realtime",
    `openai-insecure-api-key.${token}`,
    "openai-beta.realtime-v1"
]);

this.ws.onopen = () => {
    console.log('Connected to OpenAI realtime API');
    // Send configuration once connected
};

this.ws.onmessage = (event: MessageEvent) => {
    console.log(event);
};

audioWorkletNode.port.onmessage = (event) => {
    if (!this.ws || this.ws.readyState !== WebSocket.OPEN) return;

    const inputData = event.data.audio_data;
    if (!inputData || inputData.length === 0) {
        console.warn('Received empty audio data');
        return;
    }

    const currentBuffer = new Int16Array(inputData);

    // Append the new samples to the queue
    audioBufferQueue = this.mergeBuffers(audioBufferQueue, currentBuffer);
    const bufferDuration =
        (audioBufferQueue.length / this.transcriptionContext.sampleRate) * 1000;

    // Wait until we have 100 ms of audio data
    if (bufferDuration >= 100) {
        const totalSamples = Math.floor(this.transcriptionContext.sampleRate * 0.1);

        // Extract the portion we want to send
        const dataToSend = audioBufferQueue.subarray(0, totalSamples);

        // Encode the Int16Array to base64
        const base64Audio = this.encodeInt16ArrayToBase64(dataToSend);

        // Update the queue to remove the sent data
        audioBufferQueue = audioBufferQueue.subarray(totalSamples);

        // Send the audio data to OpenAI
        this.ws.send(JSON.stringify({
            type: 'input_audio_buffer.append',
            audio: base64Audio
        }));
    }
};

I have also attached a screenshot of the log. I couldn't update the session to use the gpt-4o-mini-transcribe model. I want to use this feature on a production site. Could you please guide me in resolving this issue?
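One likely cause: in the snippet above, the onopen handler has a "Send configuration once connected" comment but never actually sends anything, so the session keeps its default model and gpt-4o-mini-transcribe is never selected. A minimal sketch of the missing step, assuming the transcription_session.update event that the working code later in this thread sends (the field names are taken from that code):

this.ws.onopen = () => {
    console.log('Connected to OpenAI realtime API');

    // Assumed fix: configure the transcription session as soon as the socket opens.
    if (this.ws?.readyState === WebSocket.OPEN) {
        this.ws.send(JSON.stringify({
            type: 'transcription_session.update',
            session: {
                input_audio_transcription: {
                    model: 'gpt-4o-mini-transcribe'
                },
                turn_detection: { type: 'server_vad' }
            }
        }));
    }
};

The transcription_session.updated event that comes back should then report the selected model, which the handleMessage code further down already logs.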


I have successfully connected the transcription session and got the transcript. I’m encountering the following issues:

OpenAI transcription automatically detects the spoken language and returns the transcript accordingly. However, during testing I noticed that even when I speak in English, it sometimes detects the language incorrectly (for example as Spanish or French) and returns the transcript in that language. Additionally, in some cases the transcript is incomplete or gets cut off.

I have used server VAD for voice activity detection. Can anyone guide me in resolving this?

Hi Nathiya, here's my functional code, hope it helps!

import os
import json
import base64
import asyncio
import logging
import aiohttp
import websockets
from dotenv import load_dotenv

load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
if not OPENAI_API_KEY:
    raise ValueError("Missing OpenAI API key.")

logging.basicConfig(level=logging.DEBUG, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger(__name__)

final_transcription = ""

async def create_transcription_session():
    """
    Create a transcription session via the REST API to obtain an ephemeral token.
    This endpoint uses the beta header "OpenAI-Beta: assistants=v2".
    """
    url = "https://api.openai.com/v1/realtime/transcription_sessions"
    payload = {
        "input_audio_format": "g711_ulaw",
        "input_audio_transcription": {
            "model": "gpt-4o-transcribe",
            "language": "en",
            "prompt": "Transcribe the incoming audio in real time."
        },
        "turn_detection": {"type": "server_vad", "silence_duration_ms": 1000}
    }
    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "Content-Type": "application/json",
        "OpenAI-Beta": "assistants=v2"
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(url, json=payload, headers=headers) as resp:
            if resp.status != 200:
                text = await resp.text()
                raise Exception(f"Failed to create transcription session: {resp.status} {text}")
            data = await resp.json()
            ephemeral_token = data["client_secret"]["value"]
            logger.info("Transcription session created; ephemeral token obtained.")
            return ephemeral_token

async def send_audio(ws, file_path: str, chunk_size: int, speech_stopped_event: asyncio.Event):
    """
    Read the local ulaw file and send it in chunks.
    After finishing, wait for 1 second to see if the server auto-commits.
    If not, send a commit event manually.
    """
    try:
        with open(file_path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                # Base64-encode the audio chunk.
                audio_chunk = base64.b64encode(chunk).decode("utf-8")
                audio_event = {
                    "type": "input_audio_buffer.append",
                    "audio": audio_chunk
                }
                await ws.send(json.dumps(audio_event))
                await asyncio.sleep(0.02)  # simulate real-time streaming
        logger.info("Finished sending audio file.")

        # Wait 1 second to allow any late VAD events before committing.
        try:
            await asyncio.wait_for(speech_stopped_event.wait(), timeout=1.0)
            logger.debug("Speech stopped event received; no manual commit needed.")
        except asyncio.TimeoutError:
            commit_event = {"type": "input_audio_buffer.commit"}
            await ws.send(json.dumps(commit_event))
            logger.info("Manually sent input_audio_buffer.commit event.")
    except FileNotFoundError:
        logger.error(f"Audio file not found: {file_path}")
    except Exception as e:
        logger.error("Error sending audio: %s", e)

async def receive_events(ws, speech_stopped_event: asyncio.Event):
    """
    Listen for events from the realtime endpoint.
    Capture transcription deltas and the final complete transcription.
    Set the speech_stopped_event when a "speech_stopped" event is received.
    """
    global final_transcription
    try:
        async for message in ws:
            try:
                event = json.loads(message)
                event_type = event.get("type")
                if event_type == "input_audio_buffer.speech_stopped":
                    logger.debug("Received event: input_audio_buffer.speech_stopped")
                    speech_stopped_event.set()
                elif event_type == "conversation.item.input_audio_transcription.delta":
                    delta = event.get("delta", "")
                    logger.info("Transcription delta: %s", delta)
                    final_transcription += delta
                elif event_type == "conversation.item.input_audio_transcription.completed":
                    completed_text = event.get("transcript", "")
                    logger.info("Final transcription completed: %s", completed_text)
                    final_transcription = completed_text  # Use the completed transcript
                    break  # Exit after final transcription
                elif event_type == "error":
                    logger.error("Error event: %s", event.get("error"))
                else:
                    logger.debug("Received event: %s", event_type)
            except Exception as ex:
                logger.error("Error processing message: %s", ex)
    except Exception as e:
        logger.error("Error receiving events: %s", e)

async def test_transcription():
    try:
        # Step 1: Create transcription session and get ephemeral token.
        ephemeral_token = await create_transcription_session()

        # Step 2: Connect to the base realtime endpoint.
        websocket_url = "wss://api.openai.com/v1/realtime"
        connection_headers = {
            "Authorization": f"Bearer {ephemeral_token}",
            "OpenAI-Beta": "realtime=v1"
        }
        async with websockets.connect(websocket_url, additional_headers=connection_headers) as ws:
            logger.info("Connected to realtime endpoint.")

            # Step 3: Send transcription session update event with adjusted VAD settings.
            update_event = {
                "type": "transcription_session.update",
                "session": {
                    "input_audio_transcription": {
                        "model": "gpt-4o-transcribe",
                        "language": "en",
                        "prompt": "Transcribe the incoming audio in real time."
                    },
                    # Matching the REST API settings
                    "turn_detection": {"type": "server_vad", "silence_duration_ms": 1000}
                }
            }
            await ws.send(json.dumps(update_event))
            logger.info("Sent transcription session update event.")

            # Create an event to signal if speech stopped is detected.
            speech_stopped_event = asyncio.Event()

            # Step 4: Run sender and receiver concurrently.
            sender_task = asyncio.create_task(send_audio(ws, "static/Welcome.ulaw", 1024, speech_stopped_event))
            receiver_task = asyncio.create_task(receive_events(ws, speech_stopped_event))
            await asyncio.gather(sender_task, receiver_task)

            # Print the final transcription.
            logger.info("Final complete transcription: %s", final_transcription)
            print("Final complete transcription:")
            print(final_transcription)

    except Exception as e:
        logger.error("Error in transcription test: %s", e)

if __name__ == "__main__":
    asyncio.run(test_transcription())

// Create a WebSocket connection
this.ws = new WebSocket(`wss://api.openai.com/v1/realtime?intent=transcription`, [
    "realtime",
    `openai-insecure-api-key.${token}`,
    "openai-beta.realtime-v1"
]);

// Handle the WebSocket connection being established
this.ws.onopen = () => {
    console.log('Connected to the OpenAI realtime API');

    // Read any cached messages from localStorage and send them
    let cachedMessages = JSON.parse(localStorage.getItem('cachedMessages') || '[]');
    cachedMessages.forEach(msg => this.ws.send(msg));

    // Clear the cache after sending
    localStorage.removeItem('cachedMessages');
};

// Handle incoming WebSocket messages
this.ws.onmessage = (event) => {
    console.log(event);
};

// Handle the WebSocket connection closing
this.ws.onclose = () => {
    console.log('WebSocket connection closed');

    // Record the disconnect time to avoid frequent reconnect attempts
    localStorage.setItem('lastDisconnect', Date.now());
};

// Handle WebSocket errors
this.ws.onerror = (error) => {
    console.error('WebSocket error:', error);
};

// Audio data processing
audioWorkletNode.port.onmessage = (event) => {
    if (!this.ws || this.ws.readyState !== WebSocket.OPEN) return;

    const inputData = event.data.audio_data;
    if (!inputData || inputData.length === 0) {
        console.warn('Received empty audio data');
        return;
    }

    const currentBuffer = new Int16Array(inputData);

    // Merge audio data
    audioBufferQueue = this.mergeBuffers(audioBufferQueue, currentBuffer);
    const bufferDuration = (audioBufferQueue.length / this.transcriptionContext.sampleRate) * 1000;

    // Make sure we have at least 100 ms of audio data
    if (bufferDuration >= 100) {
        const totalSamples = Math.floor(this.transcriptionContext.sampleRate * 0.1);

        // Extract the audio data to be sent
        const dataToSend = audioBufferQueue.subarray(0, totalSamples);

        // Encode the audio data to base64
        const base64Audio = this.encodeInt16ArrayToBase64(dataToSend);

        // Update the queue and remove the sent data
        audioBufferQueue = audioBufferQueue.subarray(totalSamples);

        // Send the audio data to OpenAI
        this.ws.send(JSON.stringify({
            type: 'input_audio_buffer.append',
            audio: base64Audio
        }));
    }
};

// Use an HTTP request to send data when the WebSocket is disconnected
function sendMessage(data) {
    if (this.ws.readyState === WebSocket.OPEN) {
        this.ws.send(data);
    } else {
        console.warn('WebSocket disconnected, using HTTP to send data');

        // Cache the data until the WebSocket recovers
        let cachedMessages = JSON.parse(localStorage.getItem('cachedMessages') || '[]');
        cachedMessages.push(data);
        localStorage.setItem('cachedMessages', JSON.stringify(cachedMessages));

        // Use fetch to send the data
        fetch('https://example.com/api/fallback', {
            method: 'POST',
            body: JSON.stringify(data),
            headers: { 'Content-Type': 'application/json' }
        }).then(response => response.json())
            .then(result => console.log('HTTP backup transfer succeeded:', result))
            .catch(error => console.error('HTTP transfer failed:', error));
    }
}

// Use sendBeacon to send the last data when the page is closed
window.addEventListener('beforeunload', () => {
    let cachedMessages = localStorage.getItem('cachedMessages');
    if (cachedMessages) {
        console.log('Sending remaining data before page unload');
        navigator.sendBeacon('https://example.com/api/analytics', cachedMessages);
    }
});

WebSocket Connection Management:

We establish the WebSocket connection as in the original code and listen for its onopen, onmessage, onclose and onerror events.

In the onopen handler, we read any cached messages from localStorage, send them, and clear the cache once the connection is restored.

Audio data transmission:

audioWorkletNode.port.onmessage receives the audio data; we buffer it until at least 100 ms has accumulated before sending.

Once enough data is buffered, it is Base64-encoded and sent over the WebSocket.

Using HTTP transport when the WebSocket is disconnected:

When the WebSocket is disconnected, we use fetch to send the audio data to a fallback server over HTTP. The data is also cached so it can be re-sent once the WebSocket is restored.

This ensures that no data is lost while the WebSocket connection is down.

Sending the last data when the page is closed:

navigator.sendBeacon sends any cached messages when the page is unloaded, so the final data still goes out.

Key points:
Caching mechanism: unsent data is cached in localStorage so it can be transmitted once the WebSocket recovers.

Backup mechanism: when the WebSocket is disconnected, HTTP requests serve as a backup transport to keep data delivery reliable.

Guarantee on page unload: sendBeacon ensures data can still be sent even as the page is being closed.

Hi Tony, thanks for your reply. I tested the sample code you provided and it works well. But I'm trying to do real-time transcription. Here is my code.

interface TranscriberCallbacks {
    onInterimTranscript: (text: string) => void;
    onFinalTranscript: (text: string) => void;
    onError: (error: string) => void;
}

export default class OpenAITranscriber {
    private ws: WebSocket | null = null;
    private transcriptionContext: AudioContext | null = null;
    private workletNode: AudioWorkletNode | null = null;
    private isManualStop: boolean = false;
    private sessionTimeout: number;
    private userStream: MediaStream;
    private onInterimTranscript: (text: string) => void;
    private onFinalTranscript: (text: string) => void;
    private onError: (error: string) => void;
    private reconnectAttempts = 0;
    private maxReconnectAttempts = 3;


    constructor(
        sessionTimeout: number,
        userStream: MediaStream,
        callbacks: TranscriberCallbacks
    ) {
        this.sessionTimeout = sessionTimeout;
        this.userStream = userStream;
        this.onInterimTranscript = callbacks.onInterimTranscript;
        this.onFinalTranscript = callbacks.onFinalTranscript;
        this.onError = callbacks.onError;
    }

    private async fetchAccessToken(): Promise<string> {
        try {
            const expiresIn = Math.floor(this.sessionTimeout / 1000); // Convert to seconds
            const response = await fetch('/api/openai-token', {
                method: 'POST',
                headers: {
                    'Content-Type': 'application/json',
                },
                body: JSON.stringify({
                    expiresIn,
                }),
            });

            if (response.ok) {
                const data = await response.json();
                console.log(data);
                return data.client_secret.value;
            }
            else {
                const data = await response.text();
                throw new Error(`${JSON.parse(data).error}`);
            }
        } catch (error) {
            console.error(`Error fetching access token for STT: ${error}`);
            throw error;
        }
    }

    public async start(): Promise<void> {
        if (this.ws) {
            console.warn('Transcription is already in progress');
            return;
        }

        try {
            const token = await this.fetchAccessToken();
            if (!token) throw new Error('Failed to fetch access token');

            // Connect to OpenAI's realtime API
            this.ws = new WebSocket(`wss://api.openai.com/v1/realtime?intent=transcription`, [
                "realtime",
                `openai-insecure-api-key.${token}`,
                "openai-beta.realtime-v1"
            ]);


            this.ws.onopen = () => {
                console.log('Connected to OpenAI realtime API');
                this.reconnectAttempts = 0;
                if (this.ws?.readyState === 1) {
                    this.ws.send(JSON.stringify({
                        type: "transcription_session.update",
                        session: {
                            input_audio_transcription: {
                                model: "gpt-4o-mini-transcribe",
                                language: "en",
                                // prompt: "Transcribe the user's speech and translate it into English accurately while maintaining the original meaning and correct grammar."
                            },
                            turn_detection: {
                                prefix_padding_ms: 600,
                                silence_duration_ms: 800,
                                type: "server_vad",
                                threshold: 0.5
                            }
                        },
                    }));
                }

            };

            this.ws.onmessage = (event: MessageEvent) => {
                const data = JSON.parse(event.data);
                this.handleMessage(data);
            };

            this.ws.onerror = (error: Event) => {
                console.error('WebSocket error:', error);
                this.onError(`Connection error: ${error}`);

                this.stop();
                this.reconnect();
            };

            this.ws.onclose = () => {
                console.log('WebSocket connection closed');
                if (!this.isManualStop) {
                    this.stop();
                    this.reconnect();
                }
            };

            await this.initAudioProcessing();

        } catch (error) {
            console.error('Error starting transcription:', error);
            this.onError(error instanceof Error ? error.message : 'Unknown error');
            this.stop();
        }
    }

    private handleMessage(data: any): void {
        switch (data.type) {
            case 'transcription_session.created':
                if (data?.session)
                    console.log("[transcription_session.created]", "Audio format: ", data.session.input_audio_format, "Expires at: ", data.session.expires_at, "Silence duration in ms: ", data.session.turn_detection.silence_duration_ms);
                break;

            case 'transcription_session.updated':
                if (data?.session)
                    console.log("[transcription_session.updated]", "Audio format: ", data.session.input_audio_format, "Expires at: ", data.session.expires_at, "Model: ", data.session.input_audio_transcription.model, "Silence duration in ms: ", data.session.turn_detection.silence_duration_ms);
                break;

            case 'conversation.item.created':
                if (data?.item)
                    console.log("[conversation.item.created]", "Conversation Status: ", data.item.status);
                break;

            case 'conversation.item.input_audio_transcription.delta':
                if (data?.delta) {
                    console.log("[conversation.item.input_audio_transcription.delta]", "Transcript delta: ", data.delta);
                    // this.onInterimTranscript(data.delta);
                }
                break;

            case 'conversation.item.input_audio_transcription.completed':
                if (data?.transcript) {
                    console.log("[conversation.item.input_audio_transcription.completed]", "Transcript: ", data.transcript);
                    this.onFinalTranscript(data.transcript);
                }
                break;

            case 'error':
                if (data?.error) {
                    console.log("[error]", "Type: ", data.error.type, "Code: ", data.error.code, "Error: ", data.error.message);
                    this.onError(data.error.message);
                }
                break;

            default:
                console.log('Unhandled message type:', data.type);
        }
    }

    private async initAudioProcessing(): Promise<void> {
        try {
            this.transcriptionContext = new AudioContext({
                sampleRate: 24000,
                latencyHint: 'balanced'
            });
            const source = this.transcriptionContext.createMediaStreamSource(this.userStream);

            await this.transcriptionContext.audioWorklet.addModule('audio-processor.js');
            this.workletNode = new AudioWorkletNode(this.transcriptionContext, 'audio-processor');

            source.connect(this.workletNode);
            this.workletNode.connect(this.transcriptionContext.destination);

            let audioBufferQueue = new Int16Array(0);
            // Process audio data
            this.workletNode.port.onmessage = (event) => {

                if (!this.ws || this.ws.readyState !== WebSocket.OPEN) return;

                const currentBuffer = new Int16Array(event.data.audio_data);

                // Use type assertion to assure TypeScript this is compatible
                audioBufferQueue = this.mergeBuffers(
                    audioBufferQueue,
                    currentBuffer
                );
                const bufferDuration =
                    (audioBufferQueue.length / this.transcriptionContext.sampleRate) * 1000;

                // wait until we have 100ms of audio data
                if (bufferDuration >= 100) {
                    const totalSamples = Math.floor(this.transcriptionContext.sampleRate * 0.1);

                    // Extract the portion we want to send
                    const finalBuffer = audioBufferQueue.subarray(0, totalSamples);

                    // Update our queue to remove the sent data
                    audioBufferQueue = audioBufferQueue.subarray(totalSamples);

                    // Encode the Int16Array to base64
                    const base64Audio1 = this.encodeInt16ArrayToBase64(finalBuffer);

                    // Send the audio data to OpenAI
                    this.ws.send(JSON.stringify({
                        type: 'input_audio_buffer.append',
                        audio: base64Audio1
                    }));
                }
            };
        } catch (error) {
            console.error('Error initializing audio processing:', error);
            throw error;
        }
    }

    private mergeBuffers(lhs: Int16Array, rhs: Int16Array): Int16Array {
        const mergedBuffer = new Int16Array(lhs.length + rhs.length);
        mergedBuffer.set(lhs, 0);
        mergedBuffer.set(rhs, lhs.length);
        return mergedBuffer;
    }

    private encodeInt16ArrayToBase64(int16Array: Int16Array): string {
        // Directly use the Int16Array's underlying buffer
        const bytes = new Uint8Array(int16Array.buffer);

        // Chunked processing for large arrays (avoids call stack limits)
        const chunkSize = 0x8000; // 32KB chunks
        let binary = '';

        for (let i = 0; i < bytes.length; i += chunkSize) {
            const chunk = bytes.subarray(i, Math.min(i + chunkSize, bytes.length));
            binary += String.fromCharCode(...chunk); // Spread operator
        }

        return btoa(binary);
    }

    public async stop(): Promise<void> {
        this.isManualStop = true; // Set the manual stop flag

        if (this.ws) {
            this.ws.close();
            this.ws = null;
        }

        if (this.transcriptionContext) {
            await this.transcriptionContext.close();
            this.transcriptionContext = null;
        }

        if (this.workletNode) {
            this.workletNode.port.onmessage = null;
            this.workletNode.disconnect();
            this.workletNode = null;
        }
    }

    private async reconnect() {
        if (this.reconnectAttempts >= this.maxReconnectAttempts) {
            console.log('Max reconnect attempts reached. Stopping further attempts.');
            return; // Stop further reconnection attempts
        }
        console.log(`Attempting to reconnect... (${this.reconnectAttempts + 1}/${this.maxReconnectAttempts})`);
        this.isManualStop = false; // Reset the manual stop flag
        this.reconnectAttempts++; // Increment attempt counter
        try {
            await this.start();
        } catch (error) {
            console.error(`Reconnect attempt ${this.reconnectAttempts} failed:`, error);
        }
    }
}
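For completeness, fetchAccessToken above expects a backend route (/api/openai-token in this code) that mints the ephemeral token. Below is a minimal sketch of such a route, assuming Node 18+ with Express; it simply mirrors the POST to /v1/realtime/transcription_sessions from the Python example earlier in the thread. The route path, payload defaults, and error handling here are assumptions, not part of the original application.

import express from 'express';

const app = express();
app.use(express.json());

app.post('/api/openai-token', async (req, res) => {
    // Note: req.body.expiresIn sent by the client is not forwarded here;
    // how (or whether) expiry can be configured is left open.
    try {
        const response = await fetch('https://api.openai.com/v1/realtime/transcription_sessions', {
            method: 'POST',
            headers: {
                'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
                'Content-Type': 'application/json',
                'OpenAI-Beta': 'assistants=v2' // beta header as used in the Python example above
            },
            // Payload mirrors the Python example; adjust model and settings to your setup.
            body: JSON.stringify({
                input_audio_transcription: {
                    model: 'gpt-4o-mini-transcribe',
                    language: 'en'
                },
                turn_detection: { type: 'server_vad', silence_duration_ms: 1000 }
            })
        });

        if (!response.ok) {
            res.status(response.status).json({ error: await response.text() });
            return;
        }

        // The browser client reads client_secret.value from this JSON.
        res.json(await response.json());
    } catch (err) {
        res.status(500).json({ error: String(err) });
    }
});

app.listen(3000);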

I'm developing a customer support application that requires real-time transcription. The code is functioning correctly, and I have successfully converted the microphone stream into base64. However, I'm facing an issue where the language detection is inconsistent: even when speaking in English, it sometimes detects a different language, such as Spanish or French, and returns the transcript accordingly. Additionally, in some cases the transcription is incomplete or gets cut off. Once again, thanks for your reply.
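For the misdetection and cut-off issues, the knobs the rest of this thread points at are already in the session update above: pin the language instead of relying on auto-detection, and give server VAD a longer silence window so turns are not closed mid-sentence (the Python example earlier uses 1000 ms). A sketch with those settings, where the specific numbers are assumptions to experiment with rather than documented recommendations:

this.ws.send(JSON.stringify({
    type: 'transcription_session.update',
    session: {
        input_audio_transcription: {
            model: 'gpt-4o-mini-transcribe',
            language: 'en',                      // pin the language instead of auto-detecting
            prompt: 'Transcribe the incoming audio in real time.'
        },
        turn_detection: {
            type: 'server_vad',
            threshold: 0.5,
            prefix_padding_ms: 600,
            silence_duration_ms: 1000            // longer pause tolerance before a turn is cut off
        }
    }
}));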

Hi,

Thanks for your reply. I have shared my code and the issues. Could you please check it?

The correct URL is "wss://api.openai.com/v1/realtime?intent=transcription" (you missed intent=transcription).

I have used the correct URL. Please check my code.

Sorry, I was replying to @TonyStark; you can see that at the top right of my message (the updated UI is very bad).