Extracting Transcription Without Using input_audio.input_transcription in OpenAI API

Hi everyone,

I’m working with OpenAI’s real-time transcription API and noticed that input_audio.input_transcription provides the transcribed text. However, I want to extract the transcription using a different approach—without relying on this built-in property.

Are there any other event properties or API methods that can provide the transcribed text? Specifically, is there a way to get the transcription from the response object or another API event?

Would appreciate any insights or alternative approaches!

Thanks!

1 Like

You could try storing the audio that gets sent and passing that to the Whisper API directly, or to a third-party transcription service.

3 Likes

Sure. OpenAI’s speech-to-text engine is called Whisper, and it has its own API endpoint where you can send audio and receive a transcription, at a significantly lower price than audio input sent to the GPT-4o audio models (which are meant for actually reasoning about the audio and generating AI-powered chat responses, not just transcribing it).

The transcription that you receive along with the realtime API response is actually powered by Whisper, using the AI’s generated speech as an input, so a separate call to Whisper will likely not be all that different in quality. You can also use Whisper for a transcript of what the user said as input, for presentation in a chat user interface.
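For example, a minimal sketch with the official openai Node SDK; the file name is just a placeholder for whatever audio you have saved:

// Sketch: transcribing a saved audio file with the Whisper endpoint.
// Assumes the "openai" Node SDK and OPENAI_API_KEY in the environment;
// "user_turn.wav" is only an illustrative file name.
import fs from "fs";
import OpenAI from "openai";

const client = new OpenAI();

async function transcribeFile(path) {
    const result = await client.audio.transcriptions.create({
        file: fs.createReadStream(path),    // flac, mp3, m4a, ogg, wav, webm, ...
        model: "whisper-1",
        language: "en",                     // optional hint, can help accuracy
        // prompt: "domain-specific terms", // optional vocabulary/style hint
    });
    return result.text;                     // plain transcript string
}

console.log(await transcribeFile("user_turn.wav"));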

1 Like

I tried using whisper-1, but the accuracy is not that good, even for English. The transcription quality isn’t great, so I am looking for a way to get good accuracy in transcription.

Whisper is also open source, so you can use a few different model variants besides the single choice on OpenAI’s API, such as an English-only model, but you would need your own inference server or an alternate provider capable of running the model.

The output of Whisper is a formless string of sentences. You can have a separate language model “fix up” the appearance of the text, and also have it proofread to whatever degree you instruct, with an equal chance that it corrects something or alters something in an undesired manner.
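A rough sketch of that cleanup pass, again assuming the openai Node SDK; the model choice and the instruction wording here are placeholders you would tune:

// Sketch: post-processing a Whisper transcript with a chat model.
// The model name and instructions are placeholders, not recommendations.
import OpenAI from "openai";

const client = new OpenAI();

async function polishTranscript(rawTranscript) {
    const completion = await client.chat.completions.create({
        model: "gpt-4o-mini",
        temperature: 0,
        messages: [
            {
                role: "system",
                content:
                    "You reformat speech-to-text transcripts. Add punctuation and paragraph breaks. " +
                    "Fix only obvious transcription errors; never change the meaning or add content.",
            },
            { role: "user", content: rawTranscript },
        ],
    });
    return completion.choices[0].message.content;
}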

You can also process the audio more before sending it for transcription, such as normalizing the levels and filtering to just the voice passband like telephony, and see if you get better results, although the AI generation is already pretty clear. GPT-4o voice output being inherently artificial and synthesized may be the ultimate source of challenges.
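If you want to try that in the browser, here is a rough sketch using the Web Audio API; the cutoff frequencies and the compressor are only example values, not tuned settings:

// Sketch: band-pass filtering and crude level compression before transcription.
// Runs in the browser; all filter settings here are example values only.
async function preprocessForTranscription(arrayBuffer) {
    // Decode whatever the recorder produced into raw PCM.
    const decoded = await new AudioContext().decodeAudioData(arrayBuffer);

    // Render offline to mono 16 kHz, which is plenty for speech.
    const offline = new OfflineAudioContext(1, Math.ceil(decoded.duration * 16000), 16000);
    const source = offline.createBufferSource();
    source.buffer = decoded;

    const highpass = offline.createBiquadFilter();
    highpass.type = "highpass";
    highpass.frequency.value = 300;   // drop rumble below the voice band

    const lowpass = offline.createBiquadFilter();
    lowpass.type = "lowpass";
    lowpass.frequency.value = 3400;   // telephony-style upper cutoff

    const leveler = offline.createDynamicsCompressor(); // rough auto-leveling

    source.connect(highpass).connect(lowpass).connect(leveler).connect(offline.destination);
    source.start();

    return offline.startRendering(); // AudioBuffer you can re-encode (e.g., to WAV)
}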

Recommending products better than OpenAI is a bit beyond the scope of the OpenAI forum - and in this case, it’s also hard to make recommendations that would be definite improvements.

1 Like

Are you using the language parameter to increase accuracy?

I tried that, but it is not working properly yet.

Just want to ask one thing: is there an audio format that is most suitable for getting the best output? I am using a client-server architecture, so the audio is base64-encoded when it is sent to the backend for processing.

"use client";

import { useRef, useCallback } from "react";
import { Recorder } from "@/components/audio/recorder";
import { encode } from "base64-arraybuffer";  // ✅ Correct Base64 encoding

const BUFFER_SIZE = 4096;  // ✅ Adjusted for better performance
const SILENCE_THRESHOLD = 10;  // ✅ Average byte magnitude (0–255) below which input counts as silence

export default function useAudioRecorder({ onAudioRecorded, onAudioProcessingStarted }) {
    const audioRecorder = useRef(null);
    const buffer = useRef(new Uint8Array());
    const analyserNode = useRef(null); // ✅ Silence detection
    const silenceTimerRef = useRef(null);
    const audioContextRef = useRef(null);

    // ✅ Append data to buffer safely
    const appendToBuffer = (newData) => {
        const newBuffer = new Uint8Array(buffer.current.length + newData.length);
        newBuffer.set(buffer.current);
        newBuffer.set(newData, buffer.current.length);
        buffer.current = newBuffer;
    };

    // ✅ Silence Detection (Web Audio API)
    const detectSilence = (stream) => {
        const audioContext = new AudioContext();
        audioContextRef.current = audioContext;
        const source = audioContext.createMediaStreamSource(stream);
        const analyser = audioContext.createAnalyser();
        analyser.fftSize = 512;
        source.connect(analyser);
        analyserNode.current = analyser;
    };

    const handleAudioData = (data) => {
        if (silenceTimerRef.current) clearTimeout(silenceTimerRef.current);

        // ✅ Check if user is silent (getByteFrequencyData returns 0–255 magnitudes, not dB)
        if (analyserNode.current) {
            const dataArray = new Uint8Array(analyserNode.current.frequencyBinCount);
            analyserNode.current.getByteFrequencyData(dataArray);
            const avgVolume = dataArray.reduce((a, b) => a + b, 0) / dataArray.length;
            
            if (avgVolume < SILENCE_THRESHOLD) {
                silenceTimerRef.current = setTimeout(() => {
                    onAudioProcessingStarted?.();
                }, 1000);
            }
        }

        appendToBuffer(new Uint8Array(data));

        if (buffer.current.length >= BUFFER_SIZE) {
            const toSend = new Uint8Array(buffer.current.slice(0, BUFFER_SIZE));
            buffer.current = new Uint8Array(buffer.current.slice(BUFFER_SIZE));

            // ✅ Proper Base64 encoding (encode() expects an ArrayBuffer)
            const base64 = encode(toSend.buffer);
            onAudioRecorded(base64);
        }
    };

    const start = async () => {
        if (!audioRecorder.current) {
            audioRecorder.current = new Recorder(handleAudioData);
        }

        if (silenceTimerRef.current) {
            clearTimeout(silenceTimerRef.current);
            silenceTimerRef.current = null;
        }

        const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
        detectSilence(stream);
        await audioRecorder.current.start(stream);
    };

    const stop = async () => {
        if (silenceTimerRef.current) {
            clearTimeout(silenceTimerRef.current);
            silenceTimerRef.current = null;
        }

        if (audioRecorder.current) {
            await audioRecorder.current.stop();
            audioRecorder.current = null;
            buffer.current = new Uint8Array();
        }

        if (audioContextRef.current) {
            audioContextRef.current.close();
            audioContextRef.current = null;
        }
    };

    return { start, stop };
}

Your code doesn’t show any Whisper usage; it’s just for gpt-4o audio.

Again: there is no point in using the realtime API and multimodal AI language models solely for speech-to-text. It is more expensive than Whisper and more prone to mistakes. If you want a transcription: just send to Whisper.

GPT-4o audio on the realtime API gives you a choice of pcm16 or 8-bit telephony formats. Use pcm16, which returns 24 kHz mono audio, unless you have a reason for the others. The parallel transcription is useful when the AI has said something new, like answering a question that was asked, not as a transcription service for your input.


For Whisper input, you can also try to improve the recording quality of the user’s voice with audio processing: a microphone-setup step in the user interface, silence removal, auto-leveling, noise cancelling. Some of these could be automatic in code, and some could be done manually on files with audio software.

For Whisper, the input file must be one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm. The best quality would be to send flac or wav, which do not involve a second lossy encoding. There’s a maximum file size, though, so compression can be useful despite its lossy nature; Opus compression in an ogg file can get you to over an hour in one API call.
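If you are buffering the raw pcm16 from the realtime session yourself, one sketch of getting it into an accepted container is to prepend a standard WAV header before uploading; the parameters below assume the default 24 kHz, 16-bit, mono output:

// Sketch: wrapping raw pcm16 (16-bit little-endian, mono, 24 kHz) in a WAV container
// so it can be sent to the transcription endpoint. Assumes you have already
// concatenated the realtime audio deltas into a single Uint8Array of PCM bytes.
function pcm16ToWav(pcmBytes, sampleRate = 24000, channels = 1) {
    const header = new ArrayBuffer(44);
    const view = new DataView(header);
    const writeString = (offset, s) => {
        for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i));
    };
    const byteRate = sampleRate * channels * 2;    // 16-bit samples

    writeString(0, "RIFF");
    view.setUint32(4, 36 + pcmBytes.length, true); // chunk size: file size minus 8
    writeString(8, "WAVE");
    writeString(12, "fmt ");
    view.setUint32(16, 16, true);                  // fmt sub-chunk size
    view.setUint16(20, 1, true);                   // audio format: PCM
    view.setUint16(22, channels, true);
    view.setUint32(24, sampleRate, true);
    view.setUint32(28, byteRate, true);
    view.setUint16(32, channels * 2, true);        // block align
    view.setUint16(34, 16, true);                  // bits per sample
    writeString(36, "data");
    view.setUint32(40, pcmBytes.length, true);

    return new Blob([header, pcmBytes], { type: "audio/wav" });
}

The resulting Blob can then be uploaded as a .wav file, or converted to flac/ogg first if you need to stay under the size limit.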

Actually, what I am building is a voice RAG, so I take the input as voice, but the issue is that the transcription doesn’t come back with good accuracy.

And the code related to Whisper is here:


"use client";

import useWebSocket from "react-use-websocket";
import { useEffect, useRef } from "react";

/**
 * @typedef {Object} Parameters
 * @property {boolean} [useDirectAoaiApi]
 * @property {string} [aoaiEndpointOverride]
 * @property {string} [aoaiApiKeyOverride]
 * @property {string} [aoaiModelOverride]
 * @property {boolean} [enableInputAudioTranscription]
 * @property {Function} [onWebSocketOpen]
 * @property {Function} [onWebSocketClose]
 * @property {Function} [onWebSocketError]
 * @property {Function} [onWebSocketMessage]
 * @property {Function} [onReceivedResponseAudioDelta]
 * @property {Function} [onReceivedInputAudioBufferSpeechStarted]
 * @property {Function} [onReceivedResponseDone]
 * @property {Function} [onReceivedExtensionMiddleTierToolResponse]
 * @property {Function} [onReceivedResponseAudioTranscriptDelta]
 * @property {Function} [onReceivedInputAudioTranscriptionCompleted]
 * @property {Function} [onReceivedError]
 * @property {Function} [onReceivedMicControl]
 * @property {Function} [onReceivedResponseTextDelta]
 * @property {Function} [onReceivedResponseTextComplete] 
 * @property {Function} [onReceivedAudioFormat]
 * @property {Function} [onReceivedAudioStream]
 * @property {Function} [onReceivedAudioComplete]
 * @property {Function} [onReceivedSpeakerStatus]
 * @property {Function} [onReceivedLanguageStatus]
 * @property {boolean} [shouldConnect]
 */
export default function useRealTime({
    useDirectAoaiApi,
    aoaiEndpointOverride,
    aoaiApiKeyOverride,
    aoaiModelOverride,
    enableInputAudioTranscription,
    onWebSocketOpen,
    onWebSocketClose,
    onWebSocketError,
    onWebSocketMessage,
    onReceivedResponseDone,
    onReceivedResponseAudioDelta,
    onReceivedResponseAudioTranscriptDelta,
    onReceivedInputAudioBufferSpeechStarted,
    onReceivedExtensionMiddleTierToolResponse,
    onReceivedInputAudioTranscriptionCompleted,
    onReceivedError,
    onReceivedMicControl,
    onReceivedResponseTextDelta,
    onReceivedResponseTextComplete, 
    onReceivedAudioFormat,
    onReceivedAudioStream,
    onReceivedAudioComplete,
    onReceivedSpeakerStatus,
    onReceivedLanguageStatus,
    shouldConnect = false
}) {
    const wsEndpoint = useDirectAoaiApi
        ? `${aoaiEndpointOverride}/openai/realtime?api-key=${aoaiApiKeyOverride}&deployment=${aoaiModelOverride}&api-version=2024-10-01-preview`
        : `${process.env.NEXT_PUBLIC_WS_URL}/realtime`;

    const { sendJsonMessage, readyState, lastJsonMessage } = useWebSocket(wsEndpoint, {
        onOpen: () => {
            onWebSocketOpen?.();
        },
        onClose: () => onWebSocketClose?.(),
        onError: event => onWebSocketError?.(event),
        onMessage: event => {
            onMessageReceived(event);
            onWebSocketMessage?.(event);
        },
        shouldReconnect: () => shouldConnect,
        reconnectAttempts: 10,
        reconnectInterval: 3000,
    }, 
    shouldConnect
    );

    // Simplified safeSendJsonMessage that doesn't rely on ReadyState
    const safeSendJsonMessage = (message) => {
        try {
            sendJsonMessage(message);
        } catch (error) {
            console.error('Error sending message:', error);
        }
    };

    const startSession = (userLanguage = "en") => {
        const command = {
            type: "session.update",
            session: {
                turn_detection: {
                    type: "server_vad"
                },
                input_audio_transcription: {
                    model: "whisper-1",
                    language: userLanguage
                }
            }
        };
        safeSendJsonMessage(command);
    };

    const addUserAudio = (base64Audio) => {
        if (!base64Audio || typeof base64Audio !== "string") {
            console.error("Invalid base64Audio data:", base64Audio);
            return;
        }
    
        const command = {
            type: "input_audio_buffer.append",
            audio: base64Audio
        };
    
        console.log("Sending audio data:", command);
        safeSendJsonMessage(command);
    };

    const inputAudioBufferClear = () => {
        const command = {
            type: "input_audio_buffer.clear"
        };

        safeSendJsonMessage(command);
    };

    const stopSession = () => {
        inputAudioBufferClear();
    };

    const sendTextInput = async (text) => {
        if (text.trim()) {
            const textCommand = {
                type: "conversation.item.create",
                item: {
                    type: "message",
                    role: "user",
                    content: [
                        {
                            type: "input_text",
                            text: text
                        }
                    ]
                }
            };
            safeSendJsonMessage(textCommand);

            const responseCommand = {
                type: "response.create"
            };
            safeSendJsonMessage(responseCommand);
        }
    };

    const onMessageReceived = (event) => {
        try {
            const message = JSON.parse(event.data);
            // [TEMPORARY DEBUGGING CODE - REMOVE AFTER TESTING]
            // Log all message types to help troubleshoot
            console.log(`WebSocket received message type: ${message.type}`, message);
            
            switch (message.type) {
                case "response.done":
                    onReceivedResponseDone?.(message);
                    break;
                case "response.audio.delta":
                    onReceivedResponseAudioDelta?.(message);
                    break;
                case "response.audio_transcript.delta":
                    onReceivedResponseAudioTranscriptDelta?.(message);
                    break;
                case "response.text.delta":
                    onReceivedResponseTextDelta?.(message.delta);
                    break;
                case "response.text.done":
                    if (message.text) {
                        onReceivedResponseTextDelta?.(message.text);
                    }
                    break;
                case "input_audio_buffer.speech_started":
                    onReceivedInputAudioBufferSpeechStarted?.(message);
                    break;
                case "mic_control":
                    // Handle microphone control messages from the backend
                    onReceivedMicControl?.(message);
                    break;
                case "speaker.status":
                    // Handle speaker status update from the backend
                    onReceivedSpeakerStatus?.(message);
                    break;
                case "language.status":
                    // Handle language status update from the backend
                    onReceivedLanguageStatus?.(message);
                    break;
                case "conversation.item.input_audio_transcription.completed":
                    if (message.item?.content?.[0]?.transcript) {
                        console.log("Final transcript:", message.item.content[0].transcript);
                        onReceivedInputAudioTranscriptionCompleted?.({
                            transcript: message.item.content[0].transcript
                        });
                    } else {
                        console.warn("Received transcription event but no transcript found", message);
                    }
                    break;
                case "conversation.item.create":
                    // [TEMPORARY DEBUGGING CODE - REMOVE AFTER TESTING]
                    console.log("DEBUG: Received conversation item create event", message);
                    if (message.item?.content?.[0]?.transcript) {
                        // [TEMPORARY DEBUGGING CODE - REMOVE AFTER TESTING]
                        console.log("DEBUG: Found transcript in conversation item:", message.item.content[0].transcript);
                        onReceivedInputAudioTranscriptionCompleted?.({
                            transcript: message.item.content[0].transcript
                        });
                    } else {
                        // [TEMPORARY DEBUGGING CODE - REMOVE AFTER TESTING]
                        console.warn("DEBUG: Received conversation item create but no transcript found", message);
                    }
                    break;
                case "extension.middle_tier_tool_response":
                    onReceivedExtensionMiddleTierToolResponse?.(message);
                    break;
                case "bot.text.complete":
                    onReceivedResponseTextComplete?.(message);
                    break;
                case "bot.audio.format":
                    onReceivedAudioFormat?.(message);
                    break;
                case "bot.audio.stream":
                    onReceivedAudioStream?.(message);
                    break;
                case "bot.audio.complete":
                    onReceivedAudioComplete?.(message);
                    break;
                case "error":
                    // Only process if there's actual content
                    if (message && Object.keys(message).length > 0 && 
                        Object.keys(message).some(key => key !== 'type')) {
                        onReceivedError?.(message);
                    }
                    break;
            }
        } catch (e) {
            console.error('Error parsing WebSocket message:', e);
        }
    };

    return { 
        startSession, 
        addUserAudio, 
        inputAudioBufferClear,
        stopSession,
        sendTextInput,
        sendJsonMessage: safeSendJsonMessage
    };
}

1 Like

With the voice models, there are some options you could play with to make the output easier to transcribe, and to get good-quality output in general.

How about the voice you choose? It is possible that one voice is an easier source for transcriptions, especially considering the variety ranging from male to female personas.

You can also instruct the tone, giving instructions that the style of responses should be professional, slow, articulate, clear, less emotional, etc. An AI model following system instructions like these might actually be more well-spoken.

The temperature parameter at the lowest allowed value, 0.6, can improve audio prediction. Using audio models on chat completions, you can go lower, until the voices dramatically malfunction for some reason at very low temperature settings, or higher, where the voices start having strange artifacts and are far more “creative” in the language used.
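Putting those knobs together, a session.update along these lines is one way to experiment; the voice name and the instruction text are just examples to tune (safeSendJsonMessage is the helper from the code you posted above):

// Sketch: a realtime session.update combining voice choice, speaking-style
// instructions, and the minimum allowed temperature. The voice and the
// instruction wording are examples, not recommendations.
const sessionUpdate = {
    type: "session.update",
    session: {
        voice: "alloy",
        temperature: 0.6,   // lowest value the realtime API accepts
        instructions:
            "Speak slowly and clearly, in a professional and articulate tone, " +
            "with minimal emotional inflection.",
        input_audio_transcription: { model: "whisper-1" },
    },
};
safeSendJsonMessage(sessionUpdate);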

If the best transcription is not what you want to be heard: double your cost and run it twice. However, the AI might say something different the second time.