Extracting Transcription Without Using input_audio.input_transcription in OpenAI API

Hi everyone,

I’m working with OpenAI’s real-time transcription API and noticed that input_audio.input_transcription provides the transcribed text. However, I want to extract the transcription using a different approach—without relying on this built-in property.

Are there any other event properties or API methods that can provide the transcribed text? Specifically, is there a way to get the transcription from the response object or another API event?

Would appreciate any insights or alternative approaches!

Thanks!

1 Like

You could try storing the audio that gets sent and passing that to the Whisper API directly, or to a third-party transcription service.

3 Likes

Sure. OpenAI’s speech-to-text engine is called Whisper, and it has its own API endpoint where you can send audio and receive a transcription, at a significantly lower price than audio input sent to the GPT-4o audio models (which are meant for actually reasoning about the audio and generating AI-powered chat responses, not just transcribing it).

The transcription that you receive along with the realtime API response is actually powered by Whisper, using the AI’s generated speech as an input, so a separate call to Whisper will likely not be all that different in quality. You can also use Whisper for a transcript of what the user said as input, for presentation in a chat user interface.
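For example, a minimal sketch with the official openai Node SDK; the file name is just a placeholder for whatever audio you have saved:

// Sketch: transcribing a saved audio file with the Whisper endpoint.
// Assumes the "openai" Node SDK and OPENAI_API_KEY in the environment;
// "user_turn.wav" is only an illustrative file name.
import fs from "fs";
import OpenAI from "openai";

const client = new OpenAI();

async function transcribeFile(path) {
    const result = await client.audio.transcriptions.create({
        file: fs.createReadStream(path),    // flac, mp3, m4a, ogg, wav, webm, ...
        model: "whisper-1",
        language: "en",                     // optional hint, can help accuracy
        // prompt: "domain-specific terms", // optional vocabulary/style hint
    });
    return result.text;                     // plain transcript string
}

console.log(await transcribeFile("user_turn.wav"));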

1 Like

I tried using whisper-1, but the accuracy is not that good, even for English. The transcription quality isn’t great, so I am looking for a way to get good accuracy in transcription.

Whisper is also open source, so you can use a few different model variants besides the single choice on OpenAI’s API, such as an English-only model, but you would need your own inference server or an alternate provider capable of running the model.

The output of Whisper is a formless string of sentences. You can have a separate language model “fix up” the appearance of the text, and also have it proofread to whatever degree you instruct, with an equal chance that it corrects something or alters something in an undesired manner.
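A rough sketch of that cleanup pass, again assuming the openai Node SDK; the model choice and the instruction wording here are placeholders you would tune:

// Sketch: post-processing a Whisper transcript with a chat model.
// The model name and instructions are placeholders, not recommendations.
import OpenAI from "openai";

const client = new OpenAI();

async function polishTranscript(rawTranscript) {
    const completion = await client.chat.completions.create({
        model: "gpt-4o-mini",
        temperature: 0,
        messages: [
            {
                role: "system",
                content:
                    "You reformat speech-to-text transcripts. Add punctuation and paragraph breaks. " +
                    "Fix only obvious transcription errors; never change the meaning or add content.",
            },
            { role: "user", content: rawTranscript },
        ],
    });
    return completion.choices[0].message.content;
}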

You can also process the audio more before sending it for transcription, such as normalizing the levels and filtering to just the voice passband like telephony, and see if you get better results, although the AI generation is already pretty clear. GPT-4o voice output being inherently artificial and synthesized may be the ultimate source of challenges.
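If you want to try that in the browser, here is a rough sketch using the Web Audio API; the cutoff frequencies and the compressor are only example values, not tuned settings:

// Sketch: band-pass filtering and crude level compression before transcription.
// Runs in the browser; all filter settings here are example values only.
async function preprocessForTranscription(arrayBuffer) {
    // Decode whatever the recorder produced into raw PCM.
    const decoded = await new AudioContext().decodeAudioData(arrayBuffer);

    // Render offline to mono 16 kHz, which is plenty for speech.
    const offline = new OfflineAudioContext(1, Math.ceil(decoded.duration * 16000), 16000);
    const source = offline.createBufferSource();
    source.buffer = decoded;

    const highpass = offline.createBiquadFilter();
    highpass.type = "highpass";
    highpass.frequency.value = 300;   // drop rumble below the voice band

    const lowpass = offline.createBiquadFilter();
    lowpass.type = "lowpass";
    lowpass.frequency.value = 3400;   // telephony-style upper cutoff

    const leveler = offline.createDynamicsCompressor(); // rough auto-leveling

    source.connect(highpass).connect(lowpass).connect(leveler).connect(offline.destination);
    source.start();

    return offline.startRendering(); // AudioBuffer you can re-encode (e.g., to WAV)
}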

Recommending products better than OpenAI is a bit beyond the scope of the OpenAI forum - and in this case, it’s also hard to make recommendations that would be definite improvements.

1 Like

Are you using the language parameter to increase accuracy?

I tried that, but it is not working properly yet.

Just want to ask one thing: is there an audio format that is most suitable for getting the best output? I am using a client-server architecture, so the audio is base64-encoded when it is sent to the backend for processing.

"use client";

import { useRef, useCallback } from "react";
import { Recorder } from "@/components/audio/recorder";
import { encode } from "base64-arraybuffer";  // ✅ Correct Base64 encoding

const BUFFER_SIZE = 4096;  // ✅ Adjusted for better performance
const SILENCE_THRESHOLD = 10;  // ✅ Average byte magnitude (0–255) below which input counts as silence

export default function useAudioRecorder({ onAudioRecorded, onAudioProcessingStarted }) {
    const audioRecorder = useRef(null);
    const buffer = useRef(new Uint8Array());
    const analyserNode = useRef(null); // ✅ Silence detection
    const silenceTimerRef = useRef(null);
    const audioContextRef = useRef(null);

    // ✅ Append data to buffer safely
    const appendToBuffer = (newData) => {
        const newBuffer = new Uint8Array(buffer.current.length + newData.length);
        newBuffer.set(buffer.current);
        newBuffer.set(newData, buffer.current.length);
        buffer.current = newBuffer;
    };

    // ✅ Silence Detection (Web Audio API)
    const detectSilence = (stream) => {
        const audioContext = new AudioContext();
        audioContextRef.current = audioContext;
        const source = audioContext.createMediaStreamSource(stream);
        const analyser = audioContext.createAnalyser();
        analyser.fftSize = 512;
        source.connect(analyser);
        analyserNode.current = analyser;
    };

    const handleAudioData = (data) => {
        if (silenceTimerRef.current) clearTimeout(silenceTimerRef.current);

        // ✅ Check if user is silent (getByteFrequencyData returns 0–255 magnitudes, not dB)
        if (analyserNode.current) {
            const dataArray = new Uint8Array(analyserNode.current.frequencyBinCount);
            analyserNode.current.getByteFrequencyData(dataArray);
            const avgVolume = dataArray.reduce((a, b) => a + b, 0) / dataArray.length;
            
            if (avgVolume < SILENCE_THRESHOLD) {
                silenceTimerRef.current = setTimeout(() => {
                    onAudioProcessingStarted?.();
                }, 1000);
            }
        }

        appendToBuffer(new Uint8Array(data));

        if (buffer.current.length >= BUFFER_SIZE) {
            const toSend = new Uint8Array(buffer.current.slice(0, BUFFER_SIZE));
            buffer.current = new Uint8Array(buffer.current.slice(BUFFER_SIZE));

            // ✅ Proper Base64 encoding (encode() expects an ArrayBuffer)
            const base64 = encode(toSend.buffer);
            onAudioRecorded(base64);
        }
    };

    const start = async () => {
        if (!audioRecorder.current) {
            audioRecorder.current = new Recorder(handleAudioData);
        }

        if (silenceTimerRef.current) {
            clearTimeout(silenceTimerRef.current);
            silenceTimerRef.current = null;
        }

        const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
        detectSilence(stream);
        await audioRecorder.current.start(stream);
    };

    const stop = async () => {
        if (silenceTimerRef.current) {
            clearTimeout(silenceTimerRef.current);
            silenceTimerRef.current = null;
        }

        if (audioRecorder.current) {
            await audioRecorder.current.stop();
            audioRecorder.current = null;
            buffer.current = new Uint8Array();
        }

        if (audioContextRef.current) {
            audioContextRef.current.close();
            audioContextRef.current = null;
        }
    };

    return { start, stop };
}

Your code doesn’t show any Whisper usage; it’s just for gpt-4o audio.

Again: there is no point in using the realtime API and multimodal AI language models solely for speech-to-text. It is more expensive than Whisper and more prone to mistakes. If you want a transcription: just send to Whisper.

GPT-4o audio on the realtime API gives you a choice of pcm16 or 8-bit telephony formats. Use pcm16, which returns 24 kHz mono audio, unless you have a reason for the others. The parallel transcription is useful when the AI has said something new, like answering a question that was asked, not as a transcription service for your input.


For Whisper input, you can also try to improve the recording quality of the user’s voice with audio processing: a microphone-setup step in the user interface, silence removal, auto-leveling, noise cancelling. Some of these could be automatic in code, and some could be done manually on files with audio software.

For Whisper, the input file must be one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm. The best quality would be to send flac or wav, which do not involve a second lossy encoding. There’s a maximum file size, though, so compression can be useful despite its lossy nature; Opus compression in an ogg file can get you to over an hour in one API call.
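If you are buffering the raw pcm16 from the realtime session yourself, one sketch of getting it into an accepted container is to prepend a standard WAV header before uploading; the parameters below assume the default 24 kHz, 16-bit, mono output:

// Sketch: wrapping raw pcm16 (16-bit little-endian, mono, 24 kHz) in a WAV container
// so it can be sent to the transcription endpoint. Assumes you have already
// concatenated the realtime audio deltas into a single Uint8Array of PCM bytes.
function pcm16ToWav(pcmBytes, sampleRate = 24000, channels = 1) {
    const header = new ArrayBuffer(44);
    const view = new DataView(header);
    const writeString = (offset, s) => {
        for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i));
    };
    const byteRate = sampleRate * channels * 2;    // 16-bit samples

    writeString(0, "RIFF");
    view.setUint32(4, 36 + pcmBytes.length, true); // chunk size: file size minus 8
    writeString(8, "WAVE");
    writeString(12, "fmt ");
    view.setUint32(16, 16, true);                  // fmt sub-chunk size
    view.setUint16(20, 1, true);                   // audio format: PCM
    view.setUint16(22, channels, true);
    view.setUint32(24, sampleRate, true);
    view.setUint32(28, byteRate, true);
    view.setUint16(32, channels * 2, true);        // block align
    view.setUint16(34, 16, true);                  // bits per sample
    writeString(36, "data");
    view.setUint32(40, pcmBytes.length, true);

    return new Blob([header, pcmBytes], { type: "audio/wav" });
}

The resulting Blob can then be uploaded as a .wav file, or converted to flac/ogg first if you need to stay under the size limit.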

Actually, what I am building is a voice RAG, so I take the input as voice, but the issue is that the transcription doesn’t come back with good accuracy.

And the code related to Whisper is here:


"use client";

import useWebSocket from "react-use-websocket";
import { useEffect, useRef } from "react";

/**
 * @typedef {Object} Parameters
 * @property {boolean} [useDirectAoaiApi]
 * @property {string} [aoaiEndpointOverride]
 * @property {string} [aoaiApiKeyOverride]
 * @property {string} [aoaiModelOverride]
 * @property {boolean} [enableInputAudioTranscription]
 * @property {Function} [onWebSocketOpen]
 * @property {Function} [onWebSocketClose]
 * @property {Function} [onWebSocketError]
 * @property {Function} [onWebSocketMessage]
 * @property {Function} [onReceivedResponseAudioDelta]
 * @property {Function} [onReceivedInputAudioBufferSpeechStarted]
 * @property {Function} [onReceivedResponseDone]
 * @property {Function} [onReceivedExtensionMiddleTierToolResponse]
 * @property {Function} [onReceivedResponseAudioTranscriptDelta]
 * @property {Function} [onReceivedInputAudioTranscriptionCompleted]
 * @property {Function} [onReceivedError]
 * @property {Function} [onReceivedMicControl]
 * @property {Function} [onReceivedResponseTextDelta]
 * @property {Function} [onReceivedResponseTextComplete] 
 * @property {Function} [onReceivedAudioFormat]
 * @property {Function} [onReceivedAudioStream]
 * @property {Function} [onReceivedAudioComplete]
 * @property {Function} [onReceivedSpeakerStatus]
 * @property {Function} [onReceivedLanguageStatus]
 * @property {boolean} [shouldConnect]
 */
export default function useRealTime({
    useDirectAoaiApi,
    aoaiEndpointOverride,
    aoaiApiKeyOverride,
    aoaiModelOverride,
    enableInputAudioTranscription,
    onWebSocketOpen,
    onWebSocketClose,
    onWebSocketError,
    onWebSocketMessage,
    onReceivedResponseDone,
    onReceivedResponseAudioDelta,
    onReceivedResponseAudioTranscriptDelta,
    onReceivedInputAudioBufferSpeechStarted,
    onReceivedExtensionMiddleTierToolResponse,
    onReceivedInputAudioTranscriptionCompleted,
    onReceivedError,
    onReceivedMicControl,
    onReceivedResponseTextDelta,
    onReceivedResponseTextComplete, 
    onReceivedAudioFormat,
    onReceivedAudioStream,
    onReceivedAudioComplete,
    onReceivedSpeakerStatus,
    onReceivedLanguageStatus,
    shouldConnect = false
}) {
    const wsEndpoint = useDirectAoaiApi
        ? `${aoaiEndpointOverride}/openai/realtime?api-key=${aoaiApiKeyOverride}&deployment=${aoaiModelOverride}&api-version=2024-10-01-preview`
        : `${process.env.NEXT_PUBLIC_WS_URL}/realtime`;

    const { sendJsonMessage, readyState, lastJsonMessage } = useWebSocket(wsEndpoint, {
        onOpen: () => {
            onWebSocketOpen?.();
        },
        onClose: () => onWebSocketClose?.(),
        onError: event => onWebSocketError?.(event),
        onMessage: event => {
            onMessageReceived(event);
            onWebSocketMessage?.(event);
        },
        shouldReconnect: () => shouldConnect,
        reconnectAttempts: 10,
        reconnectInterval: 3000,
    }, 
    shouldConnect
    );

    // Simplified safeSendJsonMessage that doesn't rely on ReadyState
    const safeSendJsonMessage = (message) => {
        try {
            sendJsonMessage(message);
        } catch (error) {
            console.error('Error sending message:', error);
        }
    };

    const startSession = (userLanguage = "en") => {
        const command = {
            type: "session.update",
            session: {
                turn_detection: {
                    type: "server_vad"
                },
                input_audio_transcription: {
                    model: "whisper-1",
                    language: userLanguage
                }
            }
        };
        safeSendJsonMessage(command);
    };

    const addUserAudio = (base64Audio) => {
        if (!base64Audio || typeof base64Audio !== "string") {
            console.error("Invalid base64Audio data:", base64Audio);
            return;
        }
    
        const command = {
            type: "input_audio_buffer.append",
            audio: base64Audio
        };
    
        console.log("Sending audio data:", command);
        safeSendJsonMessage(command);
    };

    const inputAudioBufferClear = () => {
        const command = {
            type: "input_audio_buffer.clear"
        };

        safeSendJsonMessage(command);
    };

    const stopSession = () => {
        inputAudioBufferClear();
    };

    const sendTextInput = async (text) => {
        if (text.trim()) {
            const textCommand = {
                type: "conversation.item.create",
                item: {
                    type: "message",
                    role: "user",
                    content: [
                        {
                            type: "input_text",
                            text: text
                        }
                    ]
                }
            };
            safeSendJsonMessage(textCommand);

            const responseCommand = {
                type: "response.create"
            };
            safeSendJsonMessage(responseCommand);
        }
    };

    const onMessageReceived = (event) => {
        try {
            const message = JSON.parse(event.data);
            // [TEMPORARY DEBUGGING CODE - REMOVE AFTER TESTING]
            // Log all message types to help troubleshoot
            console.log(`WebSocket received message type: ${message.type}`, message);
            
            switch (message.type) {
                case "response.done":
                    onReceivedResponseDone?.(message);
                    break;
                case "response.audio.delta":
                    onReceivedResponseAudioDelta?.(message);
                    break;
                case "response.audio_transcript.delta":
                    onReceivedResponseAudioTranscriptDelta?.(message);
                    break;
                case "response.text.delta":
                    onReceivedResponseTextDelta?.(message.delta);
                    break;
                case "response.text.done":
                    if (message.text) {
                        onReceivedResponseTextDelta?.(message.text);
                    }
                    break;
                case "input_audio_buffer.speech_started":
                    onReceivedInputAudioBufferSpeechStarted?.(message);
                    break;
                case "mic_control":
                    // Handle microphone control messages from the backend
                    onReceivedMicControl?.(message);
                    break;
                case "speaker.status":
                    // Handle speaker status update from the backend
                    onReceivedSpeakerStatus?.(message);
                    break;
                case "language.status":
                    // Handle language status update from the backend
                    onReceivedLanguageStatus?.(message);
                    break;
                case "conversation.item.input_audio_transcription.completed":
                    if (message.item?.content?.[0]?.transcript) {
                        console.log("Final transcript:", message.item.content[0].transcript);
                        onReceivedInputAudioTranscriptionCompleted?.({
                            transcript: message.item.content[0].transcript
                        });
                    } else {
                        console.warn("Received transcription event but no transcript found", message);
                    }
                    break;
                case "conversation.item.create":
                    // [TEMPORARY DEBUGGING CODE - REMOVE AFTER TESTING]
                    console.log("DEBUG: Received conversation item create event", message);
                    if (message.item?.content?.[0]?.transcript) {
                        // [TEMPORARY DEBUGGING CODE - REMOVE AFTER TESTING]
                        console.log("DEBUG: Found transcript in conversation item:", message.item.content[0].transcript);
                        onReceivedInputAudioTranscriptionCompleted?.({
                            transcript: message.item.content[0].transcript
                        });
                    } else {
                        // [TEMPORARY DEBUGGING CODE - REMOVE AFTER TESTING]
                        console.warn("DEBUG: Received conversation item create but no transcript found", message);
                    }
                    break;
                case "extension.middle_tier_tool_response":
                    onReceivedExtensionMiddleTierToolResponse?.(message);
                    break;
                case "bot.text.complete":
                    onReceivedResponseTextComplete?.(message);
                    break;
                case "bot.audio.format":
                    onReceivedAudioFormat?.(message);
                    break;
                case "bot.audio.stream":
                    onReceivedAudioStream?.(message);
                    break;
                case "bot.audio.complete":
                    onReceivedAudioComplete?.(message);
                    break;
                case "error":
                    // Only process if there's actual content
                    if (message && Object.keys(message).length > 0 && 
                        Object.keys(message).some(key => key !== 'type')) {
                        onReceivedError?.(message);
                    }
                    break;
            }
        } catch (e) {
            console.error('Error parsing WebSocket message:', e);
        }
    };

    return { 
        startSession, 
        addUserAudio, 
        inputAudioBufferClear,
        stopSession,
        sendTextInput,
        sendJsonMessage: safeSendJsonMessage
    };
}

1 Like

With the voice models, there are some options you could play with to make the output easier to transcribe, and to get good-quality output in general.

How about the voice you choose? It is possible that one voice is an easier source for transcriptions, especially considering the variety ranging from male to female personas.

You can also instruct the tone, giving instructions that the style of responses should be professional, slow, articulate, clear, less emotional, etc. An AI model following system instructions like these might actually be more well-spoken.

The temperature parameter at the lowest allowed value, 0.6, can improve audio prediction. Using audio models on chat completions, you can go lower, until the voices dramatically malfunction for some reason at very low temperature settings, or higher, where the voices start having strange artifacts and are far more “creative” in the language used.
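Putting those knobs together, a session.update along these lines is one way to experiment; the voice name and the instruction text are just examples to tune (safeSendJsonMessage is the helper from the code you posted above):

// Sketch: a realtime session.update combining voice choice, speaking-style
// instructions, and the minimum allowed temperature. The voice and the
// instruction wording are examples, not recommendations.
const sessionUpdate = {
    type: "session.update",
    session: {
        voice: "alloy",
        temperature: 0.6,   // lowest value the realtime API accepts
        instructions:
            "Speak slowly and clearly, in a professional and articulate tone, " +
            "with minimal emotional inflection.",
        input_audio_transcription: { model: "whisper-1" },
    },
};
safeSendJsonMessage(sessionUpdate);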

If the best transcription is not what you want to be heard: double your cost and run it twice. However, the AI might say something different the second time.