Issues with GPT-4o-transcribe API

Hi,

I’m new to using the OpenAI realtime API with GPT-4o-transcribe via WebSockets. My code successfully connects and streams audio from the microphone, but I’m experiencing poor quality and slow transcription responses.

Has anyone else encountered similar issues?

Below is my current code:

import os
import json
import base64
import threading
import pyaudio
import websocket
from dotenv import load_dotenv

load_dotenv()  # Load environment variables from .env

OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
if not OPENAI_API_KEY:
    raise RuntimeError("❌ OPENAI_API_KEY is missing!")

# WebSocket endpoint for OpenAI Realtime API (transcription model)
# url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-12-17"
url = "wss://api.openai.com/v1/realtime?intent=transcription"
headers = [
    "Authorization: Bearer " + OPENAI_API_KEY,
    "OpenAI-Beta: realtime=v1"
]

# Audio stream parameters (16-bit PCM, 16kHz mono)
RATE = 16000
CHANNELS = 1
FORMAT = pyaudio.paInt16
CHUNK = 1024

audio_interface = pyaudio.PyAudio()
stream = audio_interface.open(format=FORMAT,
                              channels=CHANNELS,
                              rate=RATE,
                              input=True,
                              frames_per_buffer=CHUNK)

def on_open(ws):
    print("Connected! Start speaking...")
    session_config = {
        "type": "transcription_session.update",
        "session": {
            "input_audio_format": "pcm16",
            "input_audio_transcription": {
                "model": "gpt-4o-transcribe",
                # "language": "zh",
                "prompt": "Respond in English."
            },
            "input_audio_noise_reduction": {"type": "near_field"},
            "turn_detection": {"type": "server_vad"}
        }
    }
    ws.send(json.dumps(session_config))

    def stream_microphone():
        try:
            while ws.keep_running:
                audio_data = stream.read(CHUNK, exception_on_overflow=False)
                audio_base64 = base64.b64encode(audio_data).decode('utf-8')
                ws.send(json.dumps({
                    "type": "input_audio_buffer.append",
                    "audio": audio_base64
                }))
        except Exception as e:
            print("Audio streaming error:", e)
            ws.close()

    threading.Thread(target=stream_microphone, daemon=True).start()


def on_message(ws, message):
    try:
        data = json.loads(message)
        event_type = data.get("type", "")
        print(data)  # debug: dump every incoming event
        # Stream live incremental transcripts
        if event_type == "conversation.item.input_audio_transcription.delta":
            transcript_piece = data.get("delta", "")
            if transcript_piece:
                print(transcript_piece, end=' ', flush=True)

    except Exception:
        pass  # Ignore unrelated events


def on_error(ws, error):
    print("WebSocket error:", error)

def on_close(ws, close_status_code, close_msg):
    print("Disconnected from server.")
    stream.stop_stream()
    stream.close()
    audio_interface.terminate()

print("Connecting to OpenAI Realtime API...")
ws_app = websocket.WebSocketApp(
    url,
    header=headers,
    on_open=on_open,
    on_message=on_message,
    on_error=on_error,
    on_close=on_close
)

ws_app.run_forever()

I am experiencing similar results.
But one thing to note about your code: they mention you need to use 24 kHz sampling, not 16 kHz. So maybe try changing RATE to 24000.
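In the snippet above that would just be a change to the audio parameters block, assuming your microphone and driver support 24 kHz capture (otherwise you would have to resample the chunks yourself):

# Audio stream parameters (16-bit PCM, 24 kHz mono, matching the pcm16 spec)
RATE = 24000
CHANNELS = 1
FORMAT = pyaudio.paInt16
CHUNK = 1024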


Where do they say that 24kHz sampling is needed?

Hidden under https://platform.openai.com/docs/api-reference/realtime-sessions/create#realtime-sessions-create-input_audio_format


Thanks! I changed the sampling rate and it might have improved slightly, I guess? But I’m still struggling with high latency and inaccuracy, especially compared to the GPT app.

Have you tried any other APIs that can do realtime transcription/translation?

I’m currently using AWS Transcribe, which works relatively well, but I was hoping gpt-4o-transcribe would be more accurate for less common languages. So far, though, the latency is high and the results feel worse than Whisper.

Try playing around with the ‘turn_detection’ parameters. By default, it seems to be trying to transcribe very aggressively.

  • threshold: Activation threshold (0 to 1). A higher threshold requires louder audio to activate the model, which might improve performance in noisy environments.
  • prefix_padding_ms: Amount of audio (in milliseconds) to include before the voice activity detection (VAD) detects speech.
  • silence_duration_ms: Duration of silence (in milliseconds) needed to detect the end of speech. Shorter values will detect turns more quickly.

When I first tried it, I was surprised by how poorly it performed, but try increasing prefix_padding_ms to 1 second (1000 ms) and see if that helps.
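For example, in the session_config sent in on_open above, you could replace the turn_detection entry with something like this (the values are just starting points to experiment with, not recommendations):

"turn_detection": {
    "type": "server_vad",
    "threshold": 0.6,            # higher = louder audio needed to trigger
    "prefix_padding_ms": 1000,   # include ~1 s of audio before detected speech
    "silence_duration_ms": 700   # wait longer before ending the turn
}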

You can also look into their Semantic VAD, which works well too: https://platform.openai.com/docs/guides/realtime-vad#semantic-vad.
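If you try semantic VAD instead, the same turn_detection entry changes shape; as far as I can tell from that guide, it takes an eagerness setting rather than thresholds, roughly:

"turn_detection": {
    "type": "semantic_vad",
    "eagerness": "low"   # "low" waits longer before ending a turn; the guide also lists "medium", "high" and "auto"
}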

The problem seems to be not with the model itself, but rather with how often it tries to process data.

I also have the same latency issue (the quality is good enough for me though).
Even the mini model (gpt-4o-mini-transcribe) is several times slower than Deepgram; it typically takes 1.5–2 s to produce transcripts, which is too slow for realtime conversation.

Does OpenAI really support WebSocket connections for this transcribe model?

That’s for the gpt-4o-realtime-preview.

I would like to know if that also holds for gpt-4o-transcribe. More specifically, does it downsample anything higher to 24 kHz or to 16 kHz? That’s why I came here. I know you may not know, but someone please tag OpenAI. Thanks.

The Realtime API, which exposes only gpt-4o variants, tells us about the underlying model itself:

Input

For pcm16, input audio must be 16-bit PCM at a 24kHz sample rate, single channel (mono), and little-endian byte order.

Output

For pcm16, output audio is sampled at a rate of 24kHz.

As this reflects the internal format the AI model is trained on (after convolution), anything the API accepts beyond that would need to be resampled to align with the encoded training corpus.
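In practice, that means audio captured at any other rate (16 kHz, 44.1 kHz, ...) should be resampled to 24 kHz mono pcm16 on the client before it is appended to the buffer. A rough sketch with scipy, assuming 16-bit mono input chunks (resample_poly is just one way to do this):

import numpy as np
from scipy.signal import resample_poly

def to_24k_pcm16(chunk: bytes, src_rate: int) -> bytes:
    """Resample a chunk of 16-bit mono PCM to 24 kHz for the Realtime API."""
    samples = np.frombuffer(chunk, dtype=np.int16).astype(np.float32)
    resampled = resample_poly(samples, 24000, src_rate)  # e.g. 16 kHz -> 24 kHz
    return np.clip(resampled, -32768, 32767).astype(np.int16).tobytes()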

That’s true for the Realtime API, but does it also hold for gpt-4o-transcribe? gpt-4o-transcribe is different from the Realtime API.

After extensive research with o3, it seems that it does, even though it is not explicitly acknowledged anywhere.

We don’t use the Realtime API. We use gpt-4o-transcribe with MP3s at a 44.1 kHz sample rate, and it works just fine.
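For anyone comparing the two paths: that non-streaming route is the regular transcriptions endpoint, which accepts compressed formats like MP3 and handles decoding/resampling server-side. Roughly, with the filename as a placeholder:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
with open("meeting.mp3", "rb") as f:  # placeholder filename; a 44.1 kHz MP3 is fine here
    result = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=f,
    )
print(result.text)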

What do you mean by o3?

The gpt-4o model itself is pretrained on audio; fine-tuning then trains it to take input audio and generate text.

We can infer that all audio the model encodes into tokens, for both input and output learning, would share a single unifying internal format.

One can then extrapolate that when “raw” or “pcm” audio is accepted or returned in only one format, across the several endpoints that expose such I/O, that format is the model’s native sample rate and channel count going into its codec.

Perceptually lossy audio like MP3 would be decoded to that required format.
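If you ever need to do that decode step yourself (e.g. to stream an MP3 over the WebSocket rather than upload it), pydub backed by ffmpeg is one way to get 24 kHz mono pcm16 out of it; a sketch, with the filename as a placeholder:

from pydub import AudioSegment  # requires ffmpeg to be installed

# Decode an MP3 and convert it to 24 kHz, mono, 16-bit PCM
audio = (
    AudioSegment.from_mp3("input.mp3")  # placeholder filename
    .set_frame_rate(24000)
    .set_channels(1)
    .set_sample_width(2)                # 2 bytes per sample = 16-bit
)
pcm16_bytes = audio.raw_data            # ready to chunk, base64-encode, and append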