Issues with GPT-4o-transcribe API

Hi,

I’m new to using the OpenAI realtime API with GPT-4o-transcribe via WebSockets. My code successfully connects and streams audio from the microphone, but I’m experiencing poor quality and slow transcription responses.

Has anyone else encountered similar issues?

Below is my current code:

import os
import json
import base64
import threading
import pyaudio
import websocket
from dotenv import load_dotenv

load_dotenv()  # Load environment variables from .env

OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
if not OPENAI_API_KEY:
    raise RuntimeError("❌ OPENAI_API_KEY is missing!")

# WebSocket endpoint for OpenAI Realtime API (transcription model)
# url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-12-17"
url = "wss://api.openai.com/v1/realtime?intent=transcription"
headers = [
    "Authorization: Bearer " + OPENAI_API_KEY,
    "OpenAI-Beta: realtime=v1"
]

# Audio stream parameters (16-bit PCM, 16kHz mono)
RATE = 16000
CHANNELS = 1
FORMAT = pyaudio.paInt16
CHUNK = 1024

audio_interface = pyaudio.PyAudio()
stream = audio_interface.open(format=FORMAT,
                              channels=CHANNELS,
                              rate=RATE,
                              input=True,
                              frames_per_buffer=CHUNK)

def on_open(ws):
    print("Connected! Start speaking...")
    session_config = {
        "type": "transcription_session.update",
        "session": {
            "input_audio_format": "pcm16",
            "input_audio_transcription": {
                "model": "gpt-4o-transcribe",
                # "language": "zh",
                "prompt": "Respond in English."
            },
            "input_audio_noise_reduction": {"type": "near_field"},
            "turn_detection": {"type": "server_vad"}
        }
    }
    ws.send(json.dumps(session_config))

    def stream_microphone():
        try:
            while ws.keep_running:
                audio_data = stream.read(CHUNK, exception_on_overflow=False)
                audio_base64 = base64.b64encode(audio_data).decode('utf-8')
                ws.send(json.dumps({
                    "type": "input_audio_buffer.append",
                    "audio": audio_base64
                }))
        except Exception as e:
            print("Audio streaming error:", e)
            ws.close()

    threading.Thread(target=stream_microphone, daemon=True).start()


def on_message(ws, message):
    try:
        data = json.loads(message)
        event_type = data.get("type", "")
        print(data)  # debug: dump every incoming event
        # Stream live incremental transcripts
        if event_type == "conversation.item.input_audio_transcription.delta":
            transcript_piece = data.get("delta", "")
            if transcript_piece:
                print(transcript_piece, end=' ', flush=True)

    except Exception:
        pass  # Ignore unrelated events


def on_error(ws, error):
    print("WebSocket error:", error)

def on_close(ws, close_status_code, close_msg):
    print("Disconnected from server.")
    stream.stop_stream()
    stream.close()
    audio_interface.terminate()

print("Connecting to OpenAI Realtime API...")
ws_app = websocket.WebSocketApp(
    url,
    header=headers,
    on_open=on_open,
    on_message=on_message,
    on_error=on_error,
    on_close=on_close
)

ws_app.run_forever()

I am experiencing similar results.
But one thing to note about your code: they mention you need to use 24 kHz sampling, not 16 kHz. So maybe try changing RATE to 24000.
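In the snippet above that would just be a change to the audio parameters block, assuming your microphone and driver support 24 kHz capture (otherwise you would have to resample the chunks yourself):

# Audio stream parameters (16-bit PCM, 24 kHz mono, matching the pcm16 spec)
RATE = 24000
CHANNELS = 1
FORMAT = pyaudio.paInt16
CHUNK = 1024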


Where do they say that 24kHz sampling is needed?

Hidden under https://platform.openai.com/docs/api-reference/realtime-sessions/create#realtime-sessions-create-input_audio_format


Thanks! I changed the sampling rate and it might have improved slightly, I guess? But I’m still struggling with high latency and inaccuracy, especially compared to the GPT app.

Have you tried any other APIs that can do realtime transcription/translation?

I’m currently using AWS Transcribe, which works relatively well, but I was hoping gpt-4o-transcribe would be more accurate for less common languages. So far, though, the latency is high and the results feel worse than Whisper.

Try playing around with the ‘turn_detection’ parameters. By default, it seems to be trying to transcribe very aggressively.

  • threshold: Activation threshold (0 to 1). A higher threshold requires louder audio to activate the model, which might improve performance in noisy environments.
  • prefix_padding_ms: Amount of audio (in milliseconds) to include before the voice activity detection (VAD) detects speech.
  • silence_duration_ms: Duration of silence (in milliseconds) needed to detect the end of speech. Shorter values will detect turns more quickly.

When I first tried it, I was surprised by how poorly it performed, but try increasing prefix_padding_ms to 1 second (1000 ms) and see if that helps.
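For example, in the session_config sent in on_open above, you could replace the turn_detection entry with something like this (the values are just starting points to experiment with, not recommendations):

"turn_detection": {
    "type": "server_vad",
    "threshold": 0.6,            # higher = louder audio needed to trigger
    "prefix_padding_ms": 1000,   # include ~1 s of audio before detected speech
    "silence_duration_ms": 700   # wait longer before ending the turn
}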

You can also look into their Semantic VAD, which works well too: https://platform.openai.com/docs/guides/realtime-vad#semantic-vad.
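If you try semantic VAD instead, the same turn_detection entry changes shape; as far as I can tell from that guide, it takes an eagerness setting rather than thresholds, roughly:

"turn_detection": {
    "type": "semantic_vad",
    "eagerness": "low"   # "low" waits longer before ending a turn; the guide also lists "medium", "high" and "auto"
}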

The problem seems to be not with the model itself, but rather with how often it tries to process data.

I also have the same latency issue (the quality is good enough for me though).
Even the mini model (gpt-4o-mini-transcribe) is several times slower than Deepgram; it typically takes 1.5–2 s to produce transcripts, which is too slow for realtime conversation.

Does OpenAI really support WebSocket connections for this transcribe model?

That’s for the gpt-4o-realtime-preview.

I would like to know if that also holds for gpt-4o-transcribe. More specifically, does it downsample anything higher to 24 kHz or to 16 kHz? That’s why I came here. I know you may not know, but someone please tag OpenAI. Thanks.

The Realtime API, which exposes only gpt-4o variants, tells us about the underlying model itself:

Input

For pcm16, input audio must be 16-bit PCM at a 24kHz sample rate, single channel (mono), and little-endian byte order.

Output

For pcm16, output audio is sampled at a rate of 24kHz.

As this reflects the internal format the AI model is trained on (after convolution), anything the API accepts beyond that would need to be resampled to align with the encoded training corpus.
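In practice, that means audio captured at any other rate (16 kHz, 44.1 kHz, ...) should be resampled to 24 kHz mono pcm16 on the client before it is appended to the buffer. A rough sketch with scipy, assuming 16-bit mono input chunks (resample_poly is just one way to do this):

import numpy as np
from scipy.signal import resample_poly

def to_24k_pcm16(chunk: bytes, src_rate: int) -> bytes:
    """Resample a chunk of 16-bit mono PCM to 24 kHz for the Realtime API."""
    samples = np.frombuffer(chunk, dtype=np.int16).astype(np.float32)
    resampled = resample_poly(samples, 24000, src_rate)  # e.g. 16 kHz -> 24 kHz
    return np.clip(resampled, -32768, 32767).astype(np.int16).tobytes()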

That’s true for the Realtime API, but does it also hold for gpt-4o-transcribe? gpt-4o-transcribe is different from the Realtime API.

After extensive research with o3, it seems that it does, even though it is not explicitly acknowledged anywhere.

We don’t use the Realtime API. We use gpt-4o-transcribe with MP3s at a 44.1 kHz sample rate, and it works just fine.
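For anyone comparing the two paths: that non-streaming route is the regular transcriptions endpoint, which accepts compressed formats like MP3 and handles decoding/resampling server-side. Roughly, with the filename as a placeholder:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
with open("meeting.mp3", "rb") as f:  # placeholder filename; a 44.1 kHz MP3 is fine here
    result = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=f,
    )
print(result.text)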

What do you mean by o3?

The gpt-4o model itself is pretrained on audio; fine-tuning then trains it to take input audio and generate text.

We can infer that all audio the model encodes into tokens, for both input and output learning, would share a single unifying internal format.

One can then extrapolate that when “raw” or “pcm” audio is accepted or returned in only one format, across the several endpoints that expose such I/O, that format is the model’s native sample rate and channel count going into its codec.

Perceptually lossy audio like MP3 would be decoded to that required format.
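If you ever need to do that decode step yourself (e.g. to stream an MP3 over the WebSocket rather than upload it), pydub backed by ffmpeg is one way to get 24 kHz mono pcm16 out of it; a sketch, with the filename as a placeholder:

from pydub import AudioSegment  # requires ffmpeg to be installed

# Decode an MP3 and convert it to 24 kHz, mono, 16-bit PCM
audio = (
    AudioSegment.from_mp3("input.mp3")  # placeholder filename
    .set_frame_rate(24000)
    .set_channels(1)
    .set_sample_width(2)                # 2 bytes per sample = 16-bit
)
pcm16_bytes = audio.raw_data            # ready to chunk, base64-encode, and append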