Issues with GPT-4o-transcribe API

Hi,

I’m new to using the OpenAI realtime API with GPT-4o-transcribe via WebSockets. My code successfully connects and streams audio from the microphone, but I’m experiencing poor quality and slow transcription responses.

Has anyone else encountered similar issues?

Below is my current code:

import os
import json
import base64
import threading
import pyaudio
import websocket
from dotenv import load_dotenv

load_dotenv()  # Load environment variables from .env

OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
if not OPENAI_API_KEY:
    raise RuntimeError("❌ OPENAI_API_KEY is missing!")

# WebSocket endpoint for OpenAI Realtime API (transcription model)
# url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-12-17"
url = "wss://api.openai.com/v1/realtime?intent=transcription"
headers = [
    "Authorization: Bearer " + OPENAI_API_KEY,
    "OpenAI-Beta: realtime=v1"
]

# Audio stream parameters (16-bit PCM, 16kHz mono)
RATE = 16000
CHANNELS = 1
FORMAT = pyaudio.paInt16
CHUNK = 1024

audio_interface = pyaudio.PyAudio()
stream = audio_interface.open(format=FORMAT,
                              channels=CHANNELS,
                              rate=RATE,
                              input=True,
                              frames_per_buffer=CHUNK)

def on_open(ws):
    print("Connected! Start speaking...")
    session_config = {
        "type": "transcription_session.update",
        "session": {
            "input_audio_format": "pcm16",
            "input_audio_transcription": {
                "model": "gpt-4o-transcribe",
                # "language": "zh",
                "prompt": "Respond in English."
            },
            "input_audio_noise_reduction": {"type": "near_field"},
            "turn_detection": {"type": "server_vad"}
        }
    }
    ws.send(json.dumps(session_config))

    def stream_microphone():
        try:
            while ws.keep_running:
                audio_data = stream.read(CHUNK, exception_on_overflow=False)
                audio_base64 = base64.b64encode(audio_data).decode('utf-8')
                ws.send(json.dumps({
                    "type": "input_audio_buffer.append",
                    "audio": audio_base64
                }))
        except Exception as e:
            print("Audio streaming error:", e)
            ws.close()

    threading.Thread(target=stream_microphone, daemon=True).start()


def on_message(ws, message):
    try:
        data = json.loads(message)
        event_type = data.get("type", "")
        print(data)  # Debug: print every incoming event
        # Stream live incremental transcripts
        if event_type == "conversation.item.input_audio_transcription.delta":
            transcript_piece = data.get("delta", "")
            if transcript_piece:
                print(transcript_piece, end=' ', flush=True)

    except Exception:
        pass  # Ignore unrelated events





def on_error(ws, error):
    print("WebSocket error:", error)

def on_close(ws, close_status_code, close_msg):
    print("Disconnected from server.")
    stream.stop_stream()
    stream.close()
    audio_interface.terminate()

print("Connecting to OpenAI Realtime API...")
ws_app = websocket.WebSocketApp(
    url,
    header=headers,
    on_open=on_open,
    on_message=on_message,
    on_error=on_error,
    on_close=on_close
)

ws_app.run_forever()

I am experiencing similar results.
One thing to note about your code: the docs say the pcm16 input format requires 24kHz sampling, not 16kHz. So maybe try changing RATE to 24000.
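For reference, a minimal sketch of that change against the audio settings in the original post (everything else stays the same):

# The Realtime API's pcm16 input format expects 24 kHz, 16-bit, mono PCM
RATE = 24000
CHANNELS = 1
FORMAT = pyaudio.paInt16
CHUNK = 1024  # roughly 43 ms of audio per read at 24 kHz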


Where do they say that 24kHz sampling is needed?

Hidden under https://platform.openai.com/docs/api-reference/realtime-sessions/create#realtime-sessions-create-input_audio_format


Thanks! I changed the sampling rate and it might have improved slightly, I guess? But I'm still struggling with high latency and inaccuracy, especially compared to the ChatGPT app.

Have you tried any other APIs that can do realtime transcription/translation?

I'm currently using AWS Transcribe, which works relatively well, but I was hoping gpt-4o-transcribe would be more accurate for less common languages. Instead, the latency is high and the results feel worse than Whisper.

Try playing around with the ‘turn_detection’ parameters. By default, it seems to be trying to transcribe very aggressively.

  • threshold: Activation threshold (0 to 1). A higher threshold requires louder audio to activate the model, which might improve performance in noisy environments.
  • prefix_padding_ms: Amount of audio (in milliseconds) to include before the voice activity detection (VAD) detects speech.
  • silence_duration_ms: Duration of silence (in milliseconds) needed to detect the end of speech. Shorter values will detect turns more quickly.

When I first tried it, I was surprised by how poorly it performed, but try increasing prefix_padding_ms to 1 second and see if that helps; a sketch of the full session update is below.
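Here is a minimal sketch of what that could look like, reusing the transcription_session.update event from the original post; the threshold and silence values are just assumed starting points to experiment with, not recommendations:

# Sketch: tune server VAD inside the transcription_session.update event
session_config = {
    "type": "transcription_session.update",
    "session": {
        "input_audio_format": "pcm16",
        "input_audio_transcription": {"model": "gpt-4o-transcribe"},
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.6,            # higher -> needs louder audio to trigger
            "prefix_padding_ms": 1000,   # keep 1 s of audio before detected speech
            "silence_duration_ms": 700   # wait longer before closing the turn
        }
    }
}
ws.send(json.dumps(session_config))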

You can also look into their Semantic VAD, which works well too: https://platform.openai.com/docs/guides/realtime-vad#semantic-vad.
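If you go the semantic VAD route, the turn_detection block changes to something like the sketch below. The eagerness values come from the linked guide (check the docs for the current options); lower eagerness makes the model wait longer before deciding a turn is over, trading a bit of latency for fewer choppy partial transcripts:

# Sketch: switch the existing session to semantic VAD
ws.send(json.dumps({
    "type": "transcription_session.update",
    "session": {
        "turn_detection": {
            "type": "semantic_vad",
            "eagerness": "low"   # per the guide: "low", "medium", "high", or "auto"
        }
    }
}))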

The problem seems to be not with the model itself, but rather with how often it tries to process data.

I also have the same latency issue (the quality is good enough for me, though).
Even the mini model (gpt-4o-mini-transcribe) is several times slower than Deepgram; it typically takes 1.5-2 s to output transcripts, which is too slow for realtime conversation.