I’m new to using the OpenAI realtime API with GPT-4o-transcribe via WebSockets. My code successfully connects and streams audio from the microphone, but I’m experiencing poor quality and slow transcription responses.
I am experiencing similar results.
But note something about your code: OpenAI's docs mention you need to use 24 kHz sampling, not 16 kHz. So maybe try changing RATE to 24000.
Thanks! I changed the sampling rate and it might have improved slightly, I guess? But I'm still struggling with high latency and inaccuracy, especially compared to the GPT app.
I'm currently using AWS Transcribe, which works relatively well, but I was hoping for more accuracy on less common languages from gpt-4o-transcribe. Instead, the latency is high and the results feel worse than Whisper.
Try playing around with the ‘turn_detection’ parameters. By default, it seems to be trying to transcribe very aggressively.
threshold: Activation threshold (0 to 1). A higher threshold requires louder audio to activate the model, which might improve performance in noisy environments.
prefix_padding_ms: Amount of audio (in milliseconds) to include before the voice activity detection (VAD) detects speech.
silence_duration_ms: Duration of silence (in milliseconds) needed to detect the end of speech. Shorter values will detect turns more quickly.
When I first tried it, I was surprised by how poorly it performed, but try increasing prefix_padding_ms to 1 second (1000) and see if that helps.
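Here is a minimal sketch of how those three parameters could be packed into a session update message. The field names threshold, prefix_padding_ms and silence_duration_ms come from the descriptions above; the helper names, the "transcription_session.update" event type, the "session" wrapper and the example values are assumptions for illustration, so check them against the API version you are actually connecting to.

```python
import json


def build_turn_detection(threshold=0.6, prefix_padding_ms=1000, silence_duration_ms=700):
    """Return a server-VAD turn_detection dict after basic range checks.

    The default values here are illustrative, not the API's defaults.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    if prefix_padding_ms < 0 or silence_duration_ms < 0:
        raise ValueError("durations must be non-negative milliseconds")
    return {
        "type": "server_vad",
        "threshold": threshold,
        "prefix_padding_ms": prefix_padding_ms,
        "silence_duration_ms": silence_duration_ms,
    }


def build_session_update(turn_detection, event_type="transcription_session.update"):
    """Serialize the settings as a JSON message for the WebSocket.

    ASSUMPTION: the event type and the "session" wrapper may differ
    between API versions; pass a different event_type if needed.
    """
    return json.dumps({"type": event_type, "session": {"turn_detection": turn_detection}})


if __name__ == "__main__":
    # Less aggressive settings: louder activation, 1 s of leading audio,
    # and a longer pause before a turn is considered finished.
    print(build_session_update(build_turn_detection()))
```

You would send the resulting string over the already-open WebSocket right after connecting, then adjust one value at a time to see which one actually affects latency or accuracy for your audio.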
I also have the same latency issue (the quality is good enough for me though).
Even the mini model (gpt-4o-mini-transcribe) is several times slower than Deepgram (the mini model typically takes 1.5s-2s to output the transcripts, which is too slow for realtime conversation).
I would like to know if this also holds for gpt-4o-transcribe. More specifically, does it downsample anything higher to 24 kHz or to 16 kHz? That’s why I came here. I know you may not know, but could someone tag OpenAI please? Thanks.
The Realtime API, which exposes only gpt-4o variants, tells us something about the underlying model itself.
Input
For pcm16, input audio must be 16-bit PCM at a 24kHz sample rate, single channel (mono), and little-endian byte order.
Output
For pcm16, output audio is sampled at a rate of 24kHz.
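If your capture device only gives you 16 kHz, one option is to resample to the 24 kHz mono pcm16 format quoted above before sending. This is a rough sketch using only the standard library and plain linear interpolation (a proper resampler with low-pass filtering would sound better). The function names are hypothetical; the input_audio_buffer.append event with a base64 "audio" field is how the Realtime API takes audio chunks as far as I know, but verify it against the current reference.

```python
import array
import base64
import json
import sys

TARGET_RATE = 24000  # pcm16 input rate quoted from the docs above


def resample_pcm16_mono(pcm_bytes, src_rate, dst_rate=TARGET_RATE):
    """Linearly interpolate little-endian 16-bit mono PCM to dst_rate."""
    samples = array.array("h")
    samples.frombytes(pcm_bytes)
    if sys.byteorder == "big":
        samples.byteswap()  # input bytes are little-endian
    if src_rate == dst_rate or len(samples) == 0:
        out = array.array("h", samples)
    else:
        n_out = len(samples) * dst_rate // src_rate
        out = array.array("h", bytes(2 * n_out))
        last = len(samples) - 1
        for i in range(n_out):
            pos = i * src_rate / dst_rate      # position in the source signal
            lo = min(int(pos), last)
            hi = min(lo + 1, last)
            frac = pos - lo
            out[i] = round(samples[lo] * (1 - frac) + samples[hi] * frac)
    if sys.byteorder == "big":
        out.byteswap()  # emit little-endian bytes again
    return out.tobytes()


def append_event(pcm24k_bytes):
    """Wrap a 24 kHz pcm16 chunk as a JSON audio-append message."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm24k_bytes).decode("ascii"),
    })
```

Usage would be something like ws.send(append_event(resample_pcm16_mono(chunk, 16000))) for each microphone chunk, where ws is your own WebSocket object. If the device supports it, capturing at 24000 directly avoids the resampling step entirely.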
As this reflects the internal format that the AI model is trained on (after convolution), anything accepted by the API beyond that would need resampling to align with the encoded training corpus.
The gpt-4o model itself is pretrained on audio, and then fine-tuned to take input audio and generate text.
We can infer that all audio the model encodes into tokens, for both input and output learning, would share one unified internal format.
One can then extrapolate that when “raw” or “pcm” has only one format you must provide or receive, across several endpoints that expose such I/O, that format is the model’s native sample rate and channel count as input to its codec.
Perceptual lossy audio like mp3 would be decoded to the needed destination.
It's 2026 now, and the gpt-4o-transcribe realtime API still has many limitations when working with a prompt (it tries to return the prompt itself if the audio quality is not perfect). Has anyone else had the same issue?