Is the `gpt-realtime-translate` audio frame size documented incorrectly?

mlinke · June 10, 2026, 12:43pm

The documentation for the real-time translation server events of the gpt-realtime-translate API states that output audio deltas are streamed in frames of 200 ms of PCM16 audio, encoded in Base64. The sample rate for both input and output audio is fixed at 24 kHz. If my math is correct, 200 ms of audio at 24 kHz should equal 4800 samples per frame.
However, after decoding the Base64 string and converting the bytes to a NumPy array, the array contains 9600 samples, which corresponds to 400 ms of audio at that sample rate.
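The arithmetic above can be sketched as follows. This is only a sanity check, assuming mono audio and 2 bytes per PCM16 sample; the constant and function names are illustrative and not part of any API.

```python
SAMPLE_RATE_HZ = 24_000   # fixed sample rate stated in the documentation
BYTES_PER_SAMPLE = 2      # PCM16 = 16-bit signed integers (assumption: mono)


def expected_frame(duration_ms):
    """Return (samples, raw_bytes) for a mono PCM16 frame of the given duration."""
    samples = SAMPLE_RATE_HZ * duration_ms // 1000
    return samples, samples * BYTES_PER_SAMPLE


def duration_ms_from_bytes(num_bytes):
    """Return the duration in milliseconds of num_bytes of raw mono PCM16 audio."""
    return num_bytes * 1000 / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)


print(expected_frame(200))            # (4800, 9600)
print(duration_ms_from_bytes(19200))  # 400.0
```

Note that a 200 ms frame is 4800 samples but 9600 raw bytes, so it is worth confirming that the observed 9600 is a sample count (19200 decoded bytes) and not a byte count.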
I am using the following code, adapted from the real-time conversations guide for Python, to decode and convert the audio:

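import base64
import struct

import numpy as np

def pcm16_to_float(pcm16):
    # Unpack little-endian signed 16-bit samples and scale them to about [-1.0, 1.0].
    pcm16_iter = map(lambda x: x[0] / 32767, struct.iter_unpack("<h", pcm16))
    return np.fromiter(pcm16_iter, dtype=np.float32)

def base64_decode_audio(encoded):
    # Decode the Base64 payload to raw PCM16 bytes, then convert to float32.
    decoded = base64.b64decode(encoded)
    float32_array = pcm16_to_float(decoded)
    return float32_array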

Initially, my code threw an error because I expected only 4800 samples per frame and allocated a buffer accordingly. After increasing the buffer size to 9600 samples, I am able to replay the output audio stream without any distortion.
My question is whether I am decoding the frames incorrectly or whether the documentation states the wrong frame size.
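Independently of which frame size is correct, the fixed-size buffer described above could be replaced by one that accepts deltas of any length. The following is a minimal sketch, assuming mono PCM16 bytes; `Pcm16Fifo` is a hypothetical helper, not something provided by the API or the guide.

```python
BYTES_PER_SAMPLE = 2  # PCM16 = 16-bit signed integers (assumption: mono)


class Pcm16Fifo:
    """Hypothetical byte FIFO that decouples delta size from playback block size."""

    def __init__(self):
        self._pending = bytearray()

    def push(self, chunk: bytes) -> None:
        # Append a decoded delta of arbitrary length.
        self._pending.extend(chunk)

    def pop(self, num_samples: int) -> bytes:
        # Return exactly num_samples samples, padding with silence (zero bytes)
        # if fewer samples are currently buffered.
        want = num_samples * BYTES_PER_SAMPLE
        out = bytes(self._pending[:want])
        del self._pending[:want]
        return out + b"\x00" * (want - len(out))


fifo = Pcm16Fifo()
fifo.push(b"\x01\x00" * 9600)   # one 9600-sample delta, as observed
block = fifo.pop(4800)          # playback still consumes 4800-sample blocks
print(len(block))               # 9600 bytes = 4800 samples
```

With this approach, the playback side never needs to know how many samples each delta carries.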
