Is the `gpt-realtime-translate` audio frame size documented incorrectly?

The documentation for the real-time translation server events of the gpt-realtime-translate API states that output audio deltas are streamed in frames of 200 ms of PCM16 audio, encoded in Base64. The sample rate for both input and output audio is fixed at 24 kHz. If my math is correct, 200 ms of audio at 24 kHz should equal 4800 samples per frame.
However, after decoding the Base64 string and converting the bytes to a NumPy array, the array contains 9600 samples, which corresponds to 400 ms of audio at the given sample rate.
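For reference, the arithmetic behind these numbers can be written out as a small sketch. The constant and function names are illustrative, and single-channel (mono) audio is an assumption on my part. The byte count is included because byte and sample counts differ by a factor of two for PCM16.

```python
SAMPLE_RATE_HZ = 24_000  # fixed sample rate stated in the documentation
BYTES_PER_SAMPLE = 2     # PCM16 = 16 bits per sample (mono assumed)


def samples_for(duration_ms: int) -> int:
    """Number of samples in `duration_ms` milliseconds of audio."""
    return SAMPLE_RATE_HZ * duration_ms // 1000


def duration_ms_for(num_samples: int) -> float:
    """Duration in milliseconds covered by `num_samples` samples."""
    return num_samples * 1000 / SAMPLE_RATE_HZ


expected_samples = samples_for(200)                   # documented frame length
expected_bytes = expected_samples * BYTES_PER_SAMPLE  # raw size of such a frame
observed_ms = duration_ms_for(9600)                   # what I actually receive

print(expected_samples, expected_bytes, observed_ms)
```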
I am using the following code, which I adapted from the real-time conversations guide for Python, to decode and convert the audio:

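import base64
import struct

import numpy as np

def pcm16_to_float(pcm16):
    # Unpack little-endian signed 16-bit samples and scale them to roughly [-1.0, 1.0]
    pcm16_iter = map(lambda x: x[0] / 32767, struct.iter_unpack("<h", pcm16))
    return np.fromiter(pcm16_iter, dtype=np.float32)

def base64_decode_audio(encoded):
    # Decode the Base64 payload to raw PCM16 bytes, then convert to float32
    decoded = base64.b64decode(encoded)
    float32_array = pcm16_to_float(decoded)
    return float32_array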
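To rule out a mix-up between Base64 characters, raw bytes, and samples, each delta can be measured at every stage. The following is a minimal sketch; `describe_frame` is a hypothetical helper of my own, and the test frame is synthetic silence rather than real API output.

```python
import base64


def describe_frame(encoded: str, sample_rate_hz: int = 24_000) -> dict:
    """Report the size of one Base64-encoded PCM16 frame at each stage."""
    raw = base64.b64decode(encoded)
    num_samples = len(raw) // 2  # 2 bytes per PCM16 sample (mono assumed)
    return {
        "base64_chars": len(encoded),
        "bytes": len(raw),
        "samples": num_samples,
        "duration_ms": num_samples * 1000 / sample_rate_hz,
    }


# Synthetic 200 ms frame of silence: 4800 samples * 2 bytes each.
silence_200ms = base64.b64encode(bytes(4800 * 2)).decode("ascii")
info = describe_frame(silence_200ms)
print(info)
```

Logging this for a few real deltas shows directly whether the 9600 figure refers to bytes or to samples.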

Initially, my code threw an error because I expected only 4800 samples and allocated a buffer accordingly. After increasing the buffer size to 9600 samples, I am able to play back the output audio stream without any distortion.
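A buffer that does not assume a fixed frame length would avoid this kind of error regardless of which figure is correct. Below is a minimal sketch of that idea; `PlaybackBuffer` is a hypothetical class, not something from the guide, and it stores raw PCM16 bytes (mono assumed).

```python
class PlaybackBuffer:
    """Hypothetical FIFO of PCM16 bytes that accepts frames of any length."""

    def __init__(self) -> None:
        self._data = bytearray()

    def push(self, pcm16: bytes) -> None:
        """Append one decoded frame, whatever its size."""
        self._data.extend(pcm16)

    def pop(self, num_samples: int) -> bytes:
        """Remove and return up to `num_samples` samples from the front."""
        n = num_samples * 2  # 2 bytes per PCM16 sample
        chunk = bytes(self._data[:n])
        del self._data[:n]
        return chunk

    def __len__(self) -> int:
        """Number of whole samples currently buffered."""
        return len(self._data) // 2


buf = PlaybackBuffer()
buf.push(bytes(9600 * 2))  # a 9600-sample frame
buf.push(bytes(4800 * 2))  # a 4800-sample frame
first = buf.pop(4800)      # playback pulls fixed-size chunks
print(len(first), len(buf))
```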
My question is whether I am decoding the frames incorrectly or whether the documentation states the wrong frame size.