Low and slow audio from the Realtime API: how do I set the audio format properly?

Hi,

I’m working on a project where I connect the GPT-4 Realtime API via WebSockets to a Vonage Voice API WebSocket. The goal is to facilitate a two-way audio conversation where OpenAI can both hear and respond to the user in real time. I’m encountering two critical issues:

  1. Voice Quality: The assistant’s voice sounds slowed down and low-pitched.
  2. Speech Detection: OpenAI doesn’t seem to hear or respond to user audio, despite receiving audio data from Vonage.

Setup Overview:

  • Environment: Python with aiohttp, websockets, and pydub.
  • Vonage WebSocket: Receives and sends audio in PCM 16-bit Linear, 16kHz format.
  • OpenAI Realtime API: Configured for 16kHz PCM signed 16-bit little-endian audio.

Current Workflow:

  1. Vonage to OpenAI: Receive binary audio data from Vonage, base64-encode it, and send it to OpenAI’s input audio buffer.
  2. OpenAI to Vonage: Decode the base64 audio response from OpenAI and send it as 16kHz PCM back to Vonage.

Code Summary: Below is a simplified version of my code with the relevant functions for managing WebSocket connections and audio handling.

import asyncio
import json
import os
import base64
from aiohttp import web, WSMsgType
from pydub import AudioSegment
import websockets
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
VOICE = 'alloy'
SYSTEM_MESSAGE = "Friendly assistant ready to chat and offer insights."

async def initialize_session(openai_ws):
    session_update = {
        "type": "session.update",
        "session": {
            "turn_detection": {
                "type": "server_vad",
                "threshold": 0.4,
                "prefix_padding_ms": 200,
                "silence_duration_ms": 300
            },
            "input_audio_format": "pcm_s16le_16000",
            "output_audio_format": "pcm_s16le_16000",
            "voice": VOICE,
            "instructions": SYSTEM_MESSAGE,
            "modalities": ["text", "audio"],
            "temperature": 0.8,
            "input_audio_transcription": {
                "model": "whisper-1"
            }
        }
    }
    await openai_ws.send(json.dumps(session_update))

async def vonage_to_openai(vonage_ws, openai_ws):
    try:
        async for msg in vonage_ws:
            if msg.type == WSMsgType.BINARY:
                encoded_chunk = base64.b64encode(msg.data).decode('utf-8')
                audio_append = {
                    "type": "input_audio_buffer.append",
                    "audio": encoded_chunk
                }
                await openai_ws.send(json.dumps(audio_append))
                
                # Commit the input buffer after each appended chunk
                commit_event = {
                    "type": "input_audio_buffer.commit"
                }
                await openai_ws.send(json.dumps(commit_event))
            else:
                logger.warning("Non-binary message received from Vonage.")
    except Exception as e:
        logger.error("Error in vonage_to_openai: %s", e)

async def openai_to_vonage(openai_ws, vonage_ws):
    try:
        while True:
            message = await openai_ws.recv()
            data = json.loads(message)
            event_type = data.get('type')

            if event_type == 'response.audio.delta':
                audio_base64 = data.get('delta')
                if audio_base64:
                    audio_bytes = base64.b64decode(audio_base64)
                    frame_size = 640  # 640 bytes = 20ms of 16-bit mono PCM at 16kHz (320 samples * 2 bytes)
                    for i in range(0, len(audio_bytes), frame_size):
                        chunk = audio_bytes[i:i + frame_size]
                        await vonage_ws.send_bytes(chunk)
                        await asyncio.sleep(0.02)  # pace frames at 20ms to match real time
            elif event_type == 'response.done':
                logger.info("Response generation completed.")
    except Exception as e:
        logger.error("Error in openai_to_vonage: %s", e)

Troubleshooting Steps Taken:

  1. Audio Format Validation: Ensured both Vonage and OpenAI are set to use PCM 16-bit Linear, 16kHz audio format. Set input_audio_format and output_audio_format in session.update to pcm_s16le_16000.
  2. Logging: Verified that audio data is being sent from Vonage and received by OpenAI.
  3. Resampling: Removed any unnecessary resampling to avoid format mismatches between Vonage and OpenAI.
  4. Data Encoding: Audio data is base64-encoded correctly for transmission to OpenAI and decoded correctly from OpenAI responses.

Despite these adjustments, the assistant’s voice remains low-pitched and slow, and there is no indication that OpenAI is receiving or processing the user’s speech.

Good job on the comprehensive problem statement and all the necessary info; really appreciate that.

However, when it comes to the issues you have pointed out, it is really important to understand how the input audio is being recorded and how the synthesized audio is being played back.

Additionally, I have noticed that you are providing pcm_s16le_16000 as your audio formats. According to the docs, this should be just pcm16 instead. Another big problem is that the integration section does not list a 16kHz sample rate as supported for 16-bit PCM, so you would have to upsample before input and downsample after output.

I personally didn’t experience such issues in the playground, in the reference client, or in my own client (although I’m only using G.711 u-law). If you change the format in session.update to follow the docs and the problem doesn’t go away, this could very well be an issue outside the scope of your Python client and the Realtime API in general.
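For reference, the documented values for these two fields are pcm16, g711_ulaw and g711_alaw. As a sketch, the relevant part of my session.update just sets:

"input_audio_format": "g711_ulaw",
"output_audio_format": "g711_ulaw",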

If that’s the case, you could try to add additional processing on the audio pre-input and post-output from the Realtime API, which should be pretty easy to implement.
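For example, here is a minimal resampling sketch using only the standard library (an assumption on my part that your audio is 16-bit mono PCM; note that audioop is deprecated since Python 3.11 and removed in 3.13, so on newer versions stick with pydub):

import audioop

def resample_pcm16(data: bytes, in_rate: int, out_rate: int) -> bytes:
    """Resample raw 16-bit mono PCM between sample rates."""
    # ratecv args: fragment, sample width in bytes, channels, in rate, out rate, state
    converted, _state = audioop.ratecv(data, 2, 1, in_rate, out_rate, None)
    return converted

# Upsample Vonage's 16kHz frames before input_audio_buffer.append:
# chunk_24k = resample_pcm16(vonage_frame, 16000, 24000)
# Downsample the 24kHz response.audio.delta bytes before sending to Vonage:
# chunk_16k = resample_pcm16(openai_delta, 24000, 16000)

(For chunked streaming you would keep _state and pass it back in on the next call instead of None, so the filter state carries across chunks.)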


Hi Ivan,

thanks for the reply! I’m not experienced with audio work, so this is a struggle.

I made some changes, but the issue is the same: a low, slow voice, and no response to what I am saying (it can’t hear me). Here is what I changed:

1. Changed Audio Format in session.update:

  • In the original script, the input_audio_format and output_audio_format were set as "pcm_s16le_16000". This was updated to "pcm16" as specified in the OpenAI documentation:

Code Change:

"input_audio_format": "pcm16",
"output_audio_format": "pcm16",

2. Resampling to 24kHz for OpenAI Compatibility:

  • Since OpenAI requires audio to be in 24kHz PCM for the Realtime API, a resample_audio function was added. This function uses pydub to resample incoming audio data from 16kHz (Vonage’s rate) to 24kHz before sending it to OpenAI. Similarly, it down-samples OpenAI’s 24kHz output back to 16kHz for Vonage. (A quick sanity check of this function is sketched after the list of changes.)

Code Change:

import io  # needed for io.BytesIO below

def resample_audio(data: bytes, source_rate=16000, target_rate=24000):
    """Resample raw mono 16-bit PCM from source_rate to target_rate."""
    audio = AudioSegment.from_raw(io.BytesIO(data), sample_width=2, frame_rate=source_rate, channels=1)
    resampled_audio = audio.set_frame_rate(target_rate).set_sample_width(2).set_channels(1)
    return resampled_audio.raw_data

3. Updated Audio Handling in vonage_to_openai and openai_to_vonage:

  • In vonage_to_openai: The audio received from Vonage is now resampled to 24kHz and then base64-encoded before sending to OpenAI.

Code Change:

resampled_audio = resample_audio(msg.data, source_rate=16000, target_rate=24000)
encoded_chunk = base64.b64encode(resampled_audio).decode('utf-8')

  • In openai_to_vonage: The audio received from OpenAI is decoded, resampled back down to 16kHz, and sent in 20ms chunks to Vonage, keeping it aligned with real time.

Code Change:

resampled_audio = resample_audio(audio_bytes, source_rate=24000, target_rate=16000)
frame_size = 640  # 20ms of 16-bit mono PCM at 16kHz
for i in range(0, len(resampled_audio), frame_size):
    chunk = resampled_audio[i:i + frame_size]
    await vonage_ws.send_bytes(chunk)
    await asyncio.sleep(0.02)  # pace at one 20ms frame per 20ms
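As a quick sanity check of the resample_audio change (a minimal sketch, assuming the function above): one second of 16kHz input should come back as roughly one second at the new rate.

one_second_16k = b"\x00\x00" * 16000  # 1s of silence: 16-bit mono at 16kHz
up = resample_audio(one_second_16k, source_rate=16000, target_rate=24000)
print(len(up))    # expect ~48000 bytes: 1s at 24kHz, 2 bytes per sample
down = resample_audio(up, source_rate=24000, target_rate=16000)
print(len(down))  # expect ~32000 bytes: back to 1s at 16kHz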

I don’t have much experience working with audio in Python and the libs from its ecosystem, but your code looks about right.

Your next best bet is to record whatever comes from Vonage into a file (with this resampling enabled), and whatever comes from the Realtime API into another file (also with resampling enabled).

You would then open those files with some software that lets you inspect the audio format information properly, listen to the audio, and that should give you a pretty good idea of what went wrong. Feel free to share your findings here.
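If it helps, a minimal sketch for taping raw PCM to a playable file with the stdlib wave module (assuming 16-bit mono; use the sample rate of the stream you are recording):

import wave

def dump_pcm_to_wav(path: str, pcm_chunks: list, sample_rate: int) -> None:
    """Write raw 16-bit mono PCM chunks to a WAV file for offline inspection."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)            # mono
        wf.setsampwidth(2)            # 16-bit samples
        wf.setframerate(sample_rate)  # rate the bytes were actually produced at
        wf.writeframes(b"".join(pcm_chunks))

# e.g. dump_pcm_to_wav("from_vonage.wav", vonage_chunks, 16000)
#      dump_pcm_to_wav("from_openai.wav", openai_chunks, 24000)

If a file sounds slow and low-pitched on playback, the data was produced at a higher rate than it is being tagged or played at (e.g. 24kHz audio treated as 16kHz), which would match your symptom exactly.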

Edit: Took a closer look at your resampling code; you might want to add .raw_data at the end of resampled_audio = audio.set_frame_rate(target_rate).set_sample_width(2).set_channels(1), since pydub’s setters return an AudioSegment rather than raw bytes (this is how it’s done in the integration guide mentioned before).

@wassaa I am facing an issue where, if I use a 24000Hz sample rate, the audio output comes out perfect. But if I use 48000Hz as the audio input and downsample it to 24000Hz, the output sounds like a chipmunk voice.

Were you able to resolve your issue? I’ve been banging my head against this for the last two days but am not able to resolve it. Any leads will be appreciated.

What output format are you setting in the session configuration? It would also be useful to see all the relevant code.

I got it solved. The issue was that I was not setting the audio frequency when playing the audio back. Here is my repo, verbal-ai, on GitHub.
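For anyone hitting the same thing, the gist of the fix is just passing the sample rate at playback time. A minimal sketch (assuming the sounddevice library and 24kHz 16-bit mono pcm16 output; audio_bytes is a placeholder for the decoded response.audio.delta data):

import sounddevice as sd

# Play raw PCM at the rate it was produced at (24kHz for the Realtime API's pcm16);
# if samplerate doesn't match the data, you get chipmunk or slowed-down audio.
with sd.RawOutputStream(samplerate=24000, channels=1, dtype="int16") as stream:
    stream.write(audio_bytes)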

@wassaa were you able to solve the issue? I am having the same problem, thanks.
