Hi,
I’m working on a project that bridges the GPT-4 Realtime API and a Vonage Voice API WebSocket, so OpenAI can hold a two-way audio conversation with the caller — hearing and responding in real time. I’m encountering two critical issues:
- Voice Quality: The assistant’s voice sounds slowed down and low-pitched.
- Speech Detection: OpenAI doesn’t seem to hear or respond to user audio, despite receiving audio data from Vonage.
Setup Overview:
- Environment: Python with `aiohttp`, `websockets`, and `pydub`.
- Vonage WebSocket: Receives and sends audio as 16-bit linear PCM at 16kHz (the NCCO I return on answer is sketched just below this list).
- OpenAI Realtime API: Configured for 16kHz PCM signed 16-bit little-endian audio.
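For context, this is roughly the answer webhook on the Vonage side that bridges the call into my server. Treat it as a minimal sketch: the WebSocket URI is a placeholder, and the `content-type` is what tells Vonage to stream 16kHz linear PCM.

```python
# Hypothetical answer-webhook handler (aiohttp); the WebSocket URI is a placeholder.
from aiohttp import web

async def answer(request: web.Request) -> web.Response:
    # NCCO instructing Vonage to stream the call audio to my WebSocket
    # as 16-bit linear PCM at 16 kHz.
    ncco = [
        {
            "action": "connect",
            "endpoint": [
                {
                    "type": "websocket",
                    "uri": "wss://example.com/socket",  # placeholder
                    "content-type": "audio/l16;rate=16000",
                }
            ],
        }
    ]
    return web.json_response(ncco)
```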
Current Workflow:
- Vonage to OpenAI: Receive binary audio data from Vonage, base64-encode it, and send it to OpenAI’s input audio buffer.
- OpenAI to Vonage: Decode the base64 audio response from OpenAI and send it as 16kHz PCM back to Vonage.
Code Summary: Below is a simplified version of my code with the relevant functions for managing WebSocket connections and audio handling.
```python
import asyncio
import base64
import json
import logging
import os

from aiohttp import web, WSMsgType
from pydub import AudioSegment  # currently unused; left over from earlier resampling attempts
import websockets

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
VOICE = 'alloy'
SYSTEM_MESSAGE = "Friendly assistant ready to chat and offer insights."


async def initialize_session(openai_ws):
    """Send the session.update event that configures VAD, audio formats, and voice."""
    session_update = {
        "type": "session.update",
        "session": {
            "turn_detection": {
                "type": "server_vad",
                "threshold": 0.4,
                "prefix_padding_ms": 200,
                "silence_duration_ms": 300
            },
            "input_audio_format": "pcm_s16le_16000",
            "output_audio_format": "pcm_s16le_16000",
            "voice": VOICE,
            "instructions": SYSTEM_MESSAGE,
            "modalities": ["text", "audio"],
            "temperature": 0.8,
            "input_audio_transcription": {
                "model": "whisper-1"
            }
        }
    }
    await openai_ws.send(json.dumps(session_update))


async def vonage_to_openai(vonage_ws, openai_ws):
    """Forward caller audio from the Vonage WebSocket to OpenAI's input audio buffer."""
    try:
        async for msg in vonage_ws:
            if msg.type == WSMsgType.BINARY:
                # Vonage delivers raw PCM frames; OpenAI expects base64 inside a JSON event.
                encoded_chunk = base64.b64encode(msg.data).decode('utf-8')
                audio_append = {
                    "type": "input_audio_buffer.append",
                    "audio": encoded_chunk
                }
                await openai_ws.send(json.dumps(audio_append))
                # Commit after every appended chunk.
                commit_event = {"type": "input_audio_buffer.commit"}
                await openai_ws.send(json.dumps(commit_event))
            else:
                logger.warning("Non-binary message received from Vonage.")
    except Exception as e:
        logger.error("Error in vonage_to_openai: %s", e)


async def openai_to_vonage(openai_ws, vonage_ws):
    """Forward assistant audio from OpenAI back to the Vonage call leg."""
    try:
        while True:
            message = await openai_ws.recv()
            data = json.loads(message)
            event_type = data.get('type')

            if event_type == 'response.audio.delta':
                audio_base64 = data.get('delta')
                if audio_base64:
                    audio_bytes = base64.b64decode(audio_base64)
                    # 640 bytes = 320 samples = 20 ms at 16 kHz, 16-bit mono,
                    # the frame size Vonage expects on its WebSocket.
                    frame_size = 640
                    for i in range(0, len(audio_bytes), frame_size):
                        chunk = audio_bytes[i:i + frame_size]
                        await vonage_ws.send_bytes(chunk)
                        await asyncio.sleep(0.02)  # pace frames at real time
            elif event_type == 'response.done':
                logger.info("Response generation completed.")
    except Exception as e:
        logger.error("Error in openai_to_vonage: %s", e)
```
Troubleshooting Steps Taken:
- Audio Format Validation: Ensured both Vonage and OpenAI use 16-bit linear PCM at 16kHz by setting `input_audio_format` and `output_audio_format` to `pcm_s16le_16000` in `session.update`.
- Logging: Verified that audio data is being sent from Vonage and received by OpenAI (a quick WAV-dump sanity check is sketched after this list).
- Resampling: Removed all resampling to avoid format mismatches between Vonage and OpenAI.
- Data Encoding: Confirmed that audio is base64-encoded correctly for transmission to OpenAI and decoded correctly from OpenAI responses.
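The WAV-dump check referenced above: write a few seconds of raw inbound Vonage frames to a WAV container with the parameters I believe the stream has, then listen to the file. This is a minimal sketch assuming 16kHz mono s16le; if playback sounds slow or pitched wrong at these settings, the real stream format differs.

```python
import wave

def dump_pcm_to_wav(pcm_bytes: bytes, path: str = "vonage_capture.wav") -> None:
    """Wrap raw PCM in a WAV header so it can be played back and inspected."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)      # assumed: mono
        wf.setsampwidth(2)      # assumed: 16-bit samples
        wf.setframerate(16000)  # assumed: 16 kHz
        wf.writeframes(pcm_bytes)
```

A sample-rate mismatch produces exactly the slow, low-pitched symptom described above, so running the same check on OpenAI’s decoded `response.audio.delta` bytes would show whether the response audio is really 16kHz.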
Despite these adjustments, the assistant’s voice remains low-pitched and slow, and there is no indication that OpenAI is receiving or processing the user’s speech.