I was able to get some preliminary results streaming and playing back audio in real time with pyaudio:
import os
from time import time

import pyaudio
import requests

url = "https://api.openai.com/v1/audio/speech"
headers = {
    "Authorization": f'Bearer {os.getenv("OPENAI_API_KEY")}',
}
data = {
    "model": "tts-1",
    "input": "This is a test",
    "voice": "shimmer",
    "response_format": "wav",
}

start_time = time()
response = requests.post(url, headers=headers, json=data, stream=True)

if response.status_code == 200:
    print(f"Time to first byte: {int((time() - start_time) * 1000)} ms")
    p = pyaudio.PyAudio()
    # pyaudio.paInt16 (== 8): 16-bit signed PCM, mono, 24 kHz
    stream = p.open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)
    # Play each chunk as it arrives instead of waiting for the full response
    for chunk in response.iter_content(chunk_size=1024):
        stream.write(chunk)
    print(f"Time to complete: {int((time() - start_time) * 1000)} ms")
    stream.close()
    p.terminate()
HEADPHONE WARNING: this can cause very harsh noise, especially on longer inputs. Keep volume low.
The best part is that this drastically reduces latency to about 200-500 ms (time to first byte). I found latency was lowest with "response_format": "wav", though the trade-off is larger file size. But on any decent connection, the bottleneck will still be generation speed, not network.
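One way to sanity-check that generation speed, not the network, is the bottleneck: compare wall-clock time against the duration of the audio received so far. A minimal sketch, assuming 16-bit mono PCM at 24 kHz (the helper name and defaults are mine, not part of the API):

```python
def pcm_duration_seconds(num_bytes, rate=24000, channels=1, sample_width=2):
    """Duration of raw PCM audio given its size in bytes.

    Defaults assume what tts-1 appears to return for "wav":
    16-bit (2-byte) samples, mono, 24 kHz.
    """
    bytes_per_second = rate * channels * sample_width
    return num_bytes / bytes_per_second

# 48,000 bytes of 16-bit mono PCM at 24 kHz is exactly one second of audio
print(pcm_duration_seconds(48_000))  # 1.0
```

If the request's total wall-clock time is consistently larger than the duration of the audio it produced, the model is the bottleneck and a faster connection won't help.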
I got the values for format, channels, and rate by writing the response to a .wav file and inspecting it with the wave module, as shown in the pyaudio docs:
import io
import os
import wave

import pyaudio
import requests

# url = ...
# headers = ...
# data = ...

response = requests.post(url, headers=headers, json=data, stream=True)

if response.status_code == 200:
    # Buffer the full response, then save it as a .wav file
    buffer = io.BytesIO()
    for chunk in response.iter_content(chunk_size=1024):
        buffer.write(chunk)
    with open("speech.wav", "wb") as f:
        f.write(buffer.getvalue())

    # Read the playback parameters back out of the WAV header
    with wave.open("speech.wav", "rb") as wf:
        p = pyaudio.PyAudio()
        print("format", p.get_format_from_width(wf.getsampwidth()))
        print("channels", wf.getnchannels())
        print("rate", wf.getframerate())
But my guess is that this is wrong, and is the cause of the intermittent noise.
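One possible contributor: when streaming the "wav" format, the first chunk begins with the RIFF/WAV header, and those header bytes get written to the output stream as if they were PCM samples. That would only explain a brief click at the start, not sustained noise, but it's cheap to rule out. A hedged sketch of skipping past the `data` sub-chunk before playback (the helper name is mine, and this assumes a standard RIFF layout rather than parsing the chunks properly):

```python
import io
import wave


def strip_wav_header(first_chunk: bytes) -> bytes:
    """Return only the PCM bytes from a chunk that starts with a RIFF/WAV header.

    Scans for the 'data' sub-chunk marker and skips the 8-byte
    chunk header (4-byte id + 4-byte size) that follows it.
    """
    idx = first_chunk.find(b"data")
    if idx == -1:
        return first_chunk  # no header found; pass through unchanged
    return first_chunk[idx + 8:]


# Build a tiny in-memory WAV file to demonstrate
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(24000)
    wf.writeframes(b"\x01\x02\x03\x04")

pcm = strip_wav_header(buf.getvalue())
print(pcm)  # b'\x01\x02\x03\x04'
```

In the streaming loop you would apply this only to the first chunk. If the harsh noise persists after that, the cause is more likely elsewhere, e.g. a mismatch in the playback parameters or buffer underruns during blocking writes.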