I am developing an iPhone app that can converse in real time using the ChatGPT API.
1. Transcribe audio to text using Whisper.
2. Send the transcription hands-free to the ChatGPT API.
3. Stream ChatGPT’s responses in real time on the chat interface as text.
4. Once the response is complete, use Text-to-Speech to vocalize the text.
I have managed to implement up to step 3, but there is a noticeable lag between the completion of step 3 and the start of step 4 when conversing hands-free. I saw on the OpenAI site that streaming real-time audio is possible. I would appreciate it if someone who has experience with this could share their insights.
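Not something from this thread, but one way to shrink that gap is to overlap steps 3 and 4: split the streamed ChatGPT text into sentences as it arrives and hand each finished sentence to TTS immediately, instead of waiting for the full response. Below is a rough Python sketch of that idea; the model name, the `speak` callback, and the sentence-boundary regex are illustrative assumptions, and `client` is expected to be an `openai.OpenAI` instance.

```python
import re

# Matches the shortest prefix ending in ., !, or ? (an illustrative heuristic)
_SENTENCE_END = re.compile(r"(.*?[.!?])(?:\s+|$)", re.DOTALL)

def split_complete_sentences(buffer):
    """Split buffered text into (finished_sentences, unfinished_remainder)."""
    sentences = []
    while True:
        match = _SENTENCE_END.match(buffer)
        if not match:
            break
        sentences.append(match.group(1).strip())
        buffer = buffer[match.end():]
    return sentences, buffer

def stream_chat_to_tts(client, speak, prompt):
    """Stream a chat completion and call speak(sentence) -- e.g. a TTS
    request -- as soon as each sentence is complete, instead of waiting
    for the whole response. `client` is an openai.OpenAI instance."""
    buffer = ""
    stream = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any streaming chat model works
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        buffer += chunk.choices[0].delta.content or ""
        finished, buffer = split_complete_sentences(buffer)
        for sentence in finished:
            speak(sentence)
    if buffer.strip():  # flush any trailing fragment without punctuation
        speak(buffer.strip())
```

This way the first TTS request goes out after the first sentence rather than after the entire completion, which is usually where most of the perceived lag comes from.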
https://platform.openai.com/docs/guides/text-to-speech
I’m working on a similar project and was wondering if you managed to resolve the issue with the noticeable lag between steps 3 and 4.
If you were able to solve it, I would greatly appreciate it if you could help me with my project as well. I would be happy to discuss the details and terms of collaboration.
Hi,
It’s great to see this. I had a similar idea, but I am still researching the tech stack. I found out that many platforms have some sort of text-to-speech API for accessibility, like speechSynthesis in the Web API, but the quality is worse.
I am also curious whether there is a way to chain a sequence of OpenAI API calls server-side in a single request, but it seems there isn’t. I guess the closest we can get is deploying our own server, for example on Azure.
Real-time apps are super sensitive to lags, so we should find a way to manage that properly. If you have any good solutions, please keep us updated.
I used the following code to stream the audio data in real time using OpenAI’s TTS. It works fine for me.
import time

import pyaudio
from openai import OpenAI

# Initialize the OpenAI client
client = OpenAI(api_key="YOUR_OPENAI_API_KEY")

# Initialize PyAudio; tts-1 PCM output is 24 kHz, 16-bit, mono
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16,
                channels=1,
                rate=24000,
                output=True,
                frames_per_buffer=8192)

def tts_streaming(text):
    first_chunk_sent = False
    start = time.time()
    with client.audio.speech.with_streaming_response.create(
        model="tts-1",
        voice="alloy",
        input=text,
        response_format="pcm"
    ) as response:
        for chunk in response.iter_bytes(4096):
            stream.write(chunk)  # blocking write plays the PCM chunk
            if not first_chunk_sent:
                elapsed_time = time.time() - start
                print(f"Time taken to send the first chunk: {elapsed_time:.4f} seconds")
                first_chunk_sent = True

tts_streaming("Hello, this is a streaming test.")

time.sleep(1)  # let the last buffered audio play out
stream.stop_stream()
stream.close()
p.terminate()