I am developing an iPhone app that can converse in real time using the ChatGPT API.
1. Transcribe audio to text using Whisper.
2. Send the transcription hands-free to the ChatGPT API.
3. Stream ChatGPT’s responses in real time on the chat interface as text.
4. Once the response is complete, use Text to Speech to vocalize the text.
I have implemented the flow up to step 3, but when conversing hands-free there is a noticeable lag between the completion of step 3 and the start of step 4. I saw on the OpenAI site that streaming real-time audio is possible. I would appreciate it if someone with experience here could share their insights.
https://platform.openai.com/docs/guides/text-to-speech
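One way to shrink the gap between steps 3 and 4 is to stop treating them as strictly sequential: split the streamed reply at sentence boundaries and send each finished sentence to the TTS endpoint immediately, so the first sentence is already playing while the rest is still streaming. Below is a minimal Swift sketch of that idea; the class name `IncrementalSpeaker`, the `tts-1` model, and the `alloy` voice are illustrative placeholders (check the TTS guide above for current names), and a production version would also need to configure `AVAudioSession` and queue playback properly.

```swift
import AVFoundation
import Foundation

// Sketch only: speaks the reply sentence-by-sentence while the chat
// completion is still streaming, instead of waiting for the full text.
final class IncrementalSpeaker {
    private var pending = ""                  // streamed text not yet spoken
    private var players: [AVAudioPlayer] = [] // keep players alive while playing
    private let apiKey: String

    init(apiKey: String) { self.apiKey = apiKey }

    /// Feed each text delta from the streamed chat completion into this.
    func receive(delta: String) async {
        pending += delta
        // Flush whenever a sentence terminator shows up, so speech can start
        // long before the whole response has arrived.
        while let end = pending.rangeOfCharacter(from: CharacterSet(charactersIn: ".!?。！？")) {
            let sentence = String(pending[..<end.upperBound])
            pending.removeSubrange(..<end.upperBound)
            await speak(sentence.trimmingCharacters(in: .whitespacesAndNewlines))
        }
    }

    /// Call once the stream ends, to speak any trailing fragment.
    func finish() async {
        let rest = pending.trimmingCharacters(in: .whitespacesAndNewlines)
        pending = ""
        if !rest.isEmpty { await speak(rest) }
    }

    private func speak(_ text: String) async {
        guard !text.isEmpty else { return }
        var request = URLRequest(url: URL(string: "https://api.openai.com/v1/audio/speech")!)
        request.httpMethod = "POST"
        request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
        request.setValue("application/json", forHTTPHeaderField: "Content-Type")
        request.httpBody = try? JSONSerialization.data(withJSONObject: [
            "model": "tts-1",   // placeholder model name
            "voice": "alloy",   // placeholder voice
            "input": text
        ])
        do {
            let (audio, _) = try await URLSession.shared.data(for: request)
            let player = try AVAudioPlayer(data: audio)   // response audio (mp3 by default)
            players.append(player)
            player.play()
            // Sketch-level serialization: wait out the clip so sentences don't
            // overlap. A real app would queue clips and prefetch the next one.
            try await Task.sleep(nanoseconds: UInt64(player.duration * 1_000_000_000))
        } catch {
            print("TTS request failed: \(error)")
        }
    }
}
```

In the chat-streaming loop you would call `receive(delta:)` with each content chunk and `finish()` when the stream closes, so the first sentence starts playing well before the full response is complete.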
I’m working on a similar project and was wondering if you managed to resolve the issue with the noticeable lag between steps 3 and 4.
If you were able to solve it, I would greatly appreciate it if you could help me with my project as well. I would be happy to discuss the details and terms of collaboration.
Hi,
It’s great to see this. I had a similar idea, but I am still researching the tech stack. I found that many platforms ship a built-in text-to-speech API for accessibility, like speechSynthesis in the Web API, but the voice quality is noticeably worse than OpenAI’s TTS.
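For context, the iOS counterpart to speechSynthesis is AVSpeechSynthesizer. It runs on-device, so it starts speaking almost immediately and works offline, which makes it a reasonable low-latency fallback while higher-quality network TTS is loading. A minimal sketch of that fallback idea:

```swift
import AVFoundation

// On-device fallback, analogous to the Web API's speechSynthesis.
// Starts speaking with essentially no network latency, but the voices are
// noticeably more robotic than OpenAI's TTS output.
let synthesizer = AVSpeechSynthesizer()   // keep a strong reference while speaking

func speakLocally(_ text: String) {
    let utterance = AVSpeechUtterance(string: text)
    utterance.voice = AVSpeechSynthesisVoice(language: "en-US") // pick a locale/voice
    utterance.rate = AVSpeechUtteranceDefaultSpeechRate
    synthesizer.speak(utterance)          // queues and plays immediately, offline
}
```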
I am also curious whether there is a way to chain a sequence of OpenAI API calls in a single request, so the Whisper → ChatGPT → TTS hops don’t each cost a full round trip, but it seems like there isn’t. I guess the closest we can get is to deploy your own relay server on Azure, close to where the OpenAI models are hosted.
Real-time apps are very sensitive to latency, so we need a way to keep it under control. If you find any good solutions, please keep us updated.