Is there an API that recognizes from voice data and responds to voice data?

Is there an API that not only sends an audio file to OpenAI Server and generates text, but also recognizes the audio data and responds audio data.

I know speechToText API and TextToSpeech is prepared.
But, if I try to recognize form voice data and responds to voice data, I will connect to OpenAI Server three times.

【Three times】
First:SpeechToText
Second:Request chatText
Third:TextToSpeech

I want to know an API that can perform the above processing in one time.

This is a great idea and much needed for doing a normal speech conversational flow. They could make it where the upstream and downstream audio channels are kept open long term, so that you can actually interrupt the response speech, and say “stop, you misunderstood” or whatever, to cut them off, just like with a human.

Or maybe not everyone’s rude and cuts people off. haha. But I don’t think this API feature exists yet. Hopefully they are working on it!

2 Likes