We allow users to interact with our product via voice, using OpenAI’s speech-to-text. One problem is that it does not know when the speaker has finished speaking. If we use a 1-second silence delay, it takes a while for the user to hear a response, because the transcribed text must be fed to OpenAI to generate a reply, and the reply then needs to be converted back into voice. If we use a shorter “silence” period, that fails too, because humans often pause briefly mid-sentence. Would appreciate any suggestions on how to overcome this.
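For context, the silence-based endpointing we are doing looks roughly like the sketch below. It tracks how long the audio energy has stayed below a threshold and declares end-of-speech after a fixed run of silence; the frame size, energy threshold, and silence duration are illustrative values, not anything from OpenAI:

```python
import math

FRAME_MS = 20            # duration of each audio frame we feed in
SILENCE_THRESHOLD = 500  # RMS energy below this counts as silence (tune this)
END_OF_SPEECH_MS = 600   # how much continuous silence ends the utterance

def rms(frame):
    """Root-mean-square energy of one frame of 16-bit PCM samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

class EndOfSpeechDetector:
    def __init__(self):
        self.silent_ms = 0
        self.heard_speech = False

    def feed(self, frame):
        """Feed one audio frame; return True once the speaker seems done."""
        if rms(frame) < SILENCE_THRESHOLD:
            self.silent_ms += FRAME_MS
        else:
            self.heard_speech = True
            self.silent_ms = 0
        # Only fire after we have actually heard speech, so leading
        # silence before the user starts talking never triggers it.
        return self.heard_speech and self.silent_ms >= END_OF_SPEECH_MS
```

The tension described above lives entirely in `END_OF_SPEECH_MS`: too high and the response feels slow, too low and a natural pause cuts the user off.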
Hey @logankilpatrick: We are a startup and don’t want to spend time and money on things that OpenAI plans to improve in the short term. One such thing is voice. Based on your release schedule, would you please advise which of these I should not even touch, because you plan to release them in the next 3-6 months:
- An API that predicts that the user is done speaking and that their input should be processed to create a response. I understand you may be trying to do this with “silence detection” or some other method, such as streaming.
- An API that detects that the LLM is taking longer to process and injects appropriate filler words. This may include restating to the user what they said, in case their input was 30-45 seconds long.
Also, I applied for the new forum you announced on Twitter but did not get a confirmation email. Should I reapply?
Chinmay A. Singh
Founder, iWish AI.
I’m pretty sure that for a good phone-call experience, the speech-to-text must run locally, because latency is so crucial. Speech-to-text is not that complicated.
Secondly, for (2): it’s already possible. You can already measure how long the response is taking.
On your server, check whether the time since you sent the request exceeds X seconds; if so, play a filler word.
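That timer-plus-filler idea can be sketched with asyncio. Here `speak` is a hypothetical stand-in for your text-to-speech output, and `get_llm_reply` is any coroutine wrapping your OpenAI call; only the pattern itself (wait with a timeout, play a filler if the reply isn’t ready) is the point:

```python
import asyncio

FILLER_AFTER_S = 1.5  # how long to wait before playing a filler (tune this)

spoken = []  # stands in for audio output so the sketch is testable

async def speak(text):
    """Hypothetical stand-in for your text-to-speech pipeline."""
    spoken.append(text)

async def respond(get_llm_reply):
    """Run the LLM call; if it is slow, play a filler phrase first."""
    llm_task = asyncio.ensure_future(get_llm_reply())
    # Wait up to FILLER_AFTER_S for the reply; don't cancel it on timeout.
    done, _pending = await asyncio.wait({llm_task}, timeout=FILLER_AFTER_S)
    if not done:
        await speak("Let me think about that for a second...")
    await speak(await llm_task)

async def slow_llm():
    await asyncio.sleep(2)  # simulate a model response slower than the timeout
    return "Here is the answer."
```

Calling `asyncio.run(respond(slow_llm))` plays the filler first and then the real answer; a fast reply skips the filler entirely.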
Also, you can stream the LLM’s response, so you can start doing text-to-speech on the first sentence even before the full response is done.
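One way to do that is to re-chunk the stream at sentence boundaries, so each sentence can be handed to text-to-speech as soon as it completes. A minimal sketch, assuming `token_stream` is any iterable of text deltas (e.g. the content pieces of a streaming chat-completion response):

```python
import re

def sentence_chunks(token_stream):
    """Regroup a stream of text tokens into whole sentences, yielding
    each sentence as soon as its closing punctuation arrives."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Emit every complete sentence currently sitting in the buffer:
        # punctuation followed by whitespace or end-of-buffer.
        while True:
            match = re.search(r"[.!?](\s|$)", buffer)
            if not match:
                break
            end = match.end()
            yield buffer[:end].strip()
            buffer = buffer[end:]
    if buffer.strip():          # flush any trailing partial sentence
        yield buffer.strip()
```

Feeding each yielded sentence straight into your TTS call lets the user start hearing the answer while the rest is still generating. The naive punctuation regex will mis-split on abbreviations like “Dr.”, so treat it as a starting point.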
Hope this helps! But you will have to do some work for this.
Thanks Merlin. We tried using Google’s Speech-to-Text (local). The problem is that it strips out filler words, so we had to use an API, and we decided in favor of OpenAI over AWS. Do you happen to know of any open-source (or otherwise) local speech-to-text that captures filler words?
I agree with your comment on #2, and we tried it. The problem is that streaming gives us the first sentence quickly, but the next sentence then takes longer. This is typical in a sales discussion, where a customer may respond simply: “No.” But when we use OpenAI or any other LLM, we have to wait for the entire answer to come before we can convert it into voice.