I have a proof-of-concept language-learning app that wraps the chat API with speech-to-text and text-to-speech, so users can interact with ChatGPT via audio in another (supported) language of their choice. The problem is that there is crazy latency in the response times, so it's not currently viable for real-world use. The feature I'm seeking is a combination of the Speech-to-Text and Chat APIs, such that the standard messages data structure can be sent along with an audio file (or, even better, a stream). The ask: you run the audio through speech-to-text, then pass the result as the latest user message to the Chat API. The response would be a bit different from the current chat API response, because I'd expect it to contain the transcribed audio as that user message, plus the latest assistant message. This would dramatically reduce latency for the first two-thirds of my API calls.
Is this something that could be implemented via a plugin? I’ve just joined the waiting list…
On the input side, you might look at splitting the audio up into chunks of a few words at a time. You might also look at turning on the stream=true option when making the LLM call and then passing every couple of words returned to a TTS engine… you might be able to get latency down to a few seconds each way.
Would need some experimentation and testing though.
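The "every couple of words" idea amounts to a small buffering step between the streamed chat completion and the TTS call. A minimal sketch, assuming the streamed response arrives as arbitrary text fragments (here `fake_stream` stands in for the real API iterator):

```python
def chunk_words(token_stream, words_per_chunk=3):
    """Group streamed text fragments into chunks of a few words,
    so each chunk can be handed to a TTS engine as it completes."""
    buffer = ""
    for fragment in token_stream:
        buffer += fragment
        words = buffer.split(" ")
        # Emit full groups of words; keep the trailing partial word buffered.
        while len(words) > words_per_chunk:
            yield " ".join(words[:words_per_chunk]) + " "
            words = words[words_per_chunk:]
        buffer = " ".join(words)
    if buffer:
        yield buffer  # flush whatever remains when the stream ends

# Stand-in for a streamed chat response; fragments need not align with words:
fake_stream = ["Hel", "lo there, ", "how are ", "you doing ", "today?"]
chunks = list(chunk_words(fake_stream))
```

Each yielded chunk would then be sent off to the TTS engine while the rest of the response is still streaming in.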
Where is the latency, actually?
Whisper doesn’t have a streaming mode, so you’ll have trouble getting real time data out of there.
There are other solutions. I personally use assembly.ai, and they are OK, but the “real time” is still somewhat delayed.
Also, the LLM doesn’t really allow you to feed it one token at a time, so you’d have to do a lot of API calls to get a few tokens out at a time, with all of the context provided into the model for each call. That’s pretty inefficient and expensive – you might be better off with a locally hosted model, where you can keep the state in GPU between invocations. (Then again – you’ll have to provision and pay for that GPU, per parallel data stream …)
Not sure I follow you there. Feeding tokens into the model has essentially zero lag, and with streaming enabled you get output tokens back within a few hundred milliseconds. You can send Whisper a few words' worth of audio at a time to get a faster response from spoken input, so by the time the user has spoken their last word and the LLM needs to get to work, you already have 95% of the text transcribed and only need to wait for the last few words. Then, for audio output, you send only a few words to the TTS at a time to keep that latency down.
What are you using for audio transcription? You are blocking on each step yet you could be performing transcription live.
Also, you should be chunking your response as sentence segments to your TTS provider and then streaming the audio result back.
I am using the same workflow without all the blocking for casual conversation and get a response within seconds. To be fair though I built mine for Android so I can take advantage of all the nice hardware and android (google mainly)-specific frameworks. I imagine a browser would be more difficult.
This is a heavy workload for a browser to process. I imagine you are using API calls for every step?
No, streaming is output only. Unless you are sending megabytes of data down a very slow pipe, it should make almost no difference: even the largest 8K prompt would be roughly 32 KB of data to be sent, and the additional lag created by sending, let's say, 300 tokens of data to the model will not make a significant difference.
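A back-of-envelope check of those figures, assuming an average of about 4 bytes of text per token (an approximation; real token lengths vary with language and content):

```python
# Rough payload size of a full 8K-token prompt, assuming ~4 bytes/token.
BYTES_PER_TOKEN = 4
prompt_tokens = 8192
payload_bytes = prompt_tokens * BYTES_PER_TOKEN   # 32768 bytes, ~32 KB

# The marginal 300 tokens mentioned above add only about a kilobyte:
extra_bytes = 300 * BYTES_PER_TOKEN               # 1200 bytes
```

At any plausible connection speed, a kilobyte or two of extra prompt is lost in the noise next to model inference time.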
OK, @Foxabilo, I didn't think inbound streaming was supported, but you still have to wait for the connection to open, and I have to wait for the response from speech-to-text before I can even start the call to the chat endpoint. So streaming is only useful in the last API hit, where I call text-to-speech. Unless I'm missing something.
Correct, but a human-to-human interaction has no expectation of a reply during the speaking phase; each participant waits until the entirety of the other's message has been delivered before responding. Although humans may do some light pre-processing of key points during speech, the majority of the processing is done after the last end-of-sentence marker, whatever that may be.
So while you do have to wait for the entirety of the message to be converted to text, I don't think that alone will make things unbearably long. It's the time to first reply syllable that's key here, and I see no reason why, under ideal circumstances, this could not be sub-1000 ms.
I’m currently using google for Speech-to-Text (because it’s what I’ve used in the past), but if OpenAI could provide a consolidated API where the latest user message were passed as audio, I would move to them.
For TTS, I’m using Watson, I’ll have to see if they have that chunking support for input.
I did look at some offline options for voice-to-text in the browser, but nothing looked enticing.
I'll dig into some of these input options; if I could stream input to the services, I'd imagine that could speed things up.
It looks like Google Speech-to-Text does support streaming input, which, paired with WebRTC, looks promising! However, reading the docs for Google's and Watson's text-to-speech, both require the full text to be submitted before a response is produced.
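For the input side, Google's streaming recognizer consumes a generator of small audio chunks. A minimal sketch of the client-side framing, with the actual google-cloud-speech call left as a comment since the exact setup depends on your credentials and audio config (the frame size and sample rate here are assumptions):

```python
def frame_audio(pcm_bytes, sample_rate=16000, frame_ms=100, bytes_per_sample=2):
    """Split raw 16-bit mono PCM into ~100 ms frames suitable for
    feeding a streaming speech-to-text API one chunk at a time."""
    frame_bytes = sample_rate * bytes_per_sample * frame_ms // 1000
    for start in range(0, len(pcm_bytes), frame_bytes):
        yield pcm_bytes[start:start + frame_bytes]

# Each frame would then be wrapped in a streaming request, e.g. with
# google-cloud-speech (sketch, not tested here):
#   requests = (speech.StreamingRecognizeRequest(audio_content=f)
#               for f in frame_audio(mic_bytes))
#   responses = client.streaming_recognize(streaming_config, requests)

one_second = b"\x00" * (16000 * 2)   # 1 s of silence at 16 kHz, 16-bit mono
frames = list(frame_audio(one_second))
```

With frames flowing in as the user speaks, interim transcripts arrive continuously instead of in one blocking call at the end.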
Drilling into your suggestion to chunk text as sentence segments, I assume you mean something like this?
1. Use the stream parameter on the call to OpenAI chat.
2. Watch for punctuation in the response; when a complete sentence is detected, create a request to the TTS service.
3. Feed the TTS response streams through a common output stream to the browser?
My only concern is that the merged streams could be clunky, like if a subsequent stream doesn't start before the current one completes, or at the junction between two sentences. Either way, it sounds good enough to try.
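The sentence-detection step could be sketched as a generator that buffers streamed text until terminal punctuation appears, then yields each complete sentence for a TTS request (the actual TTS call is left out; this is just the segmentation logic):

```python
import re

# A sentence ends at ., !, or ? followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def sentences(token_stream):
    """Buffer streamed text fragments and yield complete sentences
    as soon as terminal punctuation followed by whitespace appears."""
    buffer = ""
    for fragment in token_stream:
        buffer += fragment
        parts = SENTENCE_END.split(buffer)
        # Everything but the last part is a complete sentence.
        for sentence in parts[:-1]:
            yield sentence
        buffer = parts[-1]
    if buffer.strip():
        yield buffer  # whatever is left when the stream ends

# Stand-in for a streamed chat response:
stream = ["Bonjour! Comment ", "allez-vous? Je vais ", "bien."]
result = list(sentences(stream))
```

Each yielded sentence would become one TTS request, and the resulting audio streams get queued for playback in order.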
Indeed! It’s a PITA. I have a crappy mobile provider and run into this issue… a lot. I feel like some prediction and bandwidth monitoring could be a solution but I never got this far.
Slightly off-topic but there’s a new trending GitHub repo that does local TTS.
I would try to stream at every opportunity and connect them together using some sort of a buffer. So, yes, exactly! Most of my delay comes from my TTS (but I am using ElevenLabs) so streaming the GPT output and chunking it by sentence, and then streaming the result from ElevenLabs has worked well for me when my connection is ok.
I am considering using the above repo to host the TTS part locally! ElevenLabs is expensive asf, so I have also set a toggle to use Google's TTS, but it is boring.
Reading through this thread and thinking of my own experiences with the long wait between recording end, transcription end, and TTS return, I just had the idea to create an artificial split, for example a few words in, once the transcription arrives. One could then send this first short part for TTS and send further chunks in parallel, taking responsibility for puzzling the pieces back together when playing them to the user.
I suppose the biggest downside could be a loss of nuance when the TTS engine cannot correctly infer how to modulate the voice, but it may be worth a try.
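The parallel-requests-with-reassembly idea could look something like this, with `fake_tts` standing in for the real TTS call (which would return audio bytes rather than a labeled string):

```python
from concurrent.futures import ThreadPoolExecutor

def fake_tts(text):
    # Placeholder for a real TTS request; returns a labeled "clip".
    return f"<audio:{text}>"

def synthesize_in_parallel(chunks, tts=fake_tts, workers=4):
    """Fire off TTS requests for all chunks concurrently, but return
    the clips in original order so playback can be stitched together."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order even though the underlying
        # requests run in parallel.
        return list(pool.map(tts, chunks))

clips = synthesize_in_parallel(["Hello there.", "How are you?", "Goodbye."])
```

Because `map` returns results in submission order, the "puzzle the pieces back together" step is just sequential playback of the returned list, even if the later chunks happened to finish first.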
The two steps you can fix somewhat easily there, would be to use a streaming transcription API, and use streaming audio capture/forward to that API.
That would get you words back one or a few seconds after they are spoken.
You can then wait until the user is done, before you infer the answer, so you still have the OpenAI call overhead. You can also stream out when sending the result to text-to-speech.
All in all, I think you could get the latency to be “reasonable.”
E.g., if the last transcribed word arrives one second after the user stops speaking, and you then send it all off for inference, and then you stream the result and start feeding it to text-to-speech, you might see a lag of only a few seconds.
I really appreciate the brainstorm and suggestions from everyone. All this to say, though, there is still a blocking component of the flow: from the initial transcription to calling the GPT chat API with the transcribed text. So I would say my original ask is still relevant, which is OpenAI providing an endpoint that could transcribe audio and then feed that into a chat API call as the last user message. This would eliminate a round trip over the Internet.
The other suggestions about streaming in audio and chunking text to audio on the way out are still relevant, and would be layered on top of this new endpoint.
I think that you might be a bit hung up on the transport times and the milliseconds of overhead difference between a Whisper API server and a GPT API server; they could be on opposite sides of the world, such is the nature of distributed load balancing.