I am adding a voice feature to a chat application. At first I was thinking of using Whisper together with the Assistant, but now I'm considering attaching the audio to the Assistant directly to reduce latency. Any thoughts on this approach?
No API language models currently accept multimodal audio input, so your approach isn't workable at this time.
Nor do any models support doing anything productive with partial text input, such as pre-processing it as it arrives, so you would still have to send the completed thought to the AI model.
There are, however, techniques you could use to accelerate the Whisper transcription. Foremost would be using silence detection to find split points in the audio, so that chunks can be sent to Whisper, either a local open-source version or the one on OpenAI's API.
This lets you run parallel transcriptions on long audio, or transcribe while the speech input is still ongoing. The API's "prompt" parameter lets you feed the previous transcription back in as a starting point to continue from, if you were to build a streaming transcriber.
Silence doesn’t necessarily indicate the end of a sentence, so the re-joined product may not be of the same quality, but the AI can usually tolerate and overlook a few misinterpreted words.
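For illustration, here is a rough sketch of that idea in Python, assuming the pydub library for silence-based splitting and the current openai SDK for the transcription call. The silence thresholds, temporary file names, and the 200-character prompt window are arbitrary choices you would tune for your own audio:

```python
from openai import OpenAI
from pydub import AudioSegment
from pydub.silence import split_on_silence

client = OpenAI()

def transcribe_in_chunks(path: str) -> str:
    audio = AudioSegment.from_file(path)

    # Split where the speaker pauses; these thresholds are guesses to tune
    # per microphone and room.
    chunks = split_on_silence(
        audio,
        min_silence_len=700,                 # ms of silence that counts as a split point
        silence_thresh=audio.dBFS - 16,      # relative to the clip's average loudness
        keep_silence=300,                    # keep a little padding so words aren't clipped
    )

    transcript = ""
    for i, chunk in enumerate(chunks):
        chunk_path = f"chunk_{i}.mp3"
        chunk.export(chunk_path, format="mp3")

        with open(chunk_path, "rb") as f:
            result = client.audio.transcriptions.create(
                model="whisper-1",
                file=f,
                # Feed the transcript so far back in so each chunk continues
                # in the same context and style as the previous ones.
                prompt=transcript[-200:],
            )
        transcript += " " + result.text

    return transcript.strip()

if __name__ == "__main__":
    print(transcribe_in_chunks("recording.wav"))
```

The same loop could instead be run over a thread pool to transcribe long pre-recorded audio in parallel, at the cost of losing the prompt chaining between chunks.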
Hi @siddiquiowais390 and welcome to the community!
My gut feeling is that, since the Assistants API is in general much slower and is the bottleneck, it won't make much of a difference. But I would be interested to see your results/comparison if you do try!
Another approach, if you want to cut down transcription latency, is to use a fast local solution for that. I would recommend looking at whisper.cpp since it's notoriously fast, though it depends of course on how you run your app.
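If you do go the local route, one simple pattern is to shell out to the whisper.cpp CLI from your app. A minimal sketch, assuming you have already built whisper.cpp and downloaded a ggml model; the binary name, model path, and input file here are placeholders for your own setup:

```python
import subprocess

def transcribe_local(wav_path: str) -> str:
    """Run a locally built whisper.cpp binary and return the transcribed text.

    Assumes whisper.cpp is built and a ggml model has been downloaded;
    whisper.cpp expects 16 kHz mono WAV input by default.
    """
    result = subprocess.run(
        [
            "./main",                          # whisper.cpp CLI binary
            "-m", "models/ggml-base.en.bin",   # whichever ggml model you downloaded
            "-f", wav_path,
            "--no-timestamps",                 # plain text output, no segment timestamps
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()

print(transcribe_local("recording.wav"))
```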