Seeking Guidance on Whisper API for End of Speech Detection for Transcription

Hello Everyone,

I’m currently working with OpenAI’s Whisper API and have been pleased with the results, particularly in terms of the speech recognition quality it provides. My project involves developing an application where the functionality is centered around the user speaking into a microphone and then having a transcription of their speech displayed once they finish speaking.

As I delve deeper into the process, I’ve identified a crucial need for an effective method to detect when the user has finished speaking. This end-of-speech detection would allow the system to trigger the transcription process, providing a streamlined user experience where their speech is transcribed only after they’ve concluded their thoughts.

I have thoroughly gone through the Whisper API documentation, but haven't been able to find specific details about end-of-speech detection.

So, my question is, does the Whisper API provide any capabilities or mechanisms to identify when a user stops speaking, and only then initiate the transcription process? I realize that this may not be a straightforward problem and there might be various factors at play. However, I’d appreciate any pointers or directions.

Thank you in advance for your help.

Hi @DawidM

The audio transcription API simply takes an audio file in a supported format under 25MB and a model name, among other params, and returns its transcript in the requested format.

The feature you suggested will have to be developed in your client application.

I recommend reading the docs and API reference for a better understanding.