Personally, I would look at putting something between the raw audio stream and Whisper that chops the audio into pieces at breaks marked by lower volume and longer duration. Another option is to use two APIs together: one for speaker diarization and sentence detection, with its output then sent to Whisper for more precise speech-to-text (as both sound and prompt). I'm not really an expert in this (yet), though.
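To make the chopping idea a bit more concrete, here is a rough sketch in plain Python. It assumes the audio is already decoded into a list of float samples in the range -1.0 to 1.0; the function name `split_on_silence` and the threshold values are my own placeholders, not part of Whisper or any library, and would need tuning on real recordings.

```python
import math

def split_on_silence(samples, sample_rate, frame_ms=20,
                     silence_rms=0.02, min_silence_ms=300):
    """Return (start, end) sample-index pairs for the non-silent chunks.

    Hypothetical helper: all thresholds are assumptions for illustration.
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    min_silent_frames = max(1, min_silence_ms // frame_ms)

    # Mark each frame as loud or quiet by its RMS amplitude.
    loud = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        loud.append(rms >= silence_rms)

    chunks = []
    start = None     # index of the first frame of the current chunk
    quiet_run = 0    # consecutive quiet frames seen inside the chunk
    for idx, is_loud in enumerate(loud):
        if is_loud:
            if start is None:
                start = idx
            quiet_run = 0
        elif start is not None:
            quiet_run += 1
            # A long enough quiet stretch closes the current chunk.
            if quiet_run >= min_silent_frames:
                end = idx - quiet_run + 1  # frame right after the last loud one
                chunks.append((start * frame_len,
                               min(end * frame_len, len(samples))))
                start = None
                quiet_run = 0

    # Close a chunk that is still open when the audio ends.
    if start is not None:
        end = len(loud) - quiet_run
        chunks.append((start * frame_len, min(end * frame_len, len(samples))))
    return chunks
```

Each returned pair could then be used to slice the samples and send that piece to Whisper on its own. Short pauses (below `min_silence_ms`) stay inside a chunk, so words are less likely to be cut in half.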
Personally, I would look for a way to use gRPC for this type of task.
Thank you for the advice, Serge. I will take it into consideration.