I am using the Whisper API and I can’t figure this out.
If the audio starts in the middle of a sentence, it skips a large part of the transcription. Is this intentional? Does it wait for the next logical segment to start?
Here is one example
And here is the transcription I got:
“What do you think is his greatest strength? I think people have been talking in the past 12 months or so about his game consisting of certain elements from Roger, Rafa, and myself. I would agree with that. He’s got the best of all three worlds. He’s got this mental resilience and maturity for someone who is 20 years old. It’s quite impressive. He’s got this Spanish bull mentality of competitiveness and fighting spirit and incredible defense that we’ve seen with Rafa over the years. I think he’s got some nice sliding backhands that he’s got.”
This happens fairly often; I could upload some more examples if needed.
That is pretty odd, but you are not the first person I have heard this from. I’ve never had this problem myself, though my situation is a little different.
I have a voice activity detector that I built from a tutorial online, using this article.
It naturally breaks the audio at sentence boundaries and pauses, and then I process each section. The average section is about 7 seconds long, and I get really good results. I didn’t choose this method on purpose, though; it is for voice commands, so it was a natural choice.
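For anyone curious what that kind of splitting looks like, here is a minimal sketch of an energy-based voice activity detector. It assumes mono PCM samples as a list of floats; the frame size, energy threshold, and silence length are illustrative guesses, not the values from my setup.

```python
def split_on_silence(samples, rate=16000, frame_ms=30,
                     energy_thresh=0.01, min_silence_frames=10):
    """Return (start, end) sample indices of voiced segments."""
    frame_len = rate * frame_ms // 1000
    segments, seg_start, seg_end, silent_run = [], None, None, 0
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        # Mean energy of the frame decides voiced vs. silent.
        energy = sum(s * s for s in frame) / max(len(frame), 1)
        if energy >= energy_thresh:           # voiced frame
            if seg_start is None:
                seg_start = i
            seg_end = i + len(frame)
            silent_run = 0
        elif seg_start is not None:           # silence inside a segment
            silent_run += 1
            if silent_run >= min_silence_frames:
                segments.append((seg_start, seg_end))
                seg_start, silent_run = None, 0
    if seg_start is not None:                 # close a trailing segment
        segments.append((seg_start, seg_end))
    return segments
```

Each returned segment can then be written out and sent to Whisper on its own.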
Maybe a similar approach can work for you. If not, maybe preprocessing the audio to maintain a certain quality can help.
I am using the HARK library for voice recognition. In this case, I am cutting the audio into chunks of around 90 seconds each, with HARK recognizing the pauses where it can cut.
What did you have in mind for preprocessing? What would help in this case? I don’t think the audio here is too bad.
Not being an American speaker might also affect it a bit, if the training data has less material to discern the patterns of unusual voices. On top of that, you drop it right into the middle of a sentence without context, and the exchange is between two speakers of different audio quality.
How will it perform if you have overlaps of different lengths, I wonder.
I can also imagine a warmup may help. For a supervised job, you could place a short, clean, strippable clip of the primary speaker’s speech at the start of each chunk. This also guards against glitches, like the model deciding from initial garble that the rest of the transcript must be in the wrong language.
If this is a series of audio data, you can use the result of the previous transcription as the prompt in the next Whisper API call.
I tested this on some simple audio where I cut the first clip mid-sentence and the second clip continues it. In my case, when prompting is used, it seamlessly connects the transcription with the previous one.
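The chaining itself is just a loop that carries the last result forward. Here is a sketch; `transcribe` stands in for the actual Whisper call (the API’s transcription endpoint accepts a `prompt` parameter) and is injected so the logic is testable on its own. The 800-character cap is an illustrative guess, since the model only considers the tail of a long prompt.

```python
def transcribe_chunks(chunks, transcribe, max_prompt_chars=800):
    """Transcribe chunks in order, feeding each result as the next prompt."""
    texts, prompt = [], ""
    for chunk in chunks:
        text = transcribe(chunk, prompt=prompt)
        texts.append(text)
        # Keep only the tail of the accumulated text as the next prompt.
        prompt = (prompt + " " + text)[-max_prompt_chars:].strip()
    return " ".join(texts)
```

With the real API, `transcribe` would open the chunk’s file and pass `model="whisper-1"` plus the `prompt` to the transcription endpoint.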
If you need to break it up, then break it into 25 MB pieces, the largest the API accepts. It’s probably using the context of surrounding words … so mid-sentence chops every 90 seconds could be the issue.
Well if HARK is successfully detecting the voices and appropriately sending the audio to your processing pipeline, then it might not be the audio. I was initially thinking that the audio for the interviewer was kind of poor and there were enough artifacts to push it outside the range of the average voice regularly enough that only tiny snippets of it would get captured. This happened to me occasionally with my poor mic of similar quality.
Buttt… that doesn’t account for the first speaker’s missing transcripts. Depending on how mature your project is, maybe you should look at the audio snippet storage; maybe there is some accidental overwriting happening. I did exactly that in the first days of a voice project I am working on.
This sounds like a good idea, I will give it a try.
@curt.kennedy I am cutting the audio into 90-second chunks because I need the transcribed data early during live events. I want the transcription to appear in chunks, and I think 90 seconds is a good balance between quick transcription and not cutting the audio into too-short chunks.
If you are having boundary issues, you might try to employ some overlap strategy.
Overlap by 10 seconds or so on each end, then correlate and remove the repeated stuff inside the overlap region. You can try various combinations of this, but overlap and post processing should help solve or severely reduce these issues.
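A simple way to do the “correlate and remove” step is to find the longest word sequence that ends chunk A and begins chunk B, then drop the duplicate. This exact-match sketch assumes the overlap was transcribed identically in both chunks; the 40-word cap is an illustrative bound on a roughly 10-second overlap.

```python
def merge_overlap(a, b, max_overlap_words=40):
    """Join transcripts a and b, removing a duplicated overlap region."""
    a_words, b_words = a.split(), b.split()
    limit = min(max_overlap_words, len(a_words), len(b_words))
    for n in range(limit, 0, -1):             # try the longest match first
        if a_words[-n:] == b_words[:n]:
            return " ".join(a_words + b_words[n:])
    return " ".join(a_words + b_words)        # no overlap found
```

When Whisper transcribes the overlap slightly differently in each chunk, the exact comparison fails and some fuzzy matching is needed instead.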
I was just discussing possible solutions like this with ChatGPT this morning. It recommended the FuzzyWuzzy library for Python, which could help with this. I guess I would need to work out how many words fall inside those 10 seconds in both chunks, and then compare both chunks in order to remove the repeated words.
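If you want to avoid an extra dependency, the stdlib’s `difflib.SequenceMatcher` gives a similar 0–1 similarity ratio to FuzzyWuzzy. A sketch of deciding whether the tail of chunk A and the head of chunk B are near-duplicates; the 0.8 threshold is an illustrative guess, not a tuned value.

```python
from difflib import SequenceMatcher

def overlap_is_duplicate(tail_a, head_b, threshold=0.8):
    """True if two overlap transcriptions are similar enough to deduplicate."""
    ratio = SequenceMatcher(None, tail_a.lower(), head_b.lower()).ratio()
    return ratio >= threshold
```

This tolerates small transcription differences (casing, a dropped letter) that an exact string comparison would miss.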
I was thinking of using embedding vectors to compare the overlap vs. non-overlap cases. You could embed each word and compare the overlaps, or chunks of words, or the whole 90-second chunk of words, until you get good agreement in the embedding vectors.
You could also compare the strings directly, that’s easy but could be risky.