I’ve been using the Whisper model for speech-to-text, and in my testing I’ve run the transcription on the same short audio file hundreds of times (a 2-minute segment from a stand-up comedy show).
Exactly once, yesterday, Whisper responded not only with the transcription of the comedian’s speech, but also with additional entries in the JSON response indicating laughter, applause, and cheers!
I ran literally the same calls afterward, and the laughter, applause, and cheers annotations were gone from the JSON.
- Does anyone know why that might have happened?
- How can I get the laughter/applause/cheers annotations to appear again?
- Does any other speech-to-text model support these events?
I am using whisper-1, with response_format=verbose_json and timestamp_granularities=["word"]
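For reference, the call looks roughly like this (a minimal sketch assuming the official `openai` Python SDK v1.x; the file name `standup_clip.mp3` is just a placeholder, not the actual file):

```python
import os

def transcription_kwargs():
    # The exact request parameters mentioned in this post.
    return {
        "model": "whisper-1",
        "response_format": "verbose_json",
        "timestamp_granularities": ["word"],
    }

# Only attempt a real request when an API key is configured.
if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI  # official SDK, v1.x

    client = OpenAI()
    with open("standup_clip.mp3", "rb") as f:  # placeholder file name
        resp = client.audio.transcriptions.create(file=f, **transcription_kwargs())
    print(resp.text)
```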
Example of when I got it:
{
"word": "stage",
"start": 10.600000381469727,
"end": 10.600000381469727
},
{
"word": "Mateo",
"start": 10.680000305175781,
"end": 11.579999923706055
},
{
"word": "Ray",
"start": 11.579999923706055,
"end": 12.15999984741211
}
]
},
{
"text": "【applause & cheers】",
"start": 19.940000534057617,
"end": 21.940000534057617,
"words": [
{
"word": "【applause",
"start": 19.940000534057617,
"end": 20.68000030517578
},
{
"word": "",
"start": 20.68000030517578,
"end": 21.420000076293945
},
{
"word": "cheers】",
"start": 21.420000076293945,
"end": 21.940000534057617
}
]
},
{
"text": "Hey. Thank you, thank you.",
"start": 22.639999389648438,
"end": 26.6200008392334,
"words": [
Versus running it again today, when I don’t see these events:
{
"word": "stage",
"start": 10.180000305175781,
"end": 10.579999923706055
},
{
"word": "Matteo",
"start": 10.699999809265137,
"end": 11.65999984741211
},
{
"word": "Ray",
"start": 11.65999984741211,
"end": 12.34000015258789
}
]
},
{
"text": "Thank you, thank you, thank you.",
"start": 23.68000030517578,
"end": 29.239999771118164,
"words": [
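In case it helps others reproduce or detect this: here is a small post-processing sketch (my own, not an official API feature) that scans the segments of a verbose_json response for bracketed non-speech annotations like the 【applause & cheers】 entry above. The sample segment dicts are hand-built to match the shape of the first response.

```python
import re

# The 【…】 delimiters are the ones observed in the response above.
EVENT_TOKEN = re.compile(r"【|】")

def extract_events(segments):
    """Return (start, end, text) for segments whose text looks like a
    bracketed non-speech annotation rather than transcribed speech."""
    events = []
    for seg in segments:
        if EVENT_TOKEN.search(seg.get("text", "")):
            events.append((seg["start"], seg["end"], seg["text"]))
    return events

# Minimal data shaped like the first response in this post.
segments = [
    {"text": "【applause & cheers】", "start": 19.94, "end": 21.94},
    {"text": "Hey. Thank you, thank you.", "start": 22.64, "end": 26.62},
]
print(extract_events(segments))
# → [(19.94, 21.94, '【applause & cheers】')]
```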