Timestamp_granularities="word" does not match generated transcript

When using timestamp_granularities="word", I get the following transcript:

I went to school when I was in a test read, you know. My mother and father left me standing on a street corner. It's always up to me to make it.

and the following words array:

{
    "duration": 9.989999771118164,
    "segments": [
        {
            "word": "I",
            "start": 1.2000000476837158,
            "end": 1.4800000190734863
        },
        {
            "word": "went",
            "start": 1.4800000190734863,
            "end": 1.6799999475479126
        },
        {
            "word": "to",
            "start": 1.6799999475479126,
            "end": 1.7999999523162842
        },
        {
            "word": "school",
            "start": 1.7999999523162842,
            "end": 1.8600000143051147
        },
        {
            "word": "when",
            "start": 1.8600000143051147,
            "end": 2.0199999809265137
        },
        {
            "word": "I",
            "start": 2.0199999809265137,
            "end": 2.0199999809265137
        },
        {
            "word": "was",
            "start": 2.0199999809265137,
            "end": 2.0799999237060547
        },
        {
            "word": "in",
            "start": 2.0799999237060547,
            "end": 2.180000066757202
        },
        {
            "word": "a",
            "start": 2.180000066757202,
            "end": 2.240000009536743
        },
        {
            "word": "test",
            "start": 2.240000009536743,
            "end": 2.3399999141693115
        },
        {
            "word": "read",
            "start": 2.3399999141693115,
            "end": 2.4600000381469727
        },
        {
            "word": "you",
            "start": 2.4600000381469727,
            "end": 2.8399999141693115
        },
        {
            "word": "know",
            "start": 2.8399999141693115,
            "end": 2.8399999141693115
        },
        {
            "word": "My",
            "start": 3.680000066757202,
            "end": 4.539999961853027
        },
        {
            "word": "mother",
            "start": 4.539999961853027,
            "end": 4.619999885559082
        },
        {
            "word": "and",
            "start": 4.619999885559082,
            "end": 4.860000133514404
        },
        {
            "word": "father",
            "start": 4.860000133514404,
            "end": 4.860000133514404
        },
        {
            "word": "left",
            "start": 4.860000133514404,
            "end": 4.920000076293945
        },
        {
            "word": "me",
            "start": 4.920000076293945,
            "end": 5.099999904632568
        },
        {
            "word": "standing",
            "start": 5.099999904632568,
            "end": 5.159999847412109
        },
        {
            "word": "on",
            "start": 5.159999847412109,
            "end": 5.300000190734863
        },
        {
            "word": "a",
            "start": 5.300000190734863,
            "end": 5.300000190734863
        },
        {
            "word": "street",
            "start": 5.300000190734863,
            "end": 5.440000057220459
        },
        {
            "word": "corner",
            "start": 5.440000057220459,
            "end": 5.599999904632568
        },
        {
            "word": "It's",
            "start": 6.519999980926514,
            "end": 7.079999923706055
        },
        {
            "word": "always",
            "start": 7.079999923706055,
            "end": 7.21999979019165
        },
        {
            "word": "up",
            "start": 7.21999979019165,
            "end": 7.320000171661377
        },
        {
            "word": "to",
            "start": 7.320000171661377,
            "end": 7.360000133514404
        },
        {
            "word": "me",
            "start": 7.360000133514404,
            "end": 7.420000076293945
        },
        {
            "word": "to",
            "start": 7.420000076293945,
            "end": 7.559999942779541
        },
        {
            "word": "make",
            "start": 7.559999942779541,
            "end": 7.579999923706055
        },
        {
            "word": "it",
            "start": 7.579999923706055,
            "end": 7.760000228881836
        },
        {
            "word": "I",
            "start": 8.5,
            "end": 9.300000190734863
        },
        {
            "word": "had",
            "start": 9.300000190734863,
            "end": 9.380000114440918
        },
        {
            "word": "a",
            "start": 9.380000114440918,
            "end": 9.520000457763672
        },
        {
            "word": "job",
            "start": 9.520000457763672,
            "end": 9.680000305175781
        },
        {
            "word": "Mama",
            "start": 9.779999732971191,
            "end": 9.880000114440918
        }
    ]
}

Note that the words “I”, “had”, “a”, “job”, “Mama” at the end of the words array do not exist in the original transcription.

Here is the audio file: out006.mp3 - Google Drive (I realise the quality is garbage, but I would like a guarantee that the words in the transcript text and the words array are the same).

Is there anything I can do to guard against this issue?

Failing that, is there a way to get punctuation included at the word level?
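For context, the consistency check I would want to pass is roughly the following (a minimal sketch in Python, assuming the response shape shown above where each entry carries a "word" key; the helper name is hypothetical):

import re

def words_match_transcript(transcript, words):
    # Lowercase and keep only letters, digits, and apostrophes, so
    # punctuation and capitalization differences don't cause false alarms.
    def norm(text):
        return re.findall(r"[a-z0-9']+", text.lower())
    transcript_tokens = norm(transcript)
    timestamp_tokens = [t for w in words for t in norm(w["word"])]
    return transcript_tokens == timestamp_tokens

For the response above this returns False, since "I had a job Mama" appears only in the words array.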

  1. Transcribe the audio at normal speed instead of applying a time-slicing speedup.
  2. Normalize the audio by +18 dB so it can be heard (a sketch of this follows below).
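For item 2, a rough sketch of the gain boost by calling ffmpeg from Python; the PCM WAV output and 16 kHz mono format are my assumptions for speech input, not requirements:

import subprocess

def boost_gain(src, dst, gain_db=18.0):
    # Apply a fixed gain boost and write lossless PCM WAV.
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-filter:a", f"volume={gain_db}dB",  # fixed +18 dB boost
         "-ar", "16000", "-ac", "1",          # 16 kHz mono for speech
         dst],
        check=True,
    )

boost_gain("out006.mp3", "out006_boosted.wav")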

The word granularity is likely not tuned for ten words per second.

As I said, I know the file is garbage; I tested a wide variety of files and it failed on this particular one. I included it because it usefully illustrates my point. I will use this function to process hundreds of thousands of user-submitted files, and I need a guarantee that the words array will match the returned transcript. That has not been the case in several of the files I have tested.

You cannot improve the AI model. It is what it is.

You can improve your inputs.

You can avoid sending damaged files, and offer a “service” by doing what the user could have done themselves: ffmpeg-normalize to standard levels, adaptive gain, passband limiting, etc.

If you’re going to submit sped-up audio, you can do it without passing the result through a lossy time-bin-to-frequency (FFT-based) codec such as MP3; see the sketch below.
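As a sketch of both points combined, one ffmpeg call can normalize loudness, band-limit to the speech range, apply the speedup, and write lossless PCM WAV so no lossy codec touches the processed audio. The cutoff frequencies and tempo factor here are illustrative assumptions, not tuned values:

import subprocess

def preprocess_for_transcription(src, dst, speed=1.5):
    audio_filters = ",".join([
        "highpass=f=80",    # drop rumble below the speech band
        "lowpass=f=8000",   # drop hiss above it
        "loudnorm",         # EBU R128 loudness normalization
        f"atempo={speed}",  # speedup without pitch shift
    ])
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-af", audio_filters,
         "-c:a", "pcm_s16le", dst],  # lossless PCM, no FFT codec
        check=True,
    )

preprocess_for_transcription("user_upload.mp3", "cleaned.wav")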

user submitted: [audio attachment]

remove some damage: [audio attachment]

Using the API with fairly high-quality audio, I get words in the transcript that are not in the word-level segmentation, and words in the word-level segmentation that are not in the transcript. In both cases the words are legitimately in the audio, which is a superset of both. Is this somehow a two-pass process?
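For reference, a rough way to see exactly which words differ between the two outputs (a sketch using Python's difflib; the helper name and normalization are my own):

import difflib
import re

def word_diff(transcript, words):
    def norm(text):
        return re.findall(r"[a-z0-9']+", text.lower())
    a = norm(transcript)
    b = [t for w in words for t in norm(w["word"])]
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    # Report only the spans where the two token sequences disagree.
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]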