timestamp_granularities="word" does not match generated transcript

When using timestamp_granularities="word" I get the following transcript:

I went to school when I was in a test read, you know. My mother and father left me standing on a street corner. It's always up to me to make it.

and the following word object:

{
    "duration": 9.989999771118164,
    "segments": [
        {
            "word": "I",
            "start": 1.2000000476837158,
            "end": 1.4800000190734863
        },
        {
            "word": "went",
            "start": 1.4800000190734863,
            "end": 1.6799999475479126
        },
        {
            "word": "to",
            "start": 1.6799999475479126,
            "end": 1.7999999523162842
        },
        {
            "word": "school",
            "start": 1.7999999523162842,
            "end": 1.8600000143051147
        },
        {
            "word": "when",
            "start": 1.8600000143051147,
            "end": 2.0199999809265137
        },
        {
            "word": "I",
            "start": 2.0199999809265137,
            "end": 2.0199999809265137
        },
        {
            "word": "was",
            "start": 2.0199999809265137,
            "end": 2.0799999237060547
        },
        {
            "word": "in",
            "start": 2.0799999237060547,
            "end": 2.180000066757202
        },
        {
            "word": "a",
            "start": 2.180000066757202,
            "end": 2.240000009536743
        },
        {
            "word": "test",
            "start": 2.240000009536743,
            "end": 2.3399999141693115
        },
        {
            "word": "read",
            "start": 2.3399999141693115,
            "end": 2.4600000381469727
        },
        {
            "word": "you",
            "start": 2.4600000381469727,
            "end": 2.8399999141693115
        },
        {
            "word": "know",
            "start": 2.8399999141693115,
            "end": 2.8399999141693115
        },
        {
            "word": "My",
            "start": 3.680000066757202,
            "end": 4.539999961853027
        },
        {
            "word": "mother",
            "start": 4.539999961853027,
            "end": 4.619999885559082
        },
        {
            "word": "and",
            "start": 4.619999885559082,
            "end": 4.860000133514404
        },
        {
            "word": "father",
            "start": 4.860000133514404,
            "end": 4.860000133514404
        },
        {
            "word": "left",
            "start": 4.860000133514404,
            "end": 4.920000076293945
        },
        {
            "word": "me",
            "start": 4.920000076293945,
            "end": 5.099999904632568
        },
        {
            "word": "standing",
            "start": 5.099999904632568,
            "end": 5.159999847412109
        },
        {
            "word": "on",
            "start": 5.159999847412109,
            "end": 5.300000190734863
        },
        {
            "word": "a",
            "start": 5.300000190734863,
            "end": 5.300000190734863
        },
        {
            "word": "street",
            "start": 5.300000190734863,
            "end": 5.440000057220459
        },
        {
            "word": "corner",
            "start": 5.440000057220459,
            "end": 5.599999904632568
        },
        {
            "word": "It's",
            "start": 6.519999980926514,
            "end": 7.079999923706055
        },
        {
            "word": "always",
            "start": 7.079999923706055,
            "end": 7.21999979019165
        },
        {
            "word": "up",
            "start": 7.21999979019165,
            "end": 7.320000171661377
        },
        {
            "word": "to",
            "start": 7.320000171661377,
            "end": 7.360000133514404
        },
        {
            "word": "me",
            "start": 7.360000133514404,
            "end": 7.420000076293945
        },
        {
            "word": "to",
            "start": 7.420000076293945,
            "end": 7.559999942779541
        },
        {
            "word": "make",
            "start": 7.559999942779541,
            "end": 7.579999923706055
        },
        {
            "word": "it",
            "start": 7.579999923706055,
            "end": 7.760000228881836
        },
        {
            "word": "I",
            "start": 8.5,
            "end": 9.300000190734863
        },
        {
            "word": "had",
            "start": 9.300000190734863,
            "end": 9.380000114440918
        },
        {
            "word": "a",
            "start": 9.380000114440918,
            "end": 9.520000457763672
        },
        {
            "word": "job",
            "start": 9.520000457763672,
            "end": 9.680000305175781
        },
        {
            "word": "Mama",
            "start": 9.779999732971191,
            "end": 9.880000114440918
        }
    ]
}

Note that the words “I”, “had”, “a”, “job”, “Mama” at the end of the words array do not exist in the original transcription.

Here is the audio file: out006.mp3 (Google Drive link). I realise the quality is garbage, but I would like a guarantee that the words in both returned transcripts are the same.

Is there anything I can do to guard against this issue?

Either that or a way to get punctuation included at the word level?
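As a defensive check rather than a fix, something like the sketch below can at least detect when the two outputs diverge. It assumes the OpenAI Python SDK and that the word timestamps come back under `words` in the verbose JSON response (the dump above shows them under `segments`, so adjust the field name to whatever your response actually contains):

```python
import re

from openai import OpenAI

client = OpenAI()


def words_match_transcript(path: str) -> bool:
    """Transcribe `path` with word timestamps and report whether the
    word array agrees with the transcript text after normalisation."""
    with open(path, "rb") as f:
        resp = client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
            response_format="verbose_json",
            timestamp_granularities=["word"],
        )

    def tokens(text: str) -> list[str]:
        # Lowercase and strip punctuation so "It's" and "it's" compare equal.
        return re.findall(r"[a-z0-9']+", text.lower())

    transcript_tokens = tokens(resp.text)
    word_tokens = [t for w in resp.words for t in tokens(w.word)]
    return transcript_tokens == word_tokens
```

Files where this returns False could then be queued for a retry or manual review instead of being trusted blindly.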

Two suggestions:

  1. transcribe the audio at normal speed instead of applying a time-slicing speedup
  2. normalize the audio by +18 dB so it can be heard (a sketch of this step follows the list)
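A minimal sketch of point 2, assuming ffmpeg is on the PATH and using its volume filter; the file names are placeholders:

```python
import subprocess


def boost_gain(src: str, dst: str, gain_db: float = 18) -> None:
    """Apply a flat +18 dB gain and decode to 16 kHz mono PCM WAV,
    so no further lossy encoding is involved."""
    subprocess.run(
        [
            "ffmpeg", "-y", "-i", src,
            "-af", f"volume={gain_db}dB",
            "-ar", "16000", "-ac", "1",
            dst,
        ],
        check=True,
    )


boost_gain("out006.mp3", "out006_boosted.wav")
```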

The word granularity is likely not tuned for 10 words a second.

As I said, I know the file is garbage - I just tested a wide variety of files and it failed on this particular one. I added this one because it is useful for showing my point. I will use this function to process hundreds of thousands of user-submitted files, and I want a guarantee that the words array will match up with the transcript returned. This has not been the case in several of the files I have tested.

You cannot improve the AI model. It is what it is.

You can improve your inputs.

You can avoid sending damaged files, and offer a “service” by doing what the user could have done themselves: ffmpeg-normalize to standard levels, adaptive gain, passband limiting, etc.
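A sketch of that kind of clean-up pass, using ffmpeg’s loudnorm, highpass, and lowpass filters via subprocess. The target levels and cutoff frequencies here are illustrative assumptions, not values from this thread:

```python
import subprocess


def clean_for_whisper(src: str, dst: str) -> None:
    """One-pass loudness normalisation plus a rough speech passband.
    The loudnorm targets and 100 Hz - 8 kHz band are illustrative."""
    subprocess.run(
        [
            "ffmpeg", "-y", "-i", src,
            # EBU R128 loudness normalisation to a standard level,
            # then keep roughly the speech band.
            "-af", "loudnorm=I=-16:TP=-1.5:LRA=11,"
                   "highpass=f=100,lowpass=f=8000",
            "-ar", "16000", "-ac", "1",
            dst,
        ],
        check=True,
    )
```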

If you’re going to apply some degree of speedup before submitting, you can do it without passing the audio through a lossy time-bin-to-frequency FFT codec.
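One way to do that, as a sketch: perform the tempo change on decoded PCM with ffmpeg’s atempo filter and write lossless WAV, so the only lossy encode in the chain is the user’s original upload. The 1.5x factor is just an example:

```python
import subprocess


def speed_up_lossless(src: str, dst: str, factor: float = 1.5) -> None:
    """Time-stretch in the PCM domain and write lossless WAV,
    avoiding a second pass through a lossy codec."""
    subprocess.run(
        [
            "ffmpeg", "-y", "-i", src,
            "-af", f"atempo={factor}",
            dst,  # e.g. "out006_fast.wav"
        ],
        check=True,
    )
```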

user submitted: [original audio clip]

remove some damage: [processed audio clip]