Timestamp_granularities="word" does not match generated transcript

When using timestamp_granularities="word", I get the following transcript:

I went to school when I was in a test read, you know. My mother and father left me standing on a street corner. It's always up to me to make it.

and the following words array:

{
    "duration": 9.989999771118164,
    "segments": [
        {
            "word": "I",
            "start": 1.2000000476837158,
            "end": 1.4800000190734863
        },
        {
            "word": "went",
            "start": 1.4800000190734863,
            "end": 1.6799999475479126
        },
        {
            "word": "to",
            "start": 1.6799999475479126,
            "end": 1.7999999523162842
        },
        {
            "word": "school",
            "start": 1.7999999523162842,
            "end": 1.8600000143051147
        },
        {
            "word": "when",
            "start": 1.8600000143051147,
            "end": 2.0199999809265137
        },
        {
            "word": "I",
            "start": 2.0199999809265137,
            "end": 2.0199999809265137
        },
        {
            "word": "was",
            "start": 2.0199999809265137,
            "end": 2.0799999237060547
        },
        {
            "word": "in",
            "start": 2.0799999237060547,
            "end": 2.180000066757202
        },
        {
            "word": "a",
            "start": 2.180000066757202,
            "end": 2.240000009536743
        },
        {
            "word": "test",
            "start": 2.240000009536743,
            "end": 2.3399999141693115
        },
        {
            "word": "read",
            "start": 2.3399999141693115,
            "end": 2.4600000381469727
        },
        {
            "word": "you",
            "start": 2.4600000381469727,
            "end": 2.8399999141693115
        },
        {
            "word": "know",
            "start": 2.8399999141693115,
            "end": 2.8399999141693115
        },
        {
            "word": "My",
            "start": 3.680000066757202,
            "end": 4.539999961853027
        },
        {
            "word": "mother",
            "start": 4.539999961853027,
            "end": 4.619999885559082
        },
        {
            "word": "and",
            "start": 4.619999885559082,
            "end": 4.860000133514404
        },
        {
            "word": "father",
            "start": 4.860000133514404,
            "end": 4.860000133514404
        },
        {
            "word": "left",
            "start": 4.860000133514404,
            "end": 4.920000076293945
        },
        {
            "word": "me",
            "start": 4.920000076293945,
            "end": 5.099999904632568
        },
        {
            "word": "standing",
            "start": 5.099999904632568,
            "end": 5.159999847412109
        },
        {
            "word": "on",
            "start": 5.159999847412109,
            "end": 5.300000190734863
        },
        {
            "word": "a",
            "start": 5.300000190734863,
            "end": 5.300000190734863
        },
        {
            "word": "street",
            "start": 5.300000190734863,
            "end": 5.440000057220459
        },
        {
            "word": "corner",
            "start": 5.440000057220459,
            "end": 5.599999904632568
        },
        {
            "word": "It's",
            "start": 6.519999980926514,
            "end": 7.079999923706055
        },
        {
            "word": "always",
            "start": 7.079999923706055,
            "end": 7.21999979019165
        },
        {
            "word": "up",
            "start": 7.21999979019165,
            "end": 7.320000171661377
        },
        {
            "word": "to",
            "start": 7.320000171661377,
            "end": 7.360000133514404
        },
        {
            "word": "me",
            "start": 7.360000133514404,
            "end": 7.420000076293945
        },
        {
            "word": "to",
            "start": 7.420000076293945,
            "end": 7.559999942779541
        },
        {
            "word": "make",
            "start": 7.559999942779541,
            "end": 7.579999923706055
        },
        {
            "word": "it",
            "start": 7.579999923706055,
            "end": 7.760000228881836
        },
        {
            "word": "I",
            "start": 8.5,
            "end": 9.300000190734863
        },
        {
            "word": "had",
            "start": 9.300000190734863,
            "end": 9.380000114440918
        },
        {
            "word": "a",
            "start": 9.380000114440918,
            "end": 9.520000457763672
        },
        {
            "word": "job",
            "start": 9.520000457763672,
            "end": 9.680000305175781
        },
        {
            "word": "Mama",
            "start": 9.779999732971191,
            "end": 9.880000114440918
        }
    ]
}

Note that the words “I”, “had”, “a”, “job”, “Mama” at the end of the words array do not exist in the original transcription.

Here is the audio file: out006.mp3 - Google Drive (I realise the quality is garbage, but I would like a guarantee that the words in the transcript text and the words array are the same).

Is there anything I can do to guard against this issue?

Failing that, is there a way to get punctuation included at the word level?
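For context, the consistency check I would want to pass is roughly the following (a minimal sketch in Python, assuming the response shape shown above where each entry carries a "word" key; the helper name is hypothetical):

import re

def words_match_transcript(transcript, words):
    # Lowercase and keep only letters, digits, and apostrophes, so
    # punctuation and capitalization differences don't cause false alarms.
    def norm(text):
        return re.findall(r"[a-z0-9']+", text.lower())
    transcript_tokens = norm(transcript)
    timestamp_tokens = [t for w in words for t in norm(w["word"])]
    return transcript_tokens == timestamp_tokens

For the response above this returns False, since "I had a job Mama" appears only in the words array.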

  1. Transcribe the audio at normal speed instead of applying a time-slicing speedup.
  2. Normalize the audio by +18 dB so it can be heard (a sketch of this follows below).
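For item 2, a rough sketch of the gain boost by calling ffmpeg from Python; the PCM WAV output and 16 kHz mono format are my assumptions for speech input, not requirements:

import subprocess

def boost_gain(src, dst, gain_db=18.0):
    # Apply a fixed gain boost and write lossless PCM WAV.
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-filter:a", f"volume={gain_db}dB",  # fixed +18 dB boost
         "-ar", "16000", "-ac", "1",          # 16 kHz mono for speech
         dst],
        check=True,
    )

boost_gain("out006.mp3", "out006_boosted.wav")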

The word granularity is likely not tuned for ten words per second.

As I said, I know the file is garbage; I tested a wide variety of files and it failed on this particular one. I included it because it usefully illustrates my point. I will use this function to process hundreds of thousands of user-submitted files, and I need a guarantee that the words array will match the returned transcript. That has not been the case in several of the files I have tested.

You cannot improve the AI model. It is what it is.

You can improve your inputs.

You can avoid sending damaged files, and offer a “service” by doing what the user could have done themselves: ffmpeg-normalize to standard levels, adaptive gain, passband limiting, etc.

If you’re going to submit sped-up audio, you can do it without passing the result through a lossy time-bin-to-frequency (FFT-based) codec such as MP3; see the sketch below.
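As a sketch of both points combined, one ffmpeg call can normalize loudness, band-limit to the speech range, apply the speedup, and write lossless PCM WAV so no lossy codec touches the processed audio. The cutoff frequencies and tempo factor here are illustrative assumptions, not tuned values:

import subprocess

def preprocess_for_transcription(src, dst, speed=1.5):
    audio_filters = ",".join([
        "highpass=f=80",    # drop rumble below the speech band
        "lowpass=f=8000",   # drop hiss above it
        "loudnorm",         # EBU R128 loudness normalization
        f"atempo={speed}",  # speedup without pitch shift
    ])
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-af", audio_filters,
         "-c:a", "pcm_s16le", dst],  # lossless PCM, no FFT codec
        check=True,
    )

preprocess_for_transcription("user_upload.mp3", "cleaned.wav")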

user submitted: [audio attachment]

remove some damage: [audio attachment]

Using the API with fairly high-quality audio, I get words in the transcript that are not in the word-level segmentation, and words in the word-level segmentation that are not in the transcript. In both cases the words are legitimately in the audio, which is a superset of both. Is this somehow a two-pass process?
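For reference, a rough way to see exactly which words differ between the two outputs (a sketch using Python's difflib; the helper name and normalization are my own):

import difflib
import re

def word_diff(transcript, words):
    def norm(text):
        return re.findall(r"[a-z0-9']+", text.lower())
    a = norm(transcript)
    b = [t for w in words for t in norm(w["word"])]
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    # Report only the spans where the two token sequences disagree.
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]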