When using timestamp_granularities="word", I get the following transcript:
I went to school when I was in a test read, you know. My mother and father left me standing on a street corner. It's always up to me to make it.
and the following word object:
{
  "duration": 9.989999771118164,
  "segments": [
    {
      "word": "I",
      "start": 1.2000000476837158,
      "end": 1.4800000190734863
    },
    {
      "word": "went",
      "start": 1.4800000190734863,
      "end": 1.6799999475479126
    },
    {
      "word": "to",
      "start": 1.6799999475479126,
      "end": 1.7999999523162842
    },
    {
      "word": "school",
      "start": 1.7999999523162842,
      "end": 1.8600000143051147
    },
    {
      "word": "when",
      "start": 1.8600000143051147,
      "end": 2.0199999809265137
    },
    {
      "word": "I",
      "start": 2.0199999809265137,
      "end": 2.0199999809265137
    },
    {
      "word": "was",
      "start": 2.0199999809265137,
      "end": 2.0799999237060547
    },
    {
      "word": "in",
      "start": 2.0799999237060547,
      "end": 2.180000066757202
    },
    {
      "word": "a",
      "start": 2.180000066757202,
      "end": 2.240000009536743
    },
    {
      "word": "test",
      "start": 2.240000009536743,
      "end": 2.3399999141693115
    },
    {
      "word": "read",
      "start": 2.3399999141693115,
      "end": 2.4600000381469727
    },
    {
      "word": "you",
      "start": 2.4600000381469727,
      "end": 2.8399999141693115
    },
    {
      "word": "know",
      "start": 2.8399999141693115,
      "end": 2.8399999141693115
    },
    {
      "word": "My",
      "start": 3.680000066757202,
      "end": 4.539999961853027
    },
    {
      "word": "mother",
      "start": 4.539999961853027,
      "end": 4.619999885559082
    },
    {
      "word": "and",
      "start": 4.619999885559082,
      "end": 4.860000133514404
    },
    {
      "word": "father",
      "start": 4.860000133514404,
      "end": 4.860000133514404
    },
    {
      "word": "left",
      "start": 4.860000133514404,
      "end": 4.920000076293945
    },
    {
      "word": "me",
      "start": 4.920000076293945,
      "end": 5.099999904632568
    },
    {
      "word": "standing",
      "start": 5.099999904632568,
      "end": 5.159999847412109
    },
    {
      "word": "on",
      "start": 5.159999847412109,
      "end": 5.300000190734863
    },
    {
      "word": "a",
      "start": 5.300000190734863,
      "end": 5.300000190734863
    },
    {
      "word": "street",
      "start": 5.300000190734863,
      "end": 5.440000057220459
    },
    {
      "word": "corner",
      "start": 5.440000057220459,
      "end": 5.599999904632568
    },
    {
      "word": "It's",
      "start": 6.519999980926514,
      "end": 7.079999923706055
    },
    {
      "word": "always",
      "start": 7.079999923706055,
      "end": 7.21999979019165
    },
    {
      "word": "up",
      "start": 7.21999979019165,
      "end": 7.320000171661377
    },
    {
      "word": "to",
      "start": 7.320000171661377,
      "end": 7.360000133514404
    },
    {
      "word": "me",
      "start": 7.360000133514404,
      "end": 7.420000076293945
    },
    {
      "word": "to",
      "start": 7.420000076293945,
      "end": 7.559999942779541
    },
    {
      "word": "make",
      "start": 7.559999942779541,
      "end": 7.579999923706055
    },
    {
      "word": "it",
      "start": 7.579999923706055,
      "end": 7.760000228881836
    },
    {
      "word": "I",
      "start": 8.5,
      "end": 9.300000190734863
    },
    {
      "word": "had",
      "start": 9.300000190734863,
      "end": 9.380000114440918
    },
    {
      "word": "a",
      "start": 9.380000114440918,
      "end": 9.520000457763672
    },
    {
      "word": "job",
      "start": 9.520000457763672,
      "end": 9.680000305175781
    },
    {
      "word": "Mama",
      "start": 9.779999732971191,
      "end": 9.880000114440918
    }
  ]
}
Note that the last five entries in the word list (the "segments" array above), “I”, “had”, “a”, “job” and “Mama”, do not appear anywhere in the transcript text.
Here is the audio file: out006.mp3 - Google Drive (I realise the quality is garbage, but I would like a guarantee that the transcript text and the word list contain the same words).
Is there anything I can do to guard against this issue?
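
One client-side guard I could imagine, as a rough sketch only: compare the word list against the transcript text and set aside any entries that do not line up. The helper names `normalize` and `split_words` are made up for illustration, and the sketch assumes the transcript string and the word list have already been pulled out of the response. The sample data is the last sentence of the transcript plus the five unexpected entries from the object above.

```python
import re


def normalize(token: str) -> str:
    """Lowercase a token and drop punctuation, keeping apostrophes."""
    token = token.lower().replace("\u2019", "'")  # treat curly apostrophes as straight
    return re.sub(r"[^\w']", "", token)


def split_words(text: str, words: list[dict]) -> tuple[list[dict], list[dict]]:
    """Walk the word list in order and split it into entries that match the
    transcript text and entries that do not (hypothetical helper)."""
    expected = [t for t in (normalize(p) for p in text.split()) if t]
    matched, extra = [], []
    i = 0  # position of the next transcript token still waiting for a match
    for entry in words:
        if i < len(expected) and normalize(entry["word"]) == expected[i]:
            matched.append(entry)
            i += 1
        else:
            extra.append(entry)
    return matched, extra


text = "It's always up to me to make it."
words = [
    {"word": "It's", "start": 6.52, "end": 7.08},
    {"word": "always", "start": 7.08, "end": 7.22},
    {"word": "up", "start": 7.22, "end": 7.32},
    {"word": "to", "start": 7.32, "end": 7.36},
    {"word": "me", "start": 7.36, "end": 7.42},
    {"word": "to", "start": 7.42, "end": 7.56},
    {"word": "make", "start": 7.56, "end": 7.58},
    {"word": "it", "start": 7.58, "end": 7.76},
    {"word": "I", "start": 8.5, "end": 9.3},
    {"word": "had", "start": 9.3, "end": 9.38},
    {"word": "a", "start": 9.38, "end": 9.52},
    {"word": "job", "start": 9.52, "end": 9.68},
    {"word": "Mama", "start": 9.78, "end": 9.88},
]

matched, extra = split_words(text, words)
print([w["word"] for w in extra])  # ['I', 'had', 'a', 'job', 'Mama']
```

This is a greedy, in-order comparison, so it handles trailing extras like the ones above but could misclassify words if the two outputs diverge in the middle of the audio. It only detects and filters the mismatch on my side; it does not make the two outputs agree.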