Inconsistency in segment-level timestamps

I have observed inconsistencies in the segment-level JSON returned by the Whisper API. Sometimes segment timestamps overlap, and sometimes segments are empty even though the corresponding words are present in the word-level timestamp list.

Here is an example:

{
      "id": 278,
      "seek": 65524,
      "start": 677.5999755859375,
      "end": 681.4400024414062,
      "text": " that you also have to charge every single day.",
      "tokens": [
        51284,
        300,
        291,
        611,
        362,
        281,
        4602,
        633,
        2167,
        786,
        13,
        51448
      ],
      "temperature": 0.0,
      "avg_logprob": -0.2284221202135086,
      "compression_ratio": 1.6830189228057861,
      "no_speech_prob": 0.0001420177286490798
    },
    {
      "id": 279,
      "seek": 65524,
      "start": 681.4400024414062,
      "end": 684.1599731445312,
      "text": " But with this, it's actually more than that.",
      "tokens": [
        51448,
        583,
        365,
        341,
        11,
        309,
        311,
        767,
        544,
        813,
        300,
        13,
        51540
      ],
      "temperature": 0.0,
      "avg_logprob": -0.2284221202135086,
      "compression_ratio": 1.6830189228057861,
      "no_speech_prob": 0.0001420177286490798
    },
    {
      "id": 280,
      "seek": 65524,
      "start": 678.760009765625,
      "end": 681.5599975585938,
      "text": " You have to constantly babysit the battery",
      "tokens": [
        51540,
        509,
        362,
        281,
        6460,
        39764,
        270,
        264,
        5809,
        51680
      ],
      "temperature": 0.0,
      "avg_logprob": -0.2284221202135086,
      "compression_ratio": 1.6830189228057861,
      "no_speech_prob": 0.0001420177286490798
    },
    {
      "id": 281,
      "seek": 65524,
      "start": 681.5599975585938,
      "end": 684.2000122070312,
      "text": " and swap out boosters and charge this thing",
      "tokens": [
        51680,
        293,
        18135,
        484,
        748,
        40427,
        293,
        4602,
        341,
        551,
        51812
      ],
      "temperature": 0.0,
      "avg_logprob": -0.2284221202135086,
      "compression_ratio": 1.6830189228057861,
      "no_speech_prob": 0.0001420177286490798
    }

As you can see, the segment timestamps overlap: (677.60 - 681.44), (681.44 - 684.16), (678.76 - 681.56), (681.56 - 684.20). Segment 280 starts at 678.76, well before segment 279 ends at 684.16.
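
For what it's worth, here is a minimal sketch of the kind of check that surfaces this, assuming segments is the parsed list of segment dicts from the verbose_json response:

def find_overlaps(segments):
    # Flag any segment that starts before the previous one has ended.
    overlaps = []
    for prev, cur in zip(segments, segments[1:]):
        if cur["start"] < prev["end"]:
            overlaps.append((prev["id"], cur["id"], prev["end"] - cur["start"]))
    return overlaps

# For the four segments above this reports a single overlap of about 5.4 s
# between segments 279 and 280.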

Here is another example covering the same part of the audio:

Segments:

{
            "id": 276,
            "seek": 65364,
            "start": 677.6199951171875,
            "end": 681.4400024414062,
            "text": " that you also have to charge every single day.",
            "tokens": [
                51364,
                300,
                291,
                611,
                362,
                281,
                4602,
                633,
                2167,
                786,
                13,
                51528
            ],
            "temperature": 0.0,
            "avg_logprob": -0.22050553560256958,
            "compression_ratio": 1.6463878154754639,
            "no_speech_prob": 0.0006070720846764743
        },
        {
            "id": 277,
            "seek": 65364,
            "start": 681.4400024414062,
            "end": 681.4400024414062,
            "text": "",
            "tokens": [],
            "temperature": 0.0,
            "avg_logprob": -0.22050553560256958,
            "compression_ratio": 1.6463878154754639,
            "no_speech_prob": 0.0006070720846764743
        },
        {
            "id": 278,
            "seek": 65364,
            "start": 678.760009765625,
            "end": 681.5599975585938,
            "text": " You have to constantly babysit the battery",
            "tokens": [
                51620,
                509,
                362,
                281,
                6460,
                39764,
                270,
                264,
                5809,
                51760
            ],
            "temperature": 0.0,
            "avg_logprob": -0.22050553560256958,
            "compression_ratio": 1.6463878154754639,
            "no_speech_prob": 0.0006070720846764743
        }

Words:

{
            "word": "that",
            "start": 673.47998046875,
            "end": 673.8800048828125
        },
        {
            "word": "you",
            "start": 673.8800048828125,
            "end": 674.219970703125
        },
        {
            "word": "also",
            "start": 674.219970703125,
            "end": 674.8200073242188
        },
        {
            "word": "have",
            "start": 674.8200073242188,
            "end": 675.0999755859375
        },
        {
            "word": "to",
            "start": 675.0999755859375,
            "end": 675.739990234375
        },
        {
            "word": "charge",
            "start": 675.739990234375,
            "end": 675.739990234375
        },
        {
            "word": "every",
            "start": 675.739990234375,
            "end": 676.4000244140625
        },
        {
            "word": "single",
            "start": 676.4000244140625,
            "end": 676.6199951171875
        },
        {
            "word": "day",
            "start": 676.6199951171875,
            "end": 676.8599853515625
        },
        {
            "word": "But",
            "start": 676.8599853515625,
            "end": 677.0399780273438
        },
        {
            "word": "with",
            "start": 677.0399780273438,
            "end": 677.2000122070312
        },
        {
            "word": "this",
            "start": 677.2000122070312,
            "end": 677.4000244140625
        },
        {
            "word": "it's",
            "start": 677.4400024414062,
            "end": 677.6199951171875
        },
        {
            "word": "actually",
            "start": 677.6199951171875,
            "end": 677.9199829101562
        },
        {
            "word": "more",
            "start": 677.9199829101562,
            "end": 678.2000122070312
        },
        {
            "word": "than",
            "start": 678.2000122070312,
            "end": 678.3200073242188
        },
        {
            "word": "that",
            "start": 678.3200073242188,
            "end": 678.6199951171875
        },
        {
            "word": "You",
            "start": 678.6199951171875,
            "end": 678.9400024414062
        },
        {
            "word": "have",
            "start": 678.9400024414062,
            "end": 679.1400146484375
        },
        {
            "word": "to",
            "start": 679.1400146484375,
            "end": 679.6199951171875
        },
        {
            "word": "constantly",
            "start": 679.6199951171875,
            "end": 680.219970703125
        },
        {
            "word": "babysit",
            "start": 680.219970703125,
            "end": 680.8800048828125
        },
        {
            "word": "the",
            "start": 680.8800048828125,
            "end": 681.4400024414062
        },
        {
            "word": "and",
            "start": 681.4400024414062,
            "end": 681.8200073242188
        }

There are several bugs here (a sketch of the checks that surface them follows the list):

  • words whose timestamps are not covered by any segment
  • an empty segment (id 277: no text, no tokens, zero duration)
  • words that appear in no segment's text ("But with this, it's actually more than that." is in the word list, yet segment 277, which should carry it, is empty)
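
A rough sketch of those checks, again assuming words and segments are the parsed lists from verbose_json:

def audit(words, segments):
    issues = []
    # Empty segments: no text at all.
    for seg in segments:
        if not seg["text"].strip():
            issues.append(f"segment {seg['id']} is empty")
    # Words whose timestamps fall outside every segment's span.
    for w in words:
        covered = any(s["start"] <= w["start"] and w["end"] <= s["end"]
                      for s in segments)
        if not covered:
            issues.append(f"word {w['word']!r} ({w['start']:.2f}-{w['end']:.2f}) "
                          "lies outside every segment")
    return issues

# On the second example this flags segment 277 as empty, the words from
# "that" (673.48) through "it's" (677.62) as uncovered, and the trailing
# "and" as uncovered too.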

Here is my code:

import json

from openai import OpenAI

client = OpenAI()

with open(p, "rb") as audio_file:

    # Request both word- and segment-level timestamps in one call.
    transcript = client.audio.transcriptions.create(
        file=audio_file,
        model="whisper-1",
        response_format="verbose_json",
        timestamp_granularities=["word", "segment"]
    )

    # Keep only the two timestamp lists for inspection.
    transcript_json = {
        "words": transcript.words,
        "segments": transcript.segments
    }

    with open("test.json", "w") as outfile:
        json.dump(transcript_json, outfile, indent=2)
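
One note on the dump step, in case it matters for reproducing: depending on the openai-python version, transcript.words and transcript.segments may come back as typed pydantic objects rather than plain dicts, in which case json.dump raises a serialization error. A version-agnostic variant (just a sketch):

    def to_plain(items):
        # Convert pydantic objects to dicts; pass plain dicts through unchanged.
        return [i if isinstance(i, dict) else i.model_dump() for i in items]

    transcript_json = {
        "words": to_plain(transcript.words),
        "segments": to_plain(transcript.segments)
    }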

Is there a simple solution?
