Discrepancy between segment-level and word-level timestamps with the Whisper API

Hello!
I have finally been able to get timestamps working with a Python-based API call. However, I have noticed that the word-level timing differs from the segment-level timing.

Segment-level Details:
Time: 0.000s - 1.600s, Text: Look, you’ll love the commute.
Time: 1.600s - 4.320s, Text: The position comes with that house for you and your wife,
Time: 4.320s - 5.480s, Text: and your, is it two?
Time: 5.480s - 6.400s, Text: Children.
Time: 6.400s - 7.760s, Text: Yes, two.
Time: 7.760s - 10.440s, Text: I’m a great admirer of your work.

Word-level Details:
Time: 0.000s - 0.120s, Word: Look
Time: 0.120s - 0.340s, Word: you’ll
Time: 0.340s - 0.440s, Word: love
Time: 0.440s - 0.640s, Word: the
Time: 0.640s - 0.920s, Word: commute
Time: 1.560s - 1.780s, Word: The
Time: 1.780s - 2.020s, Word: position
Time: 2.020s - 2.240s, Word: comes
Time: 2.240s - 2.380s, Word: with
Time: 2.380s - 2.780s, Word: that
Time: 2.780s - 2.780s, Word: house
Time: 2.780s - 2.980s, Word: for
Time: 2.980s - 3.140s, Word: you
Time: 3.140s - 3.260s, Word: and
Time: 3.260s - 3.540s, Word: your
Time: 3.540s - 3.700s, Word: wife
Time: 4.200s - 4.500s, Word: and
Time: 4.500s - 4.920s, Word: your
Time: 5.320s - 5.320s, Word: is
Time: 5.320s - 5.320s, Word: it
Time: 5.320s - 5.320s, Word: two
Time: 5.540s - 5.840s, Word: Children
Time: 6.400s - 6.700s, Word: Yes
Time: 6.700s - 6.980s, Word: two
Time: 7.640s - 7.960s, Word: I’m
Time: 7.960s - 8.220s, Word: a
Time: 8.220s - 8.500s, Word: great
Time: 8.500s - 8.980s, Word: admirer
Time: 8.980s - 9.160s, Word: of
Time: 9.160s - 9.320s, Word: your
Time: 9.320s - 9.640s, Word: work

Is this a bug or just to be expected?

from openai import OpenAI

def transcribe_audio(file_path, granularity):
    """Transcribe an audio file and return the verbose_json response, or None on error."""
    client = OpenAI(api_key='')
    with open(file_path, 'rb') as audio_file:
        try:
            transcript = client.audio.transcriptions.create(
                file=audio_file,
                model="whisper-1",
                response_format="verbose_json",
                timestamp_granularities=granularity,  # A list such as ['segment', 'word']
            )
            return transcript
        except Exception as e:
            print("An error occurred:", e)
            return None

def format_transcription(transcription, granularity):
    """Format the segment-level and word-level timestamps as readable text."""
    if not transcription:
        return "Failed to transcribe audio."

    formatted_text = f"Transcription Details:\nLanguage: {transcription.language}\nDuration: {transcription.duration}s\n\n"

    # Note: entries are indexed like dicts below. Depending on the SDK version they
    # may instead be objects with attributes, in which case the fallback message is used.

    # Handle segment granularity
    if 'segment' in granularity:
        formatted_text += "Segment-level Details:\n"
        try:
            for segment in transcription.segments:
                start_time = format(segment['start'], ".3f")
                end_time = format(segment['end'], ".3f")
                formatted_text += f"Time: {start_time}s - {end_time}s, Text: {segment['text']}\n"
        except (AttributeError, KeyError, TypeError):
            # Missing attribute, missing key, or data that is None / not dict-like
            formatted_text += "No segment data available.\n"

    # Handle word granularity
    if 'word' in granularity:
        formatted_text += "\nWord-level Details:\n"
        try:
            for word in transcription.words:
                start_time = format(word['start'], ".3f")
                end_time = format(word['end'], ".3f")
                formatted_text += f"Time: {start_time}s - {end_time}s, Word: {word['word']}\n"
        except (AttributeError, KeyError, TypeError):
            # Missing attribute, missing key, or data that is None / not dict-like
            formatted_text += "No word data available.\n"

    return formatted_text
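
To put numbers on the mismatch, here is a minimal sketch that lines up each segment with its words and reports how far the word boundaries sit from the segment boundaries. The helper name `compare_boundaries` and the tuple layout are my own, not part of the API. It also assumes that every whitespace-separated token in a segment's text matches exactly one entry in the word list, which holds for the sample above but may not hold in general. Only the first two segments from the output above are included as sample data.

```python
def compare_boundaries(segments, words):
    """Compare segment boundaries with the boundaries of the words they contain.

    segments: list of (start, end, text) tuples
    words:    list of (start, end, word) tuples, in spoken order

    Returns a list of (text, start_offset, end_gap) where
      start_offset = first word start - segment start (negative: word starts earlier)
      end_gap      = segment end - last word end      (positive: segment runs longer)
    """
    rows = []
    cursor = 0
    for seg_start, seg_end, text in segments:
        # Assumption: one word entry per whitespace-separated token in the segment text
        count = len(text.split())
        chunk = words[cursor:cursor + count]
        cursor += count
        if not chunk:
            continue
        first_start = chunk[0][0]
        last_end = chunk[-1][1]
        rows.append((text, round(first_start - seg_start, 3), round(seg_end - last_end, 3)))
    return rows


# Sample data copied from the first two segments of the output above
SEGMENTS = [
    (0.000, 1.600, "Look, you'll love the commute."),
    (1.600, 4.320, "The position comes with that house for you and your wife,"),
]
WORDS = [
    (0.000, 0.120, "Look"), (0.120, 0.340, "you'll"), (0.340, 0.440, "love"),
    (0.440, 0.640, "the"), (0.640, 0.920, "commute"),
    (1.560, 1.780, "The"), (1.780, 2.020, "position"), (2.020, 2.240, "comes"),
    (2.240, 2.380, "with"), (2.380, 2.780, "that"), (2.780, 2.780, "house"),
    (2.780, 2.980, "for"), (2.980, 3.140, "you"), (3.140, 3.260, "and"),
    (3.260, 3.540, "your"), (3.540, 3.700, "wife"),
]

rows = compare_boundaries(SEGMENTS, WORDS)
for text, start_offset, end_gap in rows:
    print(f"start offset {start_offset:+.3f}s, end gap {end_gap:+.3f}s: {text}")
```

In this sample, the segment end times run well past the end of the last word, and the second segment's first word starts slightly before the segment itself does, which is the discrepancy I am asking about.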