Word-Level and Sentence-Level Transcript Timestamps Do Not Match

hakant · April 4, 2024, 2:01pm

Hey,

I use the code below to obtain both word and sentence-level transcripts simultaneously. However, the timestamps for some words do not align with the sentence timestamps.

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])

def transcribe(audio_path):
    # Request word- and segment-level timestamps in a single call.
    with open(audio_path, 'rb') as audio_file:
        return client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="verbose_json",
            timestamp_granularities=["word", "segment"]
        )

Sentence-Level

05 = {dict: 10} {'avg_logprob': -0.16167913377285004, 'compression_ratio': 1.665480375289917, 'end': 36.779998779296875, 'id': 5, 'no_speech_prob': 0.009557976387441158, 'seek': 2830, 'start': 32.880001068115234, 'temperature': 0.0, 'text': ' Felsefe ise akla hitap eden bir bilgidir.', 'tokens': [50564, 13298, 405, 2106, 40912, 9308, 875, 2045, 569, 47727, 1904, 8588, 70, 33031, 13, 50704]}
06 = {dict: 10} {'avg_logprob': -0.16167913377285004, 'compression_ratio': 1.665480375289917, 'end': 45.0, 'id': 6, 'no_speech_prob': 0.009557976387441158, 'seek': 2830, 'start': 36.939998626708984, 'temperature': 0.0, 'text': ' Bu yüzden Hristiyanlık, 2. yüzyılda Yunan kültürüne mensup kişilerin Hristiyanlığa girmesiyle birlikte felsefe ile tanışır.', 'tokens': [50714, 4078, 33454, 389, 12940, 4727, 282, 22359, 11, 568, 13, 288, 774, 1229, 21473, 64, 18007, 282, 24572, 2282, 1655, 21148, 10923, 1010, 28212, 5441, 259, 389, 12940, 4727, 282, 75, 7366, 64, 290, 3692, 21181, 2072, 44642, 11094, 405, 2106, 15465, 7603, 4951, 3702, 13, 51129]}

Word-Level

048 = {dict: 3} {'end': 32.880001068115234, 'start': 32.540000915527344, 'word': 'Felsefe'}
049 = {dict: 3} {'end': 33.279998779296875, 'start': 32.880001068115234, 'word': 'ise'}
050 = {dict: 3} {'end': 33.63999938964844, 'start': 33.279998779296875, 'word': 'akla'}
051 = {dict: 3} {'end': 34.040000915527344, 'start': 33.63999938964844, 'word': 'hitap'}
052 = {dict: 3} {'end': 34.2599983215332, 'start': 34.040000915527344, 'word': 'eden'}
053 = {dict: 3} {'end': 34.599998474121094, 'start': 34.2599983215332, 'word': 'bir'}
054 = {dict: 3} {'end': 35.08000183105469, 'start': 34.599998474121094, 'word': 'bilgidir'}
055 = {dict: 3} {'end': 35.52000045776367, 'start': 35.400001525878906, 'word': 'Bu'}
056 = {dict: 3} {'end': 35.84000015258789, 'start': 35.52000045776367, 'word': 'yüzden'}
057 = {dict: 3} {'end': 36.779998779296875, 'start': 35.84000015258789, 'word': 'Hristiyanlık'}
058 = {dict: 3} {'end': 37.18000030517578, 'start': 36.939998626708984, 'word': '2'}
059 = {dict: 3} {'end': 38.119998931884766, 'start': 37.34000015258789, 'word': 'yüzyılda'}
060 = {dict: 3} {'end': 38.560001373291016, 'start': 38.119998931884766, 'word': 'Yunan'}
061 = {dict: 3} {'end': 39.08000183105469, 'start': 38.560001373291016, 'word': 'kültürüne'}
062 = {dict: 3} {'end': 39.540000915527344, 'start': 39.08000183105469, 'word': 'mensup'}

The words “Bu”, “yüzden”, and “Hristiyanlık” are not part of sentence 5 (they belong to the text of sentence 6, which starts at 36.94), yet their word-level timestamps (35.40 to 36.78) fall within the start and end times of sentence 5 (32.88 to 36.78).
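To make the mismatch easier to reproduce, here is a rough sketch of the check I am doing by eye. It assumes the words and segments are plain dicts shaped like the dumps above; the helper names (words_in_span, stray_words) are my own and not part of the OpenAI library, and the timestamps are rounded copies of the values shown.

```python
import re

def words_in_span(words, start, end, tol=1e-6):
    """Return the words whose start and end both lie inside [start, end]."""
    return [w["word"] for w in words
            if w["start"] >= start - tol and w["end"] <= end + tol]

def stray_words(words, segment):
    """Words timed inside the segment that do not occur in its text."""
    # \w+ is Unicode-aware in Python 3, so Turkish letters are kept intact.
    tokens = set(re.findall(r"\w+", segment["text"]))
    inside = words_in_span(words, segment["start"], segment["end"])
    return [w for w in inside if w not in tokens]

# Rounded values copied from the sentence-level and word-level output above.
segment_5 = {"start": 32.88, "end": 36.78,
             "text": " Felsefe ise akla hitap eden bir bilgidir."}
words = [
    {"word": "Felsefe", "start": 32.54, "end": 32.88},
    {"word": "bilgidir", "start": 34.60, "end": 35.08},
    {"word": "Bu", "start": 35.40, "end": 35.52},
    {"word": "yüzden", "start": 35.52, "end": 35.84},
    {"word": "Hristiyanlık", "start": 35.84, "end": 36.78},
    {"word": "2", "start": 36.94, "end": 37.18},
]

stray = stray_words(words, segment_5)
print(stray)
```

With the values above, the three words from the next sentence are the ones reported as stray for sentence 5.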

Am I doing something wrong?
