Hey,
I use the code below to obtain word-level and sentence-level (segment) transcripts in a single request. However, the timestamps for some words do not align with the sentence timestamps.
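import os
from openai import OpenAI

client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])

# The original wrapper function was not shown; the name "transcribe" is illustrative.
def transcribe(audio_path):
    # Request word- and segment-level timestamps in the same call.
    with open(audio_path, 'rb') as audio_file:
        return client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="verbose_json",
            timestamp_granularities=["word", "segment"],
        )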
Sentence-Level
05 = {dict: 10} {'avg_logprob': -0.16167913377285004, 'compression_ratio': 1.665480375289917, 'end': 36.779998779296875, 'id': 5, 'no_speech_prob': 0.009557976387441158, 'seek': 2830, 'start': 32.880001068115234, 'temperature': 0.0, 'text': ' Felsefe ise akla hitap eden bir bilgidir.', 'tokens': [50564, 13298, 405, 2106, 40912, 9308, 875, 2045, 569, 47727, 1904, 8588, 70, 33031, 13, 50704]}
06 = {dict: 10} {'avg_logprob': -0.16167913377285004, 'compression_ratio': 1.665480375289917, 'end': 45.0, 'id': 6, 'no_speech_prob': 0.009557976387441158, 'seek': 2830, 'start': 36.939998626708984, 'temperature': 0.0, 'text': ' Bu yüzden Hristiyanlık, 2. yüzyılda Yunan kültürüne mensup kişilerin Hristiyanlığa girmesiyle birlikte felsefe ile tanışır.', 'tokens': [50714, 4078, 33454, 389, 12940, 4727, 282, 22359, 11, 568, 13, 288, 774, 1229, 21473, 64, 18007, 282, 24572, 2282, 1655, 21148, 10923, 1010, 28212, 5441, 259, 389, 12940, 4727, 282, 75, 7366, 64, 290, 3692, 21181, 2072, 44642, 11094, 405, 2106, 15465, 7603, 4951, 3702, 13, 51129]}
Word-Level
048 = {dict: 3} {'end': 32.880001068115234, 'start': 32.540000915527344, 'word': 'Felsefe'}
049 = {dict: 3} {'end': 33.279998779296875, 'start': 32.880001068115234, 'word': 'ise'}
050 = {dict: 3} {'end': 33.63999938964844, 'start': 33.279998779296875, 'word': 'akla'}
051 = {dict: 3} {'end': 34.040000915527344, 'start': 33.63999938964844, 'word': 'hitap'}
052 = {dict: 3} {'end': 34.2599983215332, 'start': 34.040000915527344, 'word': 'eden'}
053 = {dict: 3} {'end': 34.599998474121094, 'start': 34.2599983215332, 'word': 'bir'}
054 = {dict: 3} {'end': 35.08000183105469, 'start': 34.599998474121094, 'word': 'bilgidir'}
055 = {dict: 3} {'end': 35.52000045776367, 'start': 35.400001525878906, 'word': 'Bu'}
056 = {dict: 3} {'end': 35.84000015258789, 'start': 35.52000045776367, 'word': 'yüzden'}
057 = {dict: 3} {'end': 36.779998779296875, 'start': 35.84000015258789, 'word': 'Hristiyanlık'}
058 = {dict: 3} {'end': 37.18000030517578, 'start': 36.939998626708984, 'word': '2'}
059 = {dict: 3} {'end': 38.119998931884766, 'start': 37.34000015258789, 'word': 'yüzyılda'}
060 = {dict: 3} {'end': 38.560001373291016, 'start': 38.119998931884766, 'word': 'Yunan'}
061 = {dict: 3} {'end': 39.08000183105469, 'start': 38.560001373291016, 'word': 'kültürüne'}
062 = {dict: 3} {'end': 39.540000915527344, 'start': 39.08000183105469, 'word': 'mensup'}
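The words “Bu”, “yüzden”, and “Hristiyanlık” are not part of the first sentence (id 5); they belong to the second sentence (id 6), which starts at 36.94. Yet their word-level timestamps (35.40 to 36.78) fall within the start and end times of the first sentence (32.88 to 36.78). Likewise, “Felsefe”, the first word of sentence 5, ends at 32.88, which is exactly where that sentence starts.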
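Here is a minimal sketch of how the mismatch can be checked. It assumes the segments and words have been converted to plain dicts like the ones in the debugger output above, with the values rounded. The helper name `words_in_segment` is hypothetical and not part of the OpenAI SDK.

```python
# Subset of the values shown above, rounded for readability.
segment = {"id": 5, "start": 32.88, "end": 36.78,
           "text": " Felsefe ise akla hitap eden bir bilgidir."}
words = [
    {"word": "Felsefe", "start": 32.54, "end": 32.88},
    {"word": "bilgidir", "start": 34.60, "end": 35.08},
    {"word": "Bu", "start": 35.40, "end": 35.52},
    {"word": "yüzden", "start": 35.52, "end": 35.84},
    {"word": "Hristiyanlık", "start": 35.84, "end": 36.78},
    {"word": "2", "start": 36.94, "end": 37.18},
]

def words_in_segment(segment, words):
    """Hypothetical helper: words whose time span lies entirely inside the segment."""
    return [w["word"] for w in words
            if w["start"] >= segment["start"] and w["end"] <= segment["end"]]

inside = words_in_segment(segment, words)

# Words that fall inside the segment by time but do not appear in its text.
text_words = set(segment["text"].strip(" .").split())
stray = [w for w in inside if w not in text_words]
print(stray)
```

With the values above, `stray` contains exactly the three words from the next sentence, and “Felsefe” is excluded from `inside` because it starts before the segment does.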
Am I doing something wrong?