Word level transcription data?

I’m following the documentation for the v1/audio/transcriptions endpoint. My request looks like this:

with open(audio_file_path, "rb") as audio_file:
    files = {
        "file": audio_file,
    }
    data = {
        "model": "whisper-1",
        "response_format": "verbose_json",
        "timestamp_granularities": ["word"],
    }

    response = requests.post(url, headers=headers, files=files, data=data)

But I’m only getting segment-level transcription data rather than word-level.

Is word-level timestamp_granularities still possible with whisper-1?

Sure. Not still, but rather new.

Here is some code to send your file and get timed words:

import os
import requests

# Gets the API key from environment variable
api_key = os.getenv("OPENAI_API_KEY")
headers = {"Authorization": f"Bearer {api_key}"}
url = "https://api.openai.com/v1/audio/transcriptions"

with open("audio.mp3", "rb") as audio_file:
    # Passing every field through `files` as (filename, value) tuples sends
    # them all as multipart form fields; None means "no filename".
    parameters = {
        "file": ("audio.mp3", audio_file),
        "language": (None, "en"),
        "model": (None, "whisper-1"),
        "prompt": (None, "Here is the radio show."),
        "response_format": (None, "verbose_json"),
        "temperature": (None, "0.1"),
        "timestamp_granularities[]": (None, "word"),
    }
    response = requests.post(url, headers=headers, files=parameters)

if response.status_code != 200:
    print(f"HTTP error {response.status_code}: {response.text}")
else:
    # Get the transcribed text and timed words from the response
    result = response.json()
    transcribed_text = result['text']
    words = result['words']
    # Format float timestamps to two decimal places
    formatted_words = [
        {k: f"{v:.2f}" if isinstance(v, float) else v for k, v in word.items()}
        for word in words
    ]
    # Save text or words to a file
    try:
        with open("transcript.txt", "w") as file:
            file.write(transcribed_text)
        print("Transcribed text successfully saved to 'transcript.txt'.")
    except Exception as e:
        print(f"output file error: {e}")
    print(formatted_words[:20])  # show the first 20 timed words


It’s going to keep overwriting the same file on every run unless you add more code. After that, you get to decide what to do with the output, or just enjoy the printed start:

Transcribed text successfully saved to 'transcript.txt'.
[{'word': 'This', 'start': '1.04', 'end': '1.60'}, {'word': 'is', 'start': '1.60', 'end': '1.78'}, {'word': 'a', 'start': '1.78', 'end': '1.98'}, {'word': 'radio', 'start': '1.98', 'end': '2.38'}, {'word': 'show', 'start': '2.38', 'end': '2.60'}, {'word': 'where', 'start': '2.60', 'end': '2.86'}, {'word': 'people', 'start': '2.86', 'end': '3.14'}, {'word': 'call', 'start': '3.14', 'end': '3.44'}, {'word': 'us', 'start': '3.44', 'end': '3.64'}, {'word': 'and', 'start': '3.64', 'end': '3.82'}, {'word': 'ask', 'start': '3.82', 'end': '3.98'}, {'word': 'us', 'start': '3.98', 'end': '4.24'}, {'word': 'questions', 'start': '4.24', 'end': '4.52'}, {'word': 'about', 'start': '4.52', 'end': '4.82'}, {'word': 'cars', 'start': '4.82', 'end': '5.14'}, {'word': 'right', 'start': '5.14', 'end': '5.44'}, {'word': 'And', 'start': '5.56', 'end': '5.96'}, {'word': 'what', 'start': '5.96', 'end': '6.16'}, {'word': 'were', 'start': '6.16', 'end': '6.30'}, {'word': 'we', 'start': '6.30', 'end': '6.48'}]
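
As one illustration of "deciding what to do with the output", here is a minimal sketch that groups timed words like those above into SRT-style subtitle cues and derives the output file name from the audio file name, so each input gets its own file instead of overwriting a shared one. The helper names (to_srt_time, words_to_srt, srt_path_for) and the fixed words-per-cue grouping are assumptions made for this example, not part of the API or of the code above.

```python
from pathlib import Path


def to_srt_time(seconds):
    """Format seconds (float or numeric string) as HH:MM:SS,mmm."""
    millis = int(round(float(seconds) * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words, max_words=7):
    """Group timed words into numbered cues of at most max_words words."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        text = " ".join(w["word"] for w in chunk)
        start = to_srt_time(chunk[0]["start"])
        end = to_srt_time(chunk[-1]["end"])
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


def srt_path_for(audio_path):
    """Hypothetical naming rule: audio.mp3 -> audio.srt, one file per input."""
    return Path(audio_path).with_suffix(".srt")


# First few words from the printed output above
sample = [
    {"word": "This", "start": "1.04", "end": "1.60"},
    {"word": "is", "start": "1.60", "end": "1.78"},
    {"word": "a", "start": "1.78", "end": "1.98"},
    {"word": "radio", "start": "1.98", "end": "2.38"},
    {"word": "show", "start": "2.38", "end": "2.60"},
]
srt_text = words_to_srt(sample, max_words=3)
print(srt_text)
# To save: srt_path_for("audio.mp3").write_text(srt_text, encoding="utf-8")
```

Grouping by a fixed word count is the simplest choice; splitting on pauses between words would probably read better but needs more tuning.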


Turns out I was missing the brackets on the timestamp_granularities parameter: the form field has to be named timestamp_granularities[].

Would be nice if their documentation showed that!
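
For anyone hitting the same thing, a rough way to see the difference without calling the API is to build the request with requests and inspect the multipart body it would send. This is only a sketch: the helper name multipart_field_names and the example.invalid URL are placeholders invented here, and the fake file bytes stand in for real audio. It shows that a list value under the key timestamp_granularities is sent with exactly that field name, so the brackets have to be written into the key itself.

```python
import requests


def multipart_field_names(data):
    """Return the form-field names requests would send for `data`."""
    req = requests.Request(
        "POST",
        "https://example.invalid/upload",  # placeholder URL, never contacted
        files={"file": ("audio.mp3", b"fake audio bytes")},
        data=data,
    )
    body = req.prepare().body  # multipart body as bytes; nothing is sent
    names = []
    for line in body.split(b"\r\n"):
        if line.lower().startswith(b"content-disposition:"):
            # e.g. Content-Disposition: form-data; name="model"
            name = line.split(b'name="', 1)[1].split(b'"', 1)[0]
            names.append(name.decode())
    return names


without_brackets = multipart_field_names(
    {"model": "whisper-1", "timestamp_granularities": ["word"]}
)
with_brackets = multipart_field_names(
    {"model": "whisper-1", "timestamp_granularities[]": ["word"]}
)
print(without_brackets)
print(with_brackets)
```

Only the second call produces a field literally named timestamp_granularities[], which matches the key used in the working code earlier in the thread.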

The example request in Python in the documentation shows it this way:

from openai import OpenAI
client = OpenAI()

audio_file = open("speech.mp3", "rb")
transcript = client.audio.transcriptions.create(