Whisper API: a) Timecodes; b) how good is open-source vs API?

I am using the Whisper API to transcribe some video lectures and conferences. It is working very well; even for smaller languages, the quality is much better than anything I have seen before.

I have two questions though:
1.) Is it possible to get timecodes via the API? I can’t see it in the docs, so I am guessing the answer is ‘no’, but since Whisper offers this in the open-source (installed) version, it is strange that it is not an option for the paid API as well.

2.) Has anyone compared the open-source version, running on a server, with the API version? How do they compare: is the API version built on a later model and therefore more accurate? I have only used the “small” model installed on my server.


Sorry, this isn’t really an answer, but I did spend a bit of time looking into 1 and I also couldn’t find any way to add timecodes. The locally hosted version returns several pieces, with the text being one and the timecodes being another, and it looks like the API just returns the text portion.

I guess we will have to wait or use the self-hosted version.

I am now looking into ways of recognizing speakers, but it seems I would need to a) do that in Python first, via some library, and b) send separate audio files to the transcription based on that. Not sure if there is a more efficient way at the moment.

You can specify the format you want returned as a form field in the request body. If you set it to “srt”, you get the timestamps back as well.

e.g. via curl you add the following parameter --form response_format=srt
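
Once the API returns SRT text, the timecodes can be pulled out with a few lines of Python. This is only a minimal sketch, assuming the usual SRT layout (an index line, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line, then the text). The `parse_srt` helper and the sample string are made up for illustration and are not part of any library.

```python
import re

# Matches an SRT timing line such as "00:00:01,500 --> 00:00:04,000".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def _seconds(h, m, s, ms):
    """Convert hour/minute/second/millisecond strings to seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(srt_text):
    """Return a list of (start_seconds, end_seconds, text) tuples."""
    cues = []
    # Cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                g = match.groups()
                # Everything after the timing line is the cue text.
                text = " ".join(part.strip() for part in lines[i + 1:])
                cues.append((_seconds(*g[:4]), _seconds(*g[4:]), text))
                break
    return cues


# Hand-written sample, not real API output.
sample = """1
00:00:00,000 --> 00:00:02,500
Welcome to the lecture.

2
00:00:02,500 --> 00:00:05,000
Today we talk about speech recognition.
"""

cues = parse_srt(sample)
for start, end, text in cues:
    print(f"{start:7.3f} -> {end:7.3f}  {text}")
```

The same approach should work for vtt output, since the timing lines differ mainly in using a dot instead of a comma before the milliseconds.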

Revised (see below for how to get timecodes from the API): the advantage of using the API over self-hosting is performance and model size. If you want to run a very capable model from a low-power device, offloading the computation to a remote, enterprise-class compute facility is an attractive option. If you have sufficiently powerful local compute, local use is fine.

  1. Add the form field response_format=x, where x is srt or verbose_json.
  2. IDK

You can also try Deepgram; they offer Whisper models at a lower price, and they have diarization and timecodes.

The Whisper API does return timestamps, as the previous posters mentioned. Just set the “response_format” parameter to srt or vtt.

If you are using the OpenAI Node.js library:

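const fs = require("fs");
const { Configuration, OpenAIApi } = require("openai");

// Assumes the API key is stored in the OPENAI_API_KEY environment variable.
const openai = new OpenAIApi(new Configuration({ apiKey: process.env.OPENAI_API_KEY }));

const resp = await openai.createTranscription(
    fs.createReadStream("audio.mp3"), // file
    "whisper-1", // model
    "", // prompt
    "vtt", // response_format
    0.1, // temperature
    "en" // language
);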

Here’s the Python version to specify response format:

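import openai

# Open the audio in binary mode; the context manager closes it afterwards.
with open("/path/to/file.mp3", "rb") as audio_file:
    # response_format="vtt" returns WebVTT text, which includes timestamps.
    transcript = openai.Audio.transcribe(
        "whisper-1", audio_file, response_format="vtt"
    )

print(transcript)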

I’ve got it, thanks everyone. I think this could be used for speaker recognition as well.
a) Get timecodes from the transcription.
b) In parallel with sending the audio for transcription, run speaker recognition with other tools.
c) Compare the transcription timecodes with the speaker recognition timecodes.
d) Insert the speaker information into the final transcription file.
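
Steps c) and d) could be sketched roughly as below: each transcribed segment gets the speaker whose turn overlaps it the most. This is only one possible matching rule, not something the API or any diarization tool prescribes. The function names, the `SPEAKER_A`/`SPEAKER_B` labels, and the sample timings are all invented for illustration; in practice the speaker turns would come from whatever speaker recognition tool you use.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the overlap between two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(segments, speaker_turns):
    """Attach a speaker to each (start, end, text) transcription segment.

    speaker_turns is a list of (start, end, speaker) tuples. A segment
    that overlaps no turn at all is labeled "unknown".
    """
    labeled = []
    for start, end, text in segments:
        best_speaker, best_overlap = "unknown", 0.0
        for turn_start, turn_end, speaker in speaker_turns:
            shared = overlap(start, end, turn_start, turn_end)
            if shared > best_overlap:
                best_speaker, best_overlap = speaker, shared
        labeled.append((start, end, best_speaker, text))
    return labeled


# Hypothetical transcription segments (e.g. parsed from srt output).
segments = [
    (0.0, 2.5, "Welcome to the lecture."),
    (2.5, 5.0, "Thanks for having me."),
    (5.0, 8.0, "Let's begin."),
]

# Hypothetical speaker turns from a separate speaker recognition step.
speaker_turns = [
    (0.0, 2.6, "SPEAKER_A"),
    (2.6, 5.1, "SPEAKER_B"),
    (5.1, 9.0, "SPEAKER_A"),
]

labeled = label_segments(segments, speaker_turns)
for start, end, speaker, text in labeled:
    print(f"[{start:.1f}-{end:.1f}] {speaker}: {text}")
```

Picking the largest overlap tolerates small disagreements between the two sets of timecodes, but a segment that spans a speaker change will still get a single label.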

I also noticed there is a verbose_json option, which sends back much more detail: start and end timings for each segment, along with the token IDs behind that segment. It would probably work even better for this case, but I guess it would be much more difficult to implement.
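
Reading the timings out of a verbose_json response may be less work than it sounds. Here is a rough sketch, assuming the response has a `segments` list whose entries carry `start`, `end`, and `text` fields. The sample below is hand-written and trimmed (real responses carry more fields per segment), and both helper functions are made up for illustration.

```python
import json


def format_timecode(seconds):
    """Format a number of seconds as HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def segment_timings(response):
    """Return (start, end, text) for every segment in the response dict."""
    return [
        (seg["start"], seg["end"], seg["text"].strip())
        for seg in response.get("segments", [])
    ]


# Hand-written, trimmed sample; not real API output.
sample_response = json.loads("""
{
  "text": "Welcome to the lecture. Today we talk about speech recognition.",
  "segments": [
    {"id": 0, "start": 0.0, "end": 2.5, "text": " Welcome to the lecture."},
    {"id": 1, "start": 2.5, "end": 5.0, "text": " Today we talk about speech recognition."}
  ]
}
""")

timings = segment_timings(sample_response)
for start, end, text in timings:
    print(f"{format_timecode(start)} --> {format_timecode(end)}  {text}")
```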