Whisper API: a) Timecodes; b) how good is open-source vs API?

I am using the Whisper API to transcribe some video lectures and conferences. It is working very well; even for smaller languages it performs at a much better level than anything I have seen before.

I have two questions though:
1.) Is it possible to get timecodes via the API? I can’t see it in the docs, so I am guessing the answer is ‘no’, but since the open-source (self-installed) version of Whisper offers this, it is strange that it would not be an option for the paid API as well.

2.) Did anyone compare the open-source version, running on a server, with the API version? How do they compare? Is the API version built on a later model and therefore more precise? I have only used the “small” model installed on my server.


Sorry, this isn’t really an answer, but I did spend a bit of time looking into question 1, and I also couldn’t find any way to get timecodes. The locally hosted version returns several pieces of data, with the text being one and the timecodes being another, and it looks like the API just returns the text portion.
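For reference, the open-source `whisper` package’s `model.transcribe()` returns a dict whose `"segments"` list carries `start`/`end` times for each chunk of text. Here is a minimal sketch of turning segments of that shape into SRT-style timecodes; the segment data below is a made-up placeholder standing in for a real transcription result:

```python
# Sketch: converting Whisper-style segments into SRT timecodes.
# The `segments` list mimics the shape of result["segments"] from
# the open-source whisper package; the values are invented examples.

def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timecode: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Render a list of {start, end, text} dicts as an SRT document."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg['start'])} --> "
            f"{srt_timestamp(seg['end'])}\n{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

segments = [
    {"start": 0.0, "end": 4.2, "text": " Welcome to the lecture."},
    {"start": 4.2, "end": 9.7, "text": " Today we cover transcription."},
]
print(segments_to_srt(segments))
```

So if you run the self-hosted version, you can get subtitle files out of it directly, whereas the API response (as far as I could see) only gives you the plain text.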

I guess we will have to wait, or use the self-hosted version.

I am now looking into possibilities for recognizing speakers, but it seems I would need to a) do that in Python first, via some library, and b) send separate audio files to the transcription based on that. Not sure if there is a more efficient way at the moment.
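A sketch of step b), assuming step a) produces diarization output as `(start, end, speaker)` tuples (the shape many diarization libraries, e.g. pyannote.audio, can be reduced to; the data here is invented): merge adjacent turns by the same speaker first, so each speaker change yields one longer clip to cut out and send for transcription rather than many tiny ones.

```python
# Sketch: merging diarization turns before splitting the audio.
# The `turns` data is a made-up placeholder for real diarizer output.

def merge_turns(turns, gap=0.5):
    """Merge consecutive turns by the same speaker when the pause
    between them is at most `gap` seconds."""
    merged = []
    for start, end, speaker in turns:
        if merged and merged[-1][2] == speaker and start - merged[-1][1] <= gap:
            merged[-1] = (merged[-1][0], end, speaker)
        else:
            merged.append((start, end, speaker))
    return merged

turns = [
    (0.0, 3.1, "SPEAKER_00"),
    (3.3, 7.8, "SPEAKER_00"),   # same speaker, short pause: merged
    (8.0, 12.4, "SPEAKER_01"),
]
clips = merge_turns(turns)
# Each merged clip can then be cut from the source audio, e.g. with
# ffmpeg: ffmpeg -ss <start> -to <end> -i input.mp3 clip.mp3
# and each clip sent to the transcription separately.
print(clips)
```

This does not solve the diarization itself, only the bookkeeping between it and the transcription step, but it should cut down the number of API calls considerably.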