Automatically Generating Subtitles: Is it Possible?

OpenAI recently released their own text-to-speech API, allowing you to generate a voice-over for any text you have.

This got me thinking: could I use the tts-bot to automatically generate subtitles on top of the voice-over?

It’s not entirely far-fetched, since subtitle files only consist of two things:

  1. text
  2. time at which the text is spoken

Obviously, we already have the text, since that is what we’re feeding to the tts-bot. So, the only missing component is number 2: we just need to know at what time each word is spoken in the voice-over.
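For reference, a single cue in the common SubRip (`.srt`) format combines exactly those two things — the text and the time span in which it is spoken (the text and timings below are made up for illustration):

```
1
00:00:00,000 --> 00:00:02,500
Hello, and welcome to the video.
```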

That raises the question: does OpenAI’s tts-bot expose this timing information anywhere, perhaps in the response from the tts-method? From a brief search online, I could not find anything relevant.

Nonetheless, I think it is likely this information exists somewhere in the bot’s code, and is parseable for us to use.


This certainly would be great to have.

Currently you can:

a. Send the generated audio to whisper-1 and get a transcription back, including timestamps.
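Option (a) can be sketched roughly as below, assuming the official `openai` Python package and an `OPENAI_API_KEY` in the environment; the file name `speech.mp3` is a placeholder. Requesting `verbose_json` is what makes whisper-1 return per-segment timestamps rather than plain text:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 3.5 -> '00:00:03,500'."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def transcribe_with_timestamps(path: str = "speech.mp3"):
    """Return (start, end, text) for each segment whisper-1 reports."""
    from openai import OpenAI  # assumes the official `openai` package

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(path, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
            response_format="verbose_json",  # plain "json" omits timings
        )
    return [(seg.start, seg.end, seg.text) for seg in transcript.segments]
```

Each `(start, end, text)` tuple then maps directly onto one SRT cue via `srt_timestamp`.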

b. Divide the text into sentences, generate speech for each, measure the length of each audio segment, and mark the start and end for each sentence. Concatenate the audio files and you have the full speech with subtitles.

This is a great general solution: you’ll know when each sentence starts and ends. What would be even better is knowing when each word starts and ends, so you can break up subtitles when needed (i.e. when a full sentence is too long for one subtitle line).
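Given word-level timestamps, splitting an over-long sentence into shorter cues is a greedy packing problem. A sketch, where `words` is an assumed list of `(start, end, word)` tuples and `max_chars` is the longest subtitle line you want to allow:

```python
def split_into_cues(words, max_chars=42):
    """Greedily pack (start, end, word) tuples into cues of limited width."""
    cues, current = [], []

    def flush():
        # Emit the accumulated words as one cue spanning their time range.
        cues.append((current[0][0], current[-1][1],
                     " ".join(w for _, _, w in current)))

    for start, end, word in words:
        candidate = " ".join(w for _, _, w in current) + " " + word
        if current and len(candidate.strip()) > max_chars:
            flush()
            current = []
        current.append((start, end, word))
    if current:
        flush()
    return cues
```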

In that case, access to more time data would be helpful.

It would be great to have a solution like this. Interestingly, AWS Transcribe supports generating subtitles directly.

@matthewethan did you find any OpenAI based solution yet?