Text to Speech Word Timings


I’m hoping that someone at OpenAI sees this post, as I’m very eager to get word timings as part of the response object for text to speech. Text to speech takes written text in and generates amazing speech with proper tone, speed, etc., but it only returns the audio stream. For accessibility and other learning use cases, it would be amazing if it could also return a timing table, e.g. the duration each word is spoken for, at the word level.

While there are APIs that could in theory post-process the audio file (basically turning the speech audio back into text and generating the timings along the way), this would be a lengthy, expensive, and somewhat inaccurate process. It seems that the model that generates the voice from the words already has the knowledge of how long each word should be spoken for. If not, I imagine we could easily train a model that takes the transcript and the audio file, plus some training data such as the markers between each word, to do this. Academically I feel it should be doable without actually regenerating the transcript.

I’m looking for ideas on the “BEST” way (e.g. fast, accurate, and affordable) to take the audio file and generate the timings. In the end I simply want to highlight each word in the original input text as it is being spoken, for a read-along effect. Thoughts / suggestions!? Thanks!
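For reference, the read-along part itself is simple once timings exist. Here is a minimal sketch, assuming a hypothetical timing table of (word, start, end) tuples in seconds. The text-to-speech response does not return anything like this today, and the values below are made up for illustration:

```python
from bisect import bisect_right

# Hypothetical timing table: (word, start_seconds, end_seconds).
# These numbers are invented; no API currently returns them.
timings = [
    ("In", 0.00, 0.18),
    ("the", 0.18, 0.30),
    ("end", 0.30, 0.62),
]


def word_index_at(timings, t):
    """Return the index of the word being spoken at playback time t, or None."""
    starts = [start for _, start, _ in timings]
    # Last word whose start time is <= t.
    i = bisect_right(starts, t) - 1
    if i < 0:
        return None
    _, _, end = timings[i]
    # Between words, or past the final word, nothing is highlighted.
    return i if t < end else None
```

A player would call `word_index_at(timings, audio.current_time)` on each tick and highlight that word, so the hard part really is producing the table, not using it.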


Indeed this seems like it would be awesome for accessibility, +1.

While most people with true disabilities use screen readers, there’s this in-between gray area where it would be quite nice to have word timings to show what is being spoken.

I was just analyzing this problem while implementing another feature that overlaps with it, and I realized there’s an issue with word segmentation: it’s very complex.

So I got this idea: pipes (|) are not a feature of any language that I know of, so they can act as arbitrary segmenters. Because words are broken down into phonemes, a pipe would act as a kind of phoneme that makes no sound and takes no time. Once the TTS secret sauce (which I guess is building a phoneme map) finds one of these pipes, it should emit a hidden phoneme that creates an artifact that cannot be heard but can be measured (or maybe one that can be heard but is easily removed, such as a predictable click spike that can be cancelled with a wave transform).

Since the AI is somewhat unpredictable, a hidden phoneme that represents a pipe and can be extracted from the audio could be used to build a timing map for the pipes, regardless of where they are placed. You don’t need to know how the neural net is doing its thing, as long as it is trained with such artifacts; that way consistency is assured.

Since the user is the one deciding where they want the timings, it would work with languages like Chinese, Thai, etc. The pipe becomes a sound that is easily removed with a wave transform, and we get an array with the times, in seconds, at which each pipe occurred before the transform.
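To make the extraction step concrete, here is a rough sketch of what reading the markers back could look like. It assumes a purely hypothetical TTS model that emits a loud, short click for every pipe. No current model does this, and the function names, threshold, and gap values are all made up for illustration:

```python
def marker_times(samples, sample_rate, threshold=0.9, min_gap=0.05):
    """Return the times (seconds) of click spikes in mono float samples.

    Assumes the hypothetical marker shows up as samples with
    |amplitude| >= threshold; spikes closer than min_gap are one click.
    """
    times = []
    last_spike = None
    for i, s in enumerate(samples):
        if abs(s) >= threshold:
            t = i / sample_rate
            if last_spike is None or t - last_spike >= min_gap:
                times.append(t)
            last_spike = t
    return times


def segment_timings(text, markers, duration):
    """Pair each pipe-delimited segment of text with (start, end) seconds."""
    segments = [seg.strip() for seg in text.split("|")]
    bounds = [0.0] + list(markers) + [duration]
    if len(bounds) != len(segments) + 1:
        raise ValueError("number of markers does not match number of pipes")
    return [(seg, bounds[i], bounds[i + 1]) for i, seg in enumerate(segments)]
```

With input text like `Hello|world|again`, two detected clicks would split the audio into three timed segments. Removing the clicks from the audio afterwards is a separate step that this sketch does not cover.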

But I don’t know, I’m just speculating right now and throwing out random ideas. I don’t know how the thing works.