Text to Speech Word Timings

pvanhengel · November 29, 2023, 2:06am

Hi,

I’m hoping that someone at OpenAI sees this post, as I’m very eager to get word timings as part of the response object for text to speech. The text-to-speech endpoint takes written text in and generates amazing speech with proper tone, speed, etc., but only returns the audio stream. For accessibility and other learning use cases, it would be amazing if it could also return a timing table, e.g. the duration each word is spoken for, at the word level.

While there are APIs that could in theory post-process the audio file, basically turning the speech audio back into text and generating the timings, this would be a lengthy, expensive, and somewhat inaccurate process. It seems that the model used to generate the voice from the words already knows how long each word should be spoken for. If not, I imagine we could easily train a model that takes the transcript and audio file, plus some training data (e.g. the markers between each word), to do this. Academically, I feel it should be doable without actually regenerating the transcript.

I’m looking for ideas on the “BEST” way (i.e. fast, accurate, and affordable) to take the audio file and generate the timings. In the end I simply want to highlight each word in the original input file as it is being spoken, for a read-along effect.

Thoughts / suggestions? Thanks!
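For the read-along effect itself, the lookup side is straightforward once word timings exist, wherever they come from. Below is a minimal Python sketch under that assumption. The `WordTiming` type, the `active_word_index` helper, and the sample timings are all made up for illustration and are not part of any TTS API response.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional, Sequence


class WordTiming(NamedTuple):
    """Hypothetical timing record for one word (times in seconds)."""
    word: str
    start: float
    end: float


def active_word_index(timings: Sequence[WordTiming], t: float) -> Optional[int]:
    """Return the index of the word being spoken at playback time t.

    Assumes timings are sorted by start time and do not overlap.
    Returns None when t falls in a gap between words or outside the audio.
    """
    starts = [w.start for w in timings]
    # Index of the last word whose start time is <= t.
    i = bisect_right(starts, t) - 1
    if i >= 0 and t < timings[i].end:
        return i
    return None


# Made-up timings, for illustration only.
timings = [
    WordTiming("Hello", 0.00, 0.42),
    WordTiming("world", 0.50, 0.95),
]

idx = active_word_index(timings, 0.60)
if idx is not None:
    print(f"highlight word #{idx}: {timings[idx].word}")
```

A player would call this on each time-update tick and move the highlight whenever the returned index changes.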

edward.gonzalez · January 25, 2024, 11:18am

Indeed this seems like it would be awesome for accessibility, +1.

While most people with true disabilities use screen readers, there’s this in-between gray area where it would be quite nice to have word timings to show what is being spoken.

edward.gonzalez · January 26, 2024, 9:56am

I was just analyzing this problem while implementing another feature that overlaps with it, and realized there’s an issue with word segmentation: it’s very complex.

So I got this idea: pipes (|) are not a feature of any language that I know of, so they can act as arbitrary segmenters. Because words are broken down into phonemes, a pipe would act as a kind of phoneme that makes no sound and takes no time. When the TTS secret sauce (which I guess is building a phoneme map) finds a pipe, it would emit that hidden phoneme, creating an artifact that cannot be heard but can be measured (or maybe one that can be heard but is easily removed, such as a predictable click spike that can be cancelled with a wave transform).

Since the AI is somewhat unpredictable, a hidden phoneme that represents a pipe and can be extracted from the audio could be used to build a timing map for the pipes, regardless of where they are placed. You don’t need to know how the neural net is doing its thing, as long as it is trained with such artifacts; that way consistency is assured.

Since the user is the one deciding where they want the timings, it would also work with languages like Chinese, Thai, etc. The pipe becomes a sound that is easily removed with a wave transform, and we get an array with the time in seconds at which each pipe occurred before the transform.
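To make the output side of this idea concrete, here is a rough Python sketch of how the array of pipe times could be turned back into per-segment timings. It assumes the speculative pipe-marker feature exists, which no TTS API I know of offers. The `segments_from_pipes` function and the sample numbers are purely hypothetical.

```python
from typing import List, Tuple


def segments_from_pipes(
    text: str, pipe_times: List[float], duration: float
) -> List[Tuple[str, float, float]]:
    """Pair each pipe-delimited segment with a (start, end) time in seconds.

    pipe_times is the assumed array of times at which each pipe marker
    occurred in the audio; duration is the total audio length.
    """
    segments = text.split("|")
    if len(pipe_times) != len(segments) - 1:
        raise ValueError("expected exactly one timestamp per pipe in the text")
    # Segment boundaries: audio start, every pipe marker, audio end.
    bounds = [0.0, *pipe_times, duration]
    return [(seg, bounds[i], bounds[i + 1]) for i, seg in enumerate(segments)]


# Made-up input and marker times, for illustration only.
for seg, start, end in segments_from_pipes("Hello |world", [0.5], 1.0):
    print(f"{start:.2f}-{end:.2f}s: {seg!r}")
```

Because the user decides where the pipes go, the same pairing works whether a segment is a word, a phrase, or a single character in a language without spaces.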

But I don’t know, I’m just speculating right now and throwing out random ideas; I don’t know how the thing works.

albirrkarim · May 4, 2024, 1:18am

I made a library for this: text to speech with sentence and word highlighting, using OpenAI and other TTS APIs.

It’s not limited to that specific task; it offers many features related to highlighting.

Check out my repository and try the demo.

brunoj · January 30, 2025, 7:25pm

You may want to check out the Lemonfox.ai TTS API. It supports word timings out-of-the-box and has an OpenAI-compatible API.
