I am using TTS to generate instructions for users that match visual cues. My suggestion is that this could extend OpenAiAudioSpeechResponseMetadata. For example, if my speech prompt is “Press the button now”, I’d like the call to return something like
Response.metadata.timings = [{"word": "Press", "time": 0.0}, {"word": "the", "time": 0.04}, ...]
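For concreteness, a minimal sketch of the data shape I have in mind follows. WordTiming is a hypothetical type (nothing like it exists in Spring AI today), and the onsets beyond the first two entries are made-up values standing in for the “...” above:

```java
import java.util.List;

// Sketch of the proposed data shape only; WordTiming is a hypothetical
// type, not existing Spring AI API.
public class TimingShapeDemo {

    // One entry per spoken word: the word and its onset in seconds.
    record WordTiming(String word, double timeSeconds) {}

    public static void main(String[] args) {
        // What Response.metadata.timings could carry for "Press the button now".
        // Only the first two onsets come from the example above; the last two
        // are invented for illustration.
        List<WordTiming> timings = List.of(
                new WordTiming("Press", 0.0),
                new WordTiming("the", 0.04),
                new WordTiming("button", 0.12),
                new WordTiming("now", 0.55));
        timings.forEach(t -> System.out.printf("%s starts at %.2fs%n", t.word(), t.timeSeconds()));
    }
}
```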
Alternatively, there could be some sort of unspoken markup character that triggers this behavior:
String prompt = "The @0 green line indicates money and the @1 blue line depicts happiness";
would yield
Response.metadata.timings = [0.2, 1.2]
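The timing values themselves would have to come from the TTS engine, but the client-side half of this approach is simple. Here is a rough sketch, assuming the @N syntax above were adopted: it strips the markers so they are never spoken and records where each one sat in the cleaned text, so the TTS layer could later map those positions to playback times. All names here are hypothetical, not existing Spring AI API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of client-side marker handling for the proposed @N syntax.
public class MarkerPreprocessor {

    private static final Pattern MARKER = Pattern.compile("@(\\d+)\\s*");

    // Strips @N markers from the prompt and records the character offset each
    // marker pointed at, so the TTS layer can map offsets to playback times.
    public static String strip(String prompt, List<Integer> markerOffsets) {
        Matcher m = MARKER.matcher(prompt);
        StringBuilder clean = new StringBuilder();
        int last = 0;
        while (m.find()) {
            clean.append(prompt, last, m.start());
            markerOffsets.add(clean.length()); // offset in the cleaned text
            last = m.end();
        }
        clean.append(prompt.substring(last));
        return clean.toString();
    }

    public static void main(String[] args) {
        List<Integer> offsets = new ArrayList<>();
        String clean = strip(
                "The @0 green line indicates money and the @1 blue line depicts happiness",
                offsets);
        System.out.println(clean);   // markers removed, so nothing odd is spoken
        System.out.println(offsets); // character offsets where @0 and @1 sat
    }
}
```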
I suspect the second approach may be preferable, but I’m open to either. Personally, I don’t mind having to tell the TTS engine what I mean with certain characters: I already have to write “microgram” because “µg” isn’t pronounced the way it reads, so writing “at” when I mean the spoken word, because “@” is reserved as an unspoken marker, is no great hindrance.
Several use cases:
- Psychology research (response latency)
- Explaining a complicated user interface
- Gaming
The Spring AI project already defines OpenAiAudioSpeechResponseMetadata; I suggest the timing data could be bolted onto that.
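As a shape sketch only, the addition might look something like the following. The class name, field, and accessor are invented here, and the extends clause is commented out so the snippet stands alone without the Spring AI dependency:

```java
import java.util.List;

// Hypothetical shape: a timings list bolted onto the existing response
// metadata. Names are invented for illustration, not existing API.
public class TimedSpeechResponseMetadata /* extends OpenAiAudioSpeechResponseMetadata */ {

    // Seconds from audio start, one entry per @N marker in the prompt.
    private final List<Double> timings;

    public TimedSpeechResponseMetadata(List<Double> timings) {
        this.timings = List.copyOf(timings);
    }

    public List<Double> getTimings() {
        return timings;
    }
}
```

A caller could then read the timings off the response metadata and schedule UI highlights or latency measurements against the returned times.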