I am using TTS to generate instructions for users that match visual cues. My suggestion is that this could extend OpenAiAudioSpeechResponseMetadata. For example, if my speech prompt is “Press the button now”, I’d like the call to return something like
Response.metadata.timings = [{"word": "Press", "time": 0.0}, {"word": "the", "time": 0.04}, ...]
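For concreteness, a minimal sketch of the data shape I have in mind follows. WordTiming is a hypothetical type (nothing like it exists in Spring AI today), and the onsets beyond the first two entries are made-up values standing in for the “...” above:

```java
import java.util.List;

// Sketch of the proposed data shape only; WordTiming is a hypothetical
// type, not existing Spring AI API.
public class TimingShapeDemo {

    // One entry per spoken word: the word and its onset in seconds.
    record WordTiming(String word, double timeSeconds) {}

    public static void main(String[] args) {
        // What Response.metadata.timings could carry for "Press the button now".
        // Only the first two onsets come from the example above; the last two
        // are invented for illustration.
        List<WordTiming> timings = List.of(
                new WordTiming("Press", 0.0),
                new WordTiming("the", 0.04),
                new WordTiming("button", 0.12),
                new WordTiming("now", 0.55));
        timings.forEach(t -> System.out.printf("%s starts at %.2fs%n", t.word(), t.timeSeconds()));
    }
}
```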
Alternatively, there could be some sort of unspoken markup character that triggers this behavior:
String prompt = "The @0 green line indicates money and the @1 blue line depicts happiness";
would yield
Response.metadata.timings = [0.2, 1.2]
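The timing values themselves would have to come from the TTS engine, but the client-side half of this approach is simple. Here is a rough sketch, assuming the @N syntax above were adopted: it strips the markers so they are never spoken and records where each one sat in the cleaned text, so the TTS layer could later map those positions to playback times. All names here are hypothetical, not existing Spring AI API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of client-side marker handling for the proposed @N syntax.
public class MarkerPreprocessor {

    private static final Pattern MARKER = Pattern.compile("@(\\d+)\\s*");

    // Strips @N markers from the prompt and records the character offset each
    // marker pointed at, so the TTS layer can map offsets to playback times.
    public static String strip(String prompt, List<Integer> markerOffsets) {
        Matcher m = MARKER.matcher(prompt);
        StringBuilder clean = new StringBuilder();
        int last = 0;
        while (m.find()) {
            clean.append(prompt, last, m.start());
            markerOffsets.add(clean.length()); // offset in the cleaned text
            last = m.end();
        }
        clean.append(prompt.substring(last));
        return clean.toString();
    }

    public static void main(String[] args) {
        List<Integer> offsets = new ArrayList<>();
        String clean = strip(
                "The @0 green line indicates money and the @1 blue line depicts happiness",
                offsets);
        System.out.println(clean);   // markers removed, so nothing odd is spoken
        System.out.println(offsets); // character offsets where @0 and @1 sat
    }
}
```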
I suspect the second approach may be preferable, but I’m open to either. Personally, I don’t mind having to tell the TTS engine what I mean with certain characters: I already have to write “microgram” because “µg” isn’t pronounced the way it reads, so writing “at” when I mean the spoken word, because “@” is reserved as an unspoken marker, is no great hindrance.
Several use cases:
- Psychology research (response latency)
- Explaining a complicated user interface
- Gaming
The Spring AI project already defines OpenAiAudioSpeechResponseMetadata; I suggest the timing data could be bolted onto that.
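As a shape sketch only, the addition might look something like the following. The class name, field, and accessor are invented here, and the extends clause is commented out so the snippet stands alone without the Spring AI dependency:

```java
import java.util.List;

// Hypothetical shape: a timings list bolted onto the existing response
// metadata. Names are invented for illustration, not existing API.
public class TimedSpeechResponseMetadata /* extends OpenAiAudioSpeechResponseMetadata */ {

    // Seconds from audio start, one entry per @N marker in the prompt.
    private final List<Double> timings;

    public TimedSpeechResponseMetadata(List<Double> timings) {
        this.timings = List.copyOf(timings);
    }

    public List<Double> getTimings() {
        return timings;
    }
}
```

A caller could then read the timings off the response metadata and schedule UI highlights or latency measurements against the returned times.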