Timestamped Captions for TTS API [Feature Request]

I am currently using your Text-to-Speech API for an educational project aimed at providing accessible learning materials.

To enhance accessibility, it’d be great to have access to the timestamped transcript along with the TTS response. This would enable users to see captions that are perfectly synchronized with the speech output.
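To make the request concrete, here is a rough sketch of how a client might consume such a transcript. The `WordTiming` shape and the `activeWordIndex` helper are assumptions made up for illustration; nothing here reflects an existing API response format.

```typescript
// Hypothetical shape: these field names are assumptions, not part of any real API.
interface WordTiming {
  word: string;
  start: number; // seconds from the beginning of the audio (inclusive)
  end: number;   // seconds from the beginning of the audio (exclusive)
}

// Return the index of the word being spoken at `time`, or -1 if no word is active.
// Assumes `timings` is sorted by start time and the intervals do not overlap.
function activeWordIndex(timings: WordTiming[], time: number): number {
  let lo = 0;
  let hi = timings.length - 1;
  // Binary search over the sorted, non-overlapping intervals.
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (time < timings[mid].start) {
      hi = mid - 1;
    } else if (time >= timings[mid].end) {
      lo = mid + 1;
    } else {
      return mid;
    }
  }
  return -1;
}
```

In a browser, this could be called from an audio element's `timeupdate` handler with `audio.currentTime` to decide which word to highlight.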

Here are a few challenges with different approaches I’ve tried:

  1. Browser-based caption generators like “react-speech-recognition” are not sufficiently accurate or synchronized with the TTS output.
  2. Using Whisper for a separate transcription adds complexity, leads to synchronization issues and increases cost.
  3. Displaying the original text would not account for the varying speech rates, making it difficult to follow along.

Thanks!


I think you need my npm library

React / Vanilla JS Text to Speech with highlighting the words and sentences that are being spoken using audio files, text to speech API, and web speech synthesis API

It can produce a timestamp for each word on the client side (no need to use Whisper). With just the input text and the audio file generated by the TTS API, you can do TTS with highlighting.
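For readers wondering how timings can be derived from only the text and the audio, here is a naive sketch of one possible approach: spread the audio duration across the words in proportion to their character counts. This is an assumption for illustration only and not necessarily how the library does it; the `estimateWordTimings` name and the `EstimatedTiming` shape are hypothetical.

```typescript
// Hypothetical result shape, used only for this illustration.
interface EstimatedTiming {
  word: string;
  start: number; // seconds
  end: number;   // seconds
}

// Naive estimate: each word gets a share of the total duration
// proportional to its character count. Pauses and punctuation are ignored,
// so the result is only approximate.
function estimateWordTimings(text: string, durationSeconds: number): EstimatedTiming[] {
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const totalChars = words.reduce((sum, w) => sum + w.length, 0);
  if (totalChars === 0 || durationSeconds <= 0) return [];

  let elapsed = 0;
  return words.map((word) => {
    const start = elapsed;
    elapsed += (word.length / totalChars) * durationSeconds;
    return { word, start, end: elapsed };
  });
}
```

In a browser, the duration could come from the `duration` property of the audio element once its metadata has loaded.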

Beyond that capability, it has many powerful and flexible programmatic APIs that you can use directly.

Just check out my repo or try the demo website.