I am currently using your Text-to-Speech API for an educational project aimed at providing accessible learning materials.
To enhance accessibility, it would be great if the TTS response could include a timestamped transcript (start and end times for the spoken text). This would let users see captions that stay synchronized with the speech output.
Here are a few challenges with different approaches I’ve tried:
- Browser-based caption generators like “react-speech-recognition” are not sufficiently accurate or synchronized with the TTS output.
- Using Whisper for a separate transcription pass adds complexity, leads to synchronization issues, and increases cost.
- Displaying the original text alone would not account for the varying speech rate, making it difficult to follow along.
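
To illustrate what I have in mind, here is a rough Python sketch of how word-level timestamps could be turned into WebVTT captions on my side. The input shape is purely my assumption: `WordTiming` (a word with `start` and `end` in seconds) and the helper names are hypothetical and not part of any existing API response.

```python
from dataclasses import dataclass


@dataclass
class WordTiming:
    """Hypothetical per-word timing; times are in seconds from audio start."""
    word: str
    start: float
    end: float


def format_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def words_to_webvtt(words: list[WordTiming], max_words_per_cue: int = 7) -> str:
    """Group consecutive words into fixed-size cues and emit WebVTT text."""
    lines = ["WEBVTT", ""]
    for i in range(0, len(words), max_words_per_cue):
        chunk = words[i:i + max_words_per_cue]
        # A cue spans from the first word's start to the last word's end.
        start = format_timestamp(chunk[0].start)
        end = format_timestamp(chunk[-1].end)
        text = " ".join(w.word for w in chunk)
        lines += [f"{start} --> {end}", text, ""]
    return "\n".join(lines)
```

With something like this, the captions would follow the actual speech rate of the generated audio instead of being guessed from the input text.
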
Thanks!