Speech to Text: word-level details

Hello everyone, is there any way to get more detail in the results of a transcript? I need the exact timestamp of each transcribed word/token, not just of whole sentences. Is this possible in any way?

Unfortunately, not via the API. You would need to run the open-source Whisper model yourself and combine it with an additional alignment model.

Check out whisperX, although its word-level timestamps are not perfect.
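
For anyone landing here later, here is a rough sketch of what that could look like in Python. The whisperX calls follow the usage shown in its README as I remember it, so treat the model name, `compute_type` and exact signatures as assumptions that may differ between versions. `flatten_words`, `transcribe_with_word_times` and the `sample` data are hypothetical names made up for illustration; the sample is hand-written in the assumed output shape, not real model output.

```python
from typing import Any


def flatten_words(aligned: dict[str, Any]) -> list[dict[str, Any]]:
    """Collect one entry per word from a whisperX-style aligned result."""
    words = []
    for segment in aligned.get("segments", []):
        for w in segment.get("words", []):
            words.append({
                "word": w["word"],
                # Tokens the aligner cannot place (numerals, for example)
                # may come back without timing keys, so default to None.
                "start": w.get("start"),
                "end": w.get("end"),
            })
    return words


def transcribe_with_word_times(audio_path: str, device: str = "cpu") -> list[dict[str, Any]]:
    """Assumed whisperX flow: transcribe first, then force-align for word times."""
    import whisperx  # lazy import; requires `pip install whisperx`

    audio = whisperx.load_audio(audio_path)
    model = whisperx.load_model("small", device, compute_type="int8")
    result = model.transcribe(audio)
    align_model, metadata = whisperx.load_align_model(
        language_code=result["language"], device=device
    )
    aligned = whisperx.align(result["segments"], align_model, metadata, audio, device)
    return flatten_words(aligned)


# Hand-written sample in the assumed output shape (not real model output).
sample = {
    "segments": [
        {
            "start": 0.10,
            "end": 1.60,
            "text": "Hello everyone 2024",
            "words": [
                {"word": "Hello", "start": 0.10, "end": 0.42},
                {"word": "everyone", "start": 0.48, "end": 0.95},
                {"word": "2024"},  # no timing: alignment failed for this token
            ],
        }
    ]
}

for entry in flatten_words(sample):
    print(entry["word"], entry["start"], entry["end"])
```

Only `flatten_words` runs without whisperX installed. Whatever you build on top should cope with words that have no `start`/`end`, which is part of why the timings are "not perfect".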