Send data stream to TTS API

Hi,
I am learning to develop a feature as follows:

  1. Users submit questions by voice. I use the Whisper API to convert speech to text.
  2. Next, I send that text to the Assistants API so that it answers based on the document I provide, and it returns the answer as a stream.
  3. I use Socket.IO to send this streamed data to the user, the same way ChatGPT does.

My difficulty right now is that, in addition to the text answer, I want a voice answer. Both should be delivered at the same time, just like watching a movie with subtitles.

Is there any way to send the streamed data from the Assistants API to the TTS API? I know TTS can return streamed audio for playback, but I don’t see any documentation about TTS accepting a text stream as input.

I’m using Node.js. If you have a solution, could you point me to a reference? Thanks, everyone.


There is no true “at the same time” unless you parse the AI output yourself, one sentence at a time, and send each sentence for speech synthesis. When streaming, that means identifying the point where a complete section can be spoken: accumulate the output until sentence-length pieces form complete thoughts, get the audio for each piece, and hold back the text display until the audio for the first chunk is ready. Then keep buffering whatever follows.
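The sentence-at-a-time buffering described above might be sketched as follows. This is only a minimal illustration, not a tested integration: `SentenceBuffer`, `speakStream`, `synthesize`, and `emit` are hypothetical names, and the sentence boundary is assumed to be `.`, `!`, or `?` followed by whitespace, which will misfire on abbreviations such as "Dr. Smith".

```typescript
// Collects streamed text deltas and releases them one complete sentence at a time.
class SentenceBuffer {
  private pending = "";

  // Append a streamed delta; return any sentences that are now complete.
  push(delta: string): string[] {
    this.pending += delta;
    const sentences: string[] = [];
    // Assumed boundary: ., ! or ? followed by whitespace.
    const boundary = /[.!?]+\s+/g;
    let start = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.pending)) !== null) {
      const end = match.index + match[0].length;
      const sentence = this.pending.slice(start, end).trim();
      if (sentence.length > 0) sentences.push(sentence);
      start = end;
    }
    // Keep the unfinished tail for the next delta.
    this.pending = this.pending.slice(start);
    return sentences;
  }

  // Call once the stream ends to release whatever text is left.
  flush(): string | null {
    const rest = this.pending.trim();
    this.pending = "";
    return rest.length > 0 ? rest : null;
  }
}

// `synthesize` and `emit` are hypothetical callbacks supplied by the caller:
// one wraps whatever TTS request you use, the other sends the result to the client.
async function speakStream(
  deltas: AsyncIterable<string>,
  synthesize: (text: string) => Promise<Uint8Array>,
  emit: (text: string, audio: Uint8Array) => void,
): Promise<void> {
  const buffer = new SentenceBuffer();
  for await (const delta of deltas) {
    for (const sentence of buffer.push(delta)) {
      // Text is held back until its audio exists, so both arrive together.
      emit(sentence, await synthesize(sentence));
    }
  }
  const rest = buffer.flush();
  if (rest !== null) emit(rest, await synthesize(rest));
}
```

In this sketch the text for a sentence is only emitted once its audio is available, so the client can show and play each chunk together. Synthesis is awaited one sentence at a time for simplicity; a real pipeline would probably request audio for the next sentence while the current one is playing.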

The only way I can see to reliably sync voice to a transcript for real-time display (up to the level of coloring words as they are spoken) is to then send the generated audio to Whisper for timestamps, and play the transcript back as text at the rate those timestamps indicate.
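Once timestamps are available, the display side could look something like the sketch below. It assumes each word arrives with `start` and `end` times in seconds, sorted by `start`; the `TimedWord` shape and both function names are hypothetical, so adapt them to whatever your transcription response actually contains.

```typescript
// Assumed shape of one timestamped word (times in seconds from audio start).
interface TimedWord {
  word: string;
  start: number;
  end: number;
}

// Index of the last word that has started by time `t`, or -1 if none has.
// Uses binary search, so `words` must be sorted by `start`.
function activeWordIndex(words: TimedWord[], t: number): number {
  let lo = 0;
  let hi = words.length - 1;
  let found = -1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (words[mid].start <= t) {
      found = mid;
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return found;
}

// Text that should be visible at playback time `t`.
function transcriptAt(words: TimedWord[], t: number): string {
  return words
    .slice(0, activeWordIndex(words, t) + 1)
    .map((w) => w.word)
    .join(" ");
}
```

On the client you would call `transcriptAt(words, t)` with the audio player's current playback position on each animation frame or time-update event, and use `activeWordIndex` to decide which word to highlight.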


@_j Thank you for your reply. Since I have little experience with these issues, do you have any example code that I can refer to?