I want to receive the chat completion as an audio stream and play it for the user (like the voice feature in the OpenAI app)
One way to do it is to receive it as a stream of text and use the TTS api to turn it into an audio stream, but that means I’ll need to send multiple TTS requests for different chunks of the received text. The RPM for the TTS API is 3, so that would not be feasible.
I think the ideal way would be just directly receiving an audio stream from the chat completion API. Does anyone have any tips?
Only when in a free trial. Pay up, and that rate limit is increased.
The chat completions endpoint returns the text generated by the AI. There are no other features except for returning function-call language in a different manner.
The TTS endpoint accepts up to 4096 characters. That allows for almost all responses that aren’t multiple minutes of an AI reading text to you. You’d likely want to do additional system prompting to tell the AI “User receives text as audio, avoid output over 400 words” or similar.
It shouldn’t take much chunking of streamed output to simulate responsiveness. You can send the first two sentences off for TTS, and by the time that is read to the user, you can probably have encoded the rest.
(Also, nobody says you have to use OpenAI’s TTS service…)