Implementing audio conversation with AI

Having an audio conversation with AI seems like a game changer, and I want to integrate that into my app. Looking at the API, the current way to do it is to:

  1. record audio
  2. send it to the transcription endpoint
  3. send the transcribed text to the chat endpoint to get AI text response
  4. send the AI text response to the text-to-speech endpoint
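
To make the round trips concrete, here is a minimal sketch of steps 2-4 chained in one helper. It assumes OpenAI's REST endpoints for transcription, chat completions, and speech; `voiceTurn`, `fetchImpl`, the file name `speech.webm`, and the default chat model are illustrative choices, not anything prescribed by the API. It also assumes an environment with global `fetch`, `FormData`, and `Blob` (a browser or Node 18+).

```javascript
// Sketch of steps 2-4 as one helper. voiceTurn, fetchImpl, and the default
// chat model are illustrative assumptions; swap in whatever your app uses.
const API = "https://api.openai.com/v1";

async function voiceTurn(audioBlob, { apiKey, fetchImpl = fetch, chatModel = "gpt-4o-mini" }) {
  const auth = { Authorization: `Bearer ${apiKey}` };

  // Step 2: recorded audio -> transcribed text (multipart upload).
  const form = new FormData();
  form.append("file", audioBlob, "speech.webm");
  form.append("model", "whisper-1");
  const sttRes = await fetchImpl(`${API}/audio/transcriptions`, {
    method: "POST",
    headers: auth,
    body: form,
  });
  const { text: userText } = await sttRes.json();

  // Step 3: transcribed text -> AI text response.
  const chatRes = await fetchImpl(`${API}/chat/completions`, {
    method: "POST",
    headers: { ...auth, "Content-Type": "application/json" },
    body: JSON.stringify({
      model: chatModel,
      messages: [{ role: "user", content: userText }],
    }),
  });
  const chat = await chatRes.json();
  const replyText = chat.choices[0].message.content;

  // Step 4: AI text response -> audio bytes.
  const ttsRes = await fetchImpl(`${API}/audio/speech`, {
    method: "POST",
    headers: { ...auth, "Content-Type": "application/json" },
    body: JSON.stringify({ model: "tts-1", voice: "alloy", input: replyText }),
  });
  const audio = await ttsRes.arrayBuffer();

  return { userText, replyText, audio };
}
```

Each step waits for the previous one to finish, so the total delay is roughly the sum of three network round trips, which is where the lag concern comes from.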

I haven’t implemented it yet, but it seems to me to be a lot of steps and might cause lag. Is this a feasible way to go, and are there any plans to extend the API to do steps 2-4 in one go, e.g. an audio dialogue endpoint?

IIRC steps 3 and 4 support streaming, which should reduce lag. I'm not aware of a more consolidated approach though.
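
One way streaming could cut the delay, sketched under assumptions: if the chat request is made with `stream: true`, the text arrives in small deltas, and each complete sentence can be handed to text-to-speech without waiting for the full reply. `createSentenceBuffer` below is a hypothetical helper, not part of any SDK, and its sentence detection is deliberately naive (it would split after abbreviations such as "Dr.").

```javascript
// Sketch: group streamed text deltas into sentences so that each sentence
// can be sent to text-to-speech as soon as it is complete.
// createSentenceBuffer is a hypothetical helper name.
function createSentenceBuffer(onSentence) {
  let pending = "";
  return {
    // Call with each text delta as it arrives from the chat stream.
    push(delta) {
      pending += delta;
      // Emit every complete sentence: text ending in . ! or ? followed by whitespace.
      let match;
      while ((match = pending.match(/^([\s\S]*?[.!?])\s+/)) !== null) {
        onSentence(match[1].trim());
        pending = pending.slice(match[0].length);
      }
    },
    // Call once the stream has ended to emit whatever text is left.
    flush() {
      const rest = pending.trim();
      pending = "";
      if (rest) onSentence(rest);
    },
  };
}
```

In use, `onSentence` would kick off a text-to-speech request for that sentence, so the first audio can start playing while the rest of the reply is still being generated.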


Have you had success accessing the text-to-speech endpoint? My GPT never gets an mp3 back…

Yes this works for me:

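    // Reconstructed from a truncated post. Assumptions: axios as the HTTP client,
    // OPENAI_API_KEY as a placeholder for your key, `text` from the enclosing scope.
    import axios from "axios";

    const generateSpeech = async () => {
        try {
            const response = await axios.post(
                "https://api.openai.com/v1/audio/speech",
                { model: "tts-1", input: text, voice: "alloy" },
                {
                    headers: { Authorization: `Bearer ${OPENAI_API_KEY}` },
                    responseType: "blob",
                }
            );
            // Turn the returned audio blob into a playable object URL.
            const url = window.URL.createObjectURL(new Blob([response.data]));
            const audio = new Audio(url);
            audio.play();
        } catch (error) {
            console.error("Text-to-speech request failed:", error);
        }
    };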

Managed to solve this for React Native Expo, if anyone is interested.
I posted it as an answer to the SO post about ‘play-audio-response-from-openai-tts-api-in-react-native-with-expo’ (cannot post links here).

Drop me a DM with the link and I’ll add it to your post for you :+1:

I do these exact procedures in my Bash shell API wrapper for OpenAI (GitHub: mountaineerbr/shellChatGPT).

The only thing you missed is playing the received audio file from OpenAI. Try requesting Opus, which is a more modern format. Also, you can play the audio file while still receiving it! Just be careful which player you use: some pause briefly when the buffer is empty until new data arrives, while others simply stop playing when the buffer is empty…
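
For readers working in JavaScript rather than Bash, here is a rough sketch of the same idea (it is not taken from the shell script): request Opus through the speech endpoint's `response_format` field and forward each chunk to a player as it arrives. `streamSpeech` and `sink` are illustrative names; `sink` is assumed to be any object with `write(chunk)` and `end()`, such as the stdin of a spawned media player. Iterating `res.body` with `for await` is assumed to be available, as it is in Node 18+.

```javascript
// Sketch: request Opus audio and forward chunks to a player while receiving.
// streamSpeech, sink, and fetchImpl are illustrative names, not an official API.
async function streamSpeech(text, sink, { apiKey, fetchImpl = fetch }) {
  const res = await fetchImpl("https://api.openai.com/v1/audio/speech", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "tts-1",
      voice: "alloy",
      input: text,
      response_format: "opus", // instead of the default mp3
    }),
  });
  if (!res.ok) throw new Error(`TTS request failed with status ${res.status}`);

  let bytes = 0;
  // Hand each chunk to the player as soon as it arrives.
  for await (const chunk of res.body) {
    sink.write(chunk);
    bytes += chunk.length;
  }
  sink.end();
  return bytes; // total number of audio bytes forwarded
}
```

Whether playback stays smooth still depends on how the player behind `sink` handles an empty buffer, as described above.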

I have a replay command in my shell script, so that if the audio player fails (which is the case in Termux), the user can replay the audio. Desktop players such as cvlc can usually wait for more data to arrive instead of aborting.
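
The replay idea can be sketched in JavaScript as well (again, this is an illustration and not the Bash implementation): keep a copy of every chunk while forwarding it to the player, so the audio can be played again if the first attempt dies. `createReplayableSink` is a hypothetical helper, and both players are assumed to expose `write(chunk)` and `end()`.

```javascript
// Sketch: forward audio chunks to a player while keeping a copy for replay.
// createReplayableSink is a hypothetical helper name.
function createReplayableSink(player) {
  const chunks = [];
  return {
    write(chunk) {
      chunks.push(chunk); // always keep a copy
      try {
        player.write(chunk);
      } catch {
        // The player failed; keep buffering so a replay is still possible.
      }
    },
    end() {
      try {
        player.end();
      } catch {
        // Ignore errors from a player that has already stopped.
      }
    },
    // Play the buffered audio again through a fresh player.
    replay(newPlayer) {
      for (const chunk of chunks) newPlayer.write(chunk);
      newPlayer.end();
    },
  };
}
```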

Would be awesome to do that. I can’t find any DM button anywhere on the forum… :confused:

That’s exactly what I did, and yeah, it seems like a lot of steps. The lag is actually better than I expected. It’s about the same as talking to the OpenAI app on iPhone, which is tolerable for my hobby robot project. TTS seems to take the most time. I would love to figure out how to stream it.
