Having an audio conversation with AI seems like a game changer, and I want to integrate that into my app. Looking at the API, the way to do it currently is to:

  1. record audio
  2. send it to the transcription endpoint
  3. send the transcribed text to the chat endpoint to get AI text response
  4. send the AI text response to the text-to-speech endpoint

I haven’t implemented it yet, but it seems to me to be a lot of steps that might cause lag issues (a rough sketch of what I mean is below). Is this a feasible way to go, and are there any plans to extend the API to do steps 2-4 in one go, e.g. an audio dialogue endpoint?
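For reference, here’s roughly what I have in mind as a sketch with the Node SDK (untested; whisper-1, gpt-4, and tts-1 are just the models I’d start with, and the file handling is a placeholder):

    import fs from "fs";
    import OpenAI from "openai";

    const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

    async function audioTurn(recordedFilePath) {
        // 2. transcribe the recorded audio
        const transcription = await openai.audio.transcriptions.create({
            model: "whisper-1",
            file: fs.createReadStream(recordedFilePath),
        });

        // 3. send the transcribed text to the chat endpoint
        const chat = await openai.chat.completions.create({
            model: "gpt-4",
            messages: [{ role: "user", content: transcription.text }],
        });
        const reply = chat.choices[0].message.content;

        // 4. turn the AI text response into speech
        const speech = await openai.audio.speech.create({
            model: "tts-1",
            voice: "alloy",
            input: reply,
        });
        fs.writeFileSync("reply.mp3", Buffer.from(await speech.arrayBuffer()));
    }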


IIRC steps 3 and 4 support streaming, which should reduce lag; for step 3 it would look something like the sketch below. Not aware of any more consolidated approach, though.
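A minimal sketch with the Node SDK (transcribedText stands in for the output of step 2; adjust the model to your setup):

    import OpenAI from "openai";

    const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

    async function streamReply(transcribedText) {
        const stream = await openai.chat.completions.create({
            model: "gpt-4",
            messages: [{ role: "user", content: transcribedText }],
            stream: true,
        });

        // Chunks arrive as the model generates them, so you can start
        // TTS on early sentences instead of waiting for the full reply.
        for await (const chunk of stream) {
            process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
        }
    }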


Have you had success accessing the text-to-speech endpoint? My GPT never gets an MP3 back…


Yes, this works for me:

    import axios from "axios";

    // `text` is whatever you want spoken aloud.
    const generateSpeech = async (text) => {
        try {
            const response = await axios.post(
                "https://api.openai.com/v1/audio/speech",
                {
                    model: "tts-1",
                    input: text,
                    voice: "alloy",
                },
                {
                    headers: {
                        Authorization: `Bearer ${
                            import.meta.env.VITE_OPENAI_API_KEY
                        }`,
                    },
                    // Without this, axios decodes the MP3 bytes as text.
                    responseType: "blob",
                }
            );

            // response.data is already a Blob, so it can be turned
            // into an object URL and played directly.
            const url = URL.createObjectURL(response.data);
            const audio = new Audio(url);
            audio.play();
        } catch (error) {
            console.error(error);
        }
    };
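
Two things to watch out for: responseType: "blob" is essential, since otherwise axios decodes the MP3 bytes as text and you get a corrupted response (which may be why you never get an MP3 back). And putting the key in VITE_OPENAI_API_KEY ships it to every visitor's browser, so for anything beyond a local prototype, proxy the call through a backend.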