Changes in text to speech

I am mostly using the API for my own personal use cases (not developing commercially at the moment).

I am mostly using the speech endpoint for text to speech to get a voice response.
I have two questions:

  1. Is it still the same, or is there a one-shot voice endpoint now? I mean, is there any ChatGPT response endpoint that returns audio directly, without needing a second call to the speech API?
  2. Is there a way to get a non-chunked voice response from the speech API?

thanks

If you mean a direct audio response, there are the gpt-4o-audio-preview and gpt-4o-mini-audio-preview models. You can send either audio or text, and receive both audio and transcriptions as responses.

Not sure what you mean here. If it is to identify the audio segments in an audio response, you will have to run a transcription model like Whisper or similar to get the timestamps.

Thanks for your response. I will check gpt-4o-mini-audio-preview. Where can I find the API reference for it?

About the second question, let me clarify…
the speech endpoint (for TTS) returns audio in a chunked format, so we don’t know the length or bitrate of the response audio when connecting, and it is very difficult to receive a chunked response on a limited CPU like an ESP32-S3.
That’s why I am looking for a solution with a direct download of the audio (MP3/WAV etc.).
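
(For reference, “chunked” here means HTTP Transfer-Encoding: chunked: the response has no Content-Length header, and the body interleaves hexadecimal chunk-size lines with the raw audio bytes, roughly like

2000\r\n
<0x2000 bytes of MP3 data>\r\n
1f40\r\n
<0x1f40 bytes of MP3 data>\r\n
0\r\n
\r\n

so the client has to parse the size lines itself, or use an HTTP library that does, before the audio bytes are usable.)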

Following is my function code for receiving audio from the speech endpoint. Sometimes it gets stable audio, but most of the time the result is hardly understandable (garbled etc.).

Besides, there is no MP3 header defining the bitrate etc. (not even an ID3 tag).

// Assumes globals defined elsewhere: HTTPClient http; Audio audio;
// String APIKEY, voice; plus the WiFi, SD_MMC and ESP32-audioI2S libraries.
void sendTTS(String response) {
  String responseFile = "/response.mp3";
  if (WiFi.status() == WL_CONNECTED) {
    http.begin("https://api.openai.com/v1/audio/speech");
    http.addHeader("Authorization", "Bearer " + APIKEY);
    http.addHeader("Content-Type", "application/json");

    // Note: `response` is not JSON-escaped here; quotes or newlines
    // in the text will break the payload.
    String jsonPayload = "{";
    jsonPayload += "\"model\":\"tts-1\",";
    jsonPayload += "\"input\":\"" + response + "\",";
    jsonPayload += "\"response_format\":\"mp3\",";
    jsonPayload += "\"voice\":\"" + voice + "\"";
    jsonPayload += "}";

    int httpResponseCode = http.POST(jsonPayload);

    if (httpResponseCode == 200) {
      File file = SD_MMC.open(responseFile, FILE_WRITE);
      if (file) {
        WiFiClient* stream = http.getStreamPtr();
        uint8_t buffer[1024];  // fixed-size buffer; a VLA sized by available() risks a stack overflow
        int bytesRead;

        unsigned long lastReadTime = millis();
        const unsigned long idleTimeout = 2000;
        while (stream->connected()) {
          int available = stream->available();
          if (available > 0) {
            bytesRead = stream->readBytes(buffer, min((size_t)available, sizeof(buffer)));
            lastReadTime = millis();
            file.write(buffer, bytesRead);
          } else {
            if (millis() - lastReadTime > idleTimeout) break;  // give up after 2 s of silence
            vTaskDelay(pdMS_TO_TICKS(100));
          }
        }

        file.close();
        Serial.println("Audio saved to SD card.");
      } else {
        Serial.println("Failed to open file for writing.");
      }
    } else {
      Serial.print("Error code: ");
      Serial.println(httpResponseCode);
    }
    http.end();  // Free resources

    audio.pauseResume();
    audio.setVolume(21);
    audio.pauseResume();

    bool ret = audio.connecttoFS(SD_MMC, responseFile.c_str());
    if (ret)
      Serial.println("Response Read OK");
    else
      Serial.println("Response Read Failed");
  } else {
    Serial.println("WiFi not connected");
  }
}
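
Edit: one more thing I want to try is letting HTTPClient decode the chunking itself instead of reading the raw stream, since getStreamPtr() hands back the raw TCP stream with the chunk-size lines still in it. If I read the library source correctly, writeToStream() strips those, so only clean MP3 bytes would reach the file:

if (httpResponseCode == 200) {
  File file = SD_MMC.open(responseFile, FILE_WRITE);
  if (file) {
    // let HTTPClient handle Transfer-Encoding: chunked;
    // returns the number of body bytes written, or a negative error
    int written = http.writeToStream(&file);
    file.close();
    Serial.printf("Wrote %d audio bytes to SD card.\n", written);
  }
}

No idea yet whether that fixes the garbling, but it would explain why the files only play sometimes.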

You can explore it further in the Playground; it will generate the code as you interact.

It uses the Chat Completions API.

As for dealing with bitrates, you will have to experiment or use a different format like WAV. The API doesn’t provide bitrate information, so you have to work it out yourself or analyze a sample beforehand (I didn’t try it, but I suppose it doesn’t change, so you could verify it once and hard-code it).
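
That said, even without an ID3 tag every MPEG audio frame starts with a 4-byte header that encodes the bitrate and sample rate, so you could read them off the first frame yourself. A minimal sketch (Layer III only, untested against the endpoint’s actual output, so verify it on a real file; if the stream is CBR, the first frame is enough):

#include <stdint.h>

// Bitrate tables (kbps) for Layer III, indexed by the 4-bit bitrate field.
static const int kBitrateV1[16] = {0,32,40,48,56,64,80,96,112,128,160,192,224,256,320,0};
static const int kBitrateV2[16] = {0,8,16,24,32,40,48,56,64,80,96,112,128,144,160,0};
// Sample-rate tables (Hz), indexed by the 2-bit sample-rate field.
static const int kRateV1[4] = {44100, 48000, 32000, 0};
static const int kRateV2[4] = {22050, 24000, 16000, 0};

// Scans buf for the first MPEG frame sync and decodes bitrate/sample rate.
bool parseMp3Header(const uint8_t* buf, int len, int* kbps, int* hz) {
  for (int i = 0; i + 3 < len; i++) {
    if (buf[i] != 0xFF || (buf[i + 1] & 0xE0) != 0xE0) continue;  // 11-bit frame sync
    int version = (buf[i + 1] >> 3) & 0x03;  // 3 = MPEG-1, 2 = MPEG-2
    int layer   = (buf[i + 1] >> 1) & 0x03;  // 1 = Layer III
    if (version < 2 || layer != 1) continue; // skip MPEG-2.5/reserved and other layers
    int brIndex = (buf[i + 2] >> 4) & 0x0F;
    int srIndex = (buf[i + 2] >> 2) & 0x03;
    bool mpeg1 = (version == 3);
    *kbps = mpeg1 ? kBitrateV1[brIndex] : kBitrateV2[brIndex];
    *hz   = mpeg1 ? kRateV1[srIndex]   : kRateV2[srIndex];
    return (*kbps != 0 && *hz != 0);     // 0 = "free"/reserved index, treat as failure
  }
  return false;
}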

There is an example in the docs on how to choose the format.

If I’m not wrong, the Chat Completions API accepts WAV or MP3 input directly and responds with text and/or audio. But I can’t find example usage in the docs. Do I only need the chat completion create endpoint? Or do I have to upload a file first (via the Files endpoint) and then use the file ID in the chat completion call? https://platform.openai.com/docs/api-reference/chat/create

Did you check the example I posted?
Also, the Playground provides the code for everything you do in it.

The example you posted, and all code provided through the Playground, is Python code. In Python everything is implicit (the library does everything).
But I am working in C++, where no library is provided by OpenAI, so I need to understand how to use the endpoint directly. Probably you don’t know either, but can you at least tell me whether I need to send the input audio to the Files endpoint or not?

Apart from that, @everyone: has anyone here worked with this API without Python?

No, seriously. I was referring to:

There is a curl example that doesn’t rely on Python and is portable to any language.


OK, I hadn’t noticed that there was a curl option there.
Now I tried posting the following to the endpoint; of course the "data" value is a very long base64-encoded WAV file.

{"model":"gpt-4o-audio-preview","modalities":["text"],"messages":[{"role":"user","content":[{"type":"input_audio","input_audio":{"data":"UklGRiQwAgBX…","format":"wav"}}]}]}

but I got the response:

22:25:53.127 → {
22:25:53.127 →   "error": {
22:25:53.127 →     "message": "We could not parse the JSON body of your request. (HINT: This likely means you aren't using your HTTP library correctly. The OpenAI API expects a JSON payload, but what was sent was not valid JSON. If you have trouble figuring out how to fix this, please contact us through our help center at help.openai.com.)",
22:25:53.127 →     "type": "invalid_request_error",
22:25:53.127 →     "param": null,
22:25:53.127 →     "code": null
22:25:53.127 →   }
22:25:53.127 → }

Any ideas?

Do I have to add a base64 prefix in front of the data, like "audio/wav;base64,"?
I tried adding it just now, but it didn’t change the result…

In this case, just the base64 data with no prefix.
You may also have to check the UTF-8 encoding when reading the file.

Here is the Python version, for possible reference:


import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def encode_base64(file_path):
    """Encode a binary file as a base64 string."""
    with open(file_path, "rb") as file:
        return base64.b64encode(file.read()).decode("utf-8")

encoded_string = encode_base64('output/speech.wav')

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": [
                { 
                    "type": "text",
                    "text": "What is in this recording?"
                },
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": encoded_string,
                        "format": "wav"
                    }
                }
            ]
        },
    ]
)
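
If I read the docs right, the model’s audio reply also comes back base64-encoded, in completion.choices[0].message.audio.data, with a text transcript in completion.choices[0].message.audio.transcript.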

I found the problem, and now I’m struggling to solve it.
Since the base64 string is very long, HTTPClient on Arduino cannot send the full JSON body in one String.
I will try WiFiClient instead and stream the request body.
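
In case it helps anyone later, here is the rough shape of that, as a minimal untested sketch: it assumes WiFiClientSecure, that the base64-encoded audio has already been written to the SD card (the path /request.b64 is just a placeholder), and the same APIKEY global as in the earlier code. The trick is computing Content-Length up front so the body can be written in small pieces:

#include <WiFiClientSecure.h>
#include <SD_MMC.h>

extern String APIKEY;  // defined elsewhere, as in the earlier sketch

// Streams the Chat Completions request instead of building one huge String.
bool postAudioChat(const char* b64Path) {
  const String head =
    "{\"model\":\"gpt-4o-audio-preview\",\"modalities\":[\"text\"],"
    "\"messages\":[{\"role\":\"user\",\"content\":[{\"type\":\"input_audio\","
    "\"input_audio\":{\"data\":\"";
  const String tail = "\",\"format\":\"wav\"}}]}]}";

  File b64 = SD_MMC.open(b64Path, FILE_READ);
  if (!b64) return false;

  WiFiClientSecure client;
  client.setInsecure();  // sketch only: skips certificate validation
  if (!client.connect("api.openai.com", 443)) { b64.close(); return false; }

  // Content-Length must cover the whole body: JSON prefix + base64 + suffix.
  size_t contentLength = head.length() + b64.size() + tail.length();

  client.print("POST /v1/chat/completions HTTP/1.1\r\n");
  client.print("Host: api.openai.com\r\n");
  client.print("Authorization: Bearer " + APIKEY + "\r\n");
  client.print("Content-Type: application/json\r\n");
  client.printf("Content-Length: %u\r\n", (unsigned)contentLength);
  client.print("Connection: close\r\n\r\n");

  client.print(head);
  uint8_t buf[1024];                 // send the base64 payload in small pieces
  while (b64.available()) {
    size_t n = b64.read(buf, sizeof(buf));
    client.write(buf, n);
  }
  client.print(tail);
  b64.close();

  // Reading the HTTP response back over the same socket is left to the caller.
  return true;
}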