I am mostly using the API for my own use cases (not developing commercially at the moment), and mostly the speech endpoint for text-to-speech, to get a voice response.
I have two questions:
1. Is it still the same, or is there a one-shot voice endpoint now? I mean, is there any ChatGPT response endpoint that doesn't need a second call to the speech API?
2. Is there a way to get a non-chunked voice response from the speech API?
If you mean a direct audio response, there are the gpt-4o-audio-preview and gpt-4o-mini-audio-preview models. You can send audio or text, and receive both audio and transcriptions as responses.
Not sure what you mean here. If it is to identify the audio segments in an audio response, you will have to run a transcription model like Whisper or similar to get the timestamps.
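For example (untested sketch, not official sample code): a raw multipart request to the transcriptions endpoint with response_format=verbose_json and timestamp_granularities[]=segment returns per-segment start/end timestamps. The field names are from the API reference; the filenames and buffer size are made up.

#include <WiFiClientSecure.h>
#include <SD_MMC.h>

void transcribeWithTimestamps(const String& apiKey) {
  WiFiClientSecure client;
  client.setInsecure();  // sketch only: skips certificate validation
  if (!client.connect("api.openai.com", 443)) return;

  File f = SD_MMC.open("/response.mp3", FILE_READ);  // hypothetical file
  if (!f) return;

  String boundary = "----esp32FormBoundary";
  String head =
      "--" + boundary + "\r\nContent-Disposition: form-data; name=\"model\"\r\n\r\nwhisper-1\r\n" +
      "--" + boundary + "\r\nContent-Disposition: form-data; name=\"response_format\"\r\n\r\nverbose_json\r\n" +
      "--" + boundary + "\r\nContent-Disposition: form-data; name=\"timestamp_granularities[]\"\r\n\r\nsegment\r\n" +
      "--" + boundary + "\r\nContent-Disposition: form-data; name=\"file\"; filename=\"response.mp3\"\r\n" +
      "Content-Type: audio/mpeg\r\n\r\n";
  String tail = "\r\n--" + boundary + "--\r\n";
  size_t contentLength = head.length() + f.size() + tail.length();

  client.print("POST /v1/audio/transcriptions HTTP/1.1\r\n"
               "Host: api.openai.com\r\n");
  client.print("Authorization: Bearer " + apiKey + "\r\n");
  client.print("Content-Type: multipart/form-data; boundary=" + boundary + "\r\n");
  client.print("Content-Length: " + String(contentLength) + "\r\n"
               "Connection: close\r\n\r\n");

  client.print(head);
  uint8_t buf[1024];
  while (f.available()) {
    size_t n = f.read(buf, sizeof(buf));
    client.write(buf, n);  // stream the audio file without loading it whole
  }
  client.print(tail);
  f.close();

  // The verbose_json response contains a "segments" array with
  // start/end timestamps; just dump it to serial here.
  while (client.connected() || client.available()) {
    if (client.available()) Serial.write(client.read());
  }
  client.stop();
}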
Thanks for your response. I will check gpt-4o-mini-audio-preview. Where can I find the API reference for it?
About the second question, let me clarify:
The speech endpoint (for TTS) returns audio in a chunked format, so we don't know the length and bitrate of the response audio when connected, and it is very difficult to receive a chunked response on a limited CPU like an ESP32-S3.
That's why I am looking for a solution with a direct download of the audio (MP3/WAV etc.).
Below is my function for receiving audio from the speech endpoint. Sometimes it produces stable audio, but most of the time it is hardly understandable (garbled etc.).
Besides, there is no MP3 header defining the bitrate etc. (not even an ID3 tag).
void sendTTS(String response) {
  String responseFile = "/response.mp3";
  if (WiFi.status() == WL_CONNECTED) {
    http.begin("https://api.openai.com/v1/audio/speech");
    http.addHeader("Authorization", "Bearer " + APIKEY);
    http.addHeader("Content-Type", "application/json");

    // NOTE: 'response' is inserted verbatim; any quotes or newlines in it
    // must be JSON-escaped or the API will reject the payload.
    String jsonPayload = "{";
    jsonPayload += "\"model\":\"tts-1\",";
    jsonPayload += "\"input\":\"" + response + "\",";
    jsonPayload += "\"response_format\":\"mp3\",";
    jsonPayload += "\"voice\":\"" + voice + "\"";
    jsonPayload += "}";

    int httpResponseCode = http.POST(jsonPayload);
    if (httpResponseCode == 200) {
      File file = SD_MMC.open(responseFile, FILE_WRITE);
      if (file) {
        // getStreamPtr() returns the raw body stream; if the server uses
        // chunked transfer encoding, the chunk-size markers end up in the
        // file along with the MP3 data.
        WiFiClient* stream = http.getStreamPtr();
        // Fixed-size buffer: a variable-length array sized by available()
        // can overflow the task stack on a large chunk.
        uint8_t buffer[1024];
        unsigned long lastReadTime = millis();
        const unsigned long idleTimeout = 2000;  // ms without data before giving up
        while (stream->connected()) {
          int available = stream->available();
          if (available > 0) {
            size_t toRead = (size_t)available < sizeof(buffer)
                                ? (size_t)available : sizeof(buffer);
            int bytesRead = stream->readBytes(buffer, toRead);
            file.write(buffer, bytesRead);
            lastReadTime = millis();
          } else {
            if (millis() - lastReadTime > idleTimeout) break;
            vTaskDelay(pdMS_TO_TICKS(100));
          }
        }
        file.close();
        Serial.println("Audio saved to SD card.");
      } else {
        Serial.println("Failed to open file for writing.");
      }
    } else {
      Serial.print("Error code: ");
      Serial.println(httpResponseCode);
    }
    http.end();  // Free resources

    audio.pauseResume();
    audio.setVolume(21);
    audio.pauseResume();
    bool ret = audio.connecttoFS(SD_MMC, responseFile.c_str());
    if (ret)
      Serial.println("Response Read OK");
    else
      Serial.println("Response Read Failed");
  } else {
    Serial.println("WiFi not connected");
  }
}
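One variant I still want to try (untested): letting HTTPClient handle the chunked transfer decoding itself via writeToStream(), so the chunk-size markers never end up inside the MP3 file. Roughly, replacing the manual read loop with:

if (httpResponseCode == 200) {
  File file = SD_MMC.open(responseFile, FILE_WRITE);
  if (file) {
    // Copies the decoded response body straight to the file,
    // handling chunked transfer encoding internally.
    int written = http.writeToStream(&file);
    file.close();
    Serial.printf("Wrote %d bytes to SD card.\n", written);
  }
}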
As for how you deal with bitrates, you will have to explore, or use a different format like WAV. The speech endpoint doesn't provide bitrate information; you must work it out yourself or analyze a response beforehand (I didn't try it, but I suppose it doesn't change between responses, so you could verify once and hard-code it).
There is an example in the docs on how to choose the format.
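For example, if you request "response_format":"wav", the rate is right there in the file header. A minimal sketch, assuming a canonical 44-byte PCM WAV header with no extra chunks before "data":

struct WavInfo {
  uint32_t sampleRate;
  uint16_t bitsPerSample;
  uint16_t channels;
};

bool readWavHeader(File& f, WavInfo& out) {
  uint8_t h[44];
  if (f.read(h, sizeof(h)) != sizeof(h)) return false;
  // "RIFF....WAVE" magic bytes mark a WAV file
  if (memcmp(h, "RIFF", 4) != 0 || memcmp(h + 8, "WAVE", 4) != 0) return false;
  out.channels      = h[22] | (h[23] << 8);                // fmt chunk, little-endian
  out.sampleRate    = h[24] | (h[25] << 8) | (h[26] << 16) | ((uint32_t)h[27] << 24);
  out.bitsPerSample = h[34] | (h[35] << 8);
  return true;
}

Once you have confirmed the values once, you can hard-code them as suggested above.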
If I'm not wrong, the Chat Completions API accepts WAV or MP3 input directly and responds with text and/or audio. But I can't find example usage in the docs. Am I only required to use the chat completion create endpoint? Or do I have to create a file first (file upload) and then use the file ID in the chat completion endpoint? https://platform.openai.com/docs/api-reference/chat/create
The example you posted, and all code provided through the Playground, is Python code. In Python everything is implicit (the library does everything).
But I am working in C++, where no library is provided by OpenAI, so I need to understand how to use the endpoint. Probably you don't know either, but can you at least tell me whether I need to send the input audio to the files endpoint or not?
Apart from that, @everyone: has anyone worked with this API without Python?
OK, I hadn't noticed that there was a cURL option there.
Now I tried posting the following to the endpoint; of course the "data" value is a very long base64-encoded WAV file.
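For reference, the payload shape follows the input_audio example from the Chat Completions reference (base64 data elided; the prompt text and voice here are placeholders):

{
  "model": "gpt-4o-mini-audio-preview",
  "modalities": ["text", "audio"],
  "audio": { "voice": "alloy", "format": "wav" },
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "What is in this recording?" },
        {
          "type": "input_audio",
          "input_audio": { "data": "<base64 wav data>", "format": "wav" }
        }
      ]
    }
  ]
}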
22:25:53.127 → {
22:25:53.127 →   "error": {
22:25:53.127 →     "message": "We could not parse the JSON body of your request. (HINT: This likely means you aren't using your HTTP library correctly. The OpenAI API expects a JSON payload, but what was sent was not valid JSON. If you have trouble figuring out how to fix this, please contact us through our help center at help.openai.com.)",
22:25:53.127 →     "type": "invalid_request_error",
22:25:53.127 →     "param": null,
22:25:53.127 →     "code": null
22:25:53.127 →   }
22:25:53.127 → }
I found the problem; now I am struggling to solve it.
As the base64 string is too long, HTTPClient on Arduino cannot send all the data and the complete JSON.
I will try with WiFiClient and stream the request body instead.
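Here is the kind of thing I plan to try: an untested sketch that computes Content-Length up front and streams the base64 body from the SD card in small chunks, so the whole JSON never has to live in one String. The file name and voice are placeholders; the JSON shape matches the payload posted above.

#include <WiFiClientSecure.h>
#include <SD_MMC.h>

bool postChatAudio(const String& apiKey) {
  WiFiClientSecure client;
  client.setInsecure();  // sketch only: no certificate validation
  if (!client.connect("api.openai.com", 443)) return false;

  File b64 = SD_MMC.open("/request.b64", FILE_READ);  // hypothetical file holding the base64 WAV
  if (!b64) return false;

  // JSON before and after the base64 blob, matching the
  // Chat Completions input_audio payload shape.
  String head = "{\"model\":\"gpt-4o-mini-audio-preview\","
                "\"modalities\":[\"text\",\"audio\"],"
                "\"audio\":{\"voice\":\"alloy\",\"format\":\"wav\"},"
                "\"messages\":[{\"role\":\"user\",\"content\":["
                "{\"type\":\"input_audio\",\"input_audio\":{\"data\":\"";
  String tail = "\",\"format\":\"wav\"}}]}]}";
  size_t contentLength = head.length() + b64.size() + tail.length();

  client.print("POST /v1/chat/completions HTTP/1.1\r\n"
               "Host: api.openai.com\r\n");
  client.print("Authorization: Bearer " + apiKey + "\r\n");
  client.print("Content-Type: application/json\r\n");
  client.print("Content-Length: " + String(contentLength) + "\r\n"
               "Connection: close\r\n\r\n");

  client.print(head);
  uint8_t buf[1024];
  while (b64.available()) {
    size_t n = b64.read(buf, sizeof(buf));
    client.write(buf, n);  // stream the base64 body in small chunks
  }
  client.print(tail);
  b64.close();

  // Dump the response to serial for inspection.
  while (client.connected() || client.available()) {
    if (client.available()) Serial.write(client.read());
  }
  client.stop();
  return true;
}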