gpt-4o-mini-audio-preview, temp: 0.7.
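For context, this is roughly how I'm making the request; the voice and the message content below are placeholders, not my actual prompt:

```python
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o-mini-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "pcm16"},  # pcm16 is the streaming audio format
    temperature=0.7,
    stream=True,
    messages=[{"role": "user", "content": "..."}],  # placeholder prompt
)
```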
98/100 times the audio generation works fine. Occasionally, it'll begin generating the text transcript + audio and stream it back fine, however it'll then hang
and generate an endless amount of silent audio chunks.
My logs:
(the percentage is the non-silent portion of each chunk; the value in brackets is a truncated printout of the base64-encoded audio)
<90.75%> Audio chunk [1]: [IgAiACEAIAAhACMAIwAkACEAIQAhACIAIgAiACQAHwAiAB4AHw]
<93.42%> Audio chunk [2]: [qwqbC0oLowq5CuYKYgoxCegKGgpACIgJAggkBjwGwAeBBpYFqg]
<95.23%> Audio chunk [3]: [IgJ/AukBMQG5AiIAMABgAkoBtQB4ApsC/ABlAW0C2wDJAY4CLQ]
<96.08%> Audio chunk [4]: [zPkb+zT8FgBRA/IC7AP+AqUBmwRuBr0GOAmzCq4Lsg4wEegRGB]
<90.37%> Audio chunk [5]: [bgIZA74C7wLBA6UDJwSnBH8EXQQtBnkGpQWhBx8HZQQjBSUG1Q]
<90.93%> Audio chunk [6]: [4wBfAHEAdAA3ABEAHQDX/9P/GQDY/53/lP9+/1D/HP/J/n3+n/]
<87.78%> Audio chunk [7]: [///9//7//P/5//r/+f/9//7//v/7//z/+//5//n/+v/7//n/9/]
<17.69%> Audio chunk [8]: [5gmWCD4IYAciB4sFlARABP4CcwGwAWYAzf5q/pT8DfuK+eT4Vv]
<16.82%> Audio chunk [9]: [/v8AAP7//f/7//z//f/+//3/+v/8//7//P/8//7//P/+//z//f]
<0.00%> Audio chunk [10]: [//8AAP7/AAD+//3//v////7//v8AAP///v/+/////v///wAAAA]
<0.00%> Audio chunk [11]: [AAAAAP//AAAAAP//AAAAAP7///8AAAAAAAD///7/AAABAAAAAQ]
<0.00%> Audio chunk [12]: [AQABAAAA//////7///////3///8AAAAAAAD///7/AAABAAAAAA]
<0.00%> Audio chunk [13]: [AAAAAAAAAAAAAP////////7/AAAAAAAAAAD+//7///8BAAEAAA]
<0.00%> Audio chunk [14]: [AQAAAAAA//////7//v/+//3///8AAAAAAAD+//7/AAABAAAAAQ]
<0.00%> Audio chunk [15]: [AQABAAAA//////3//v////7/AAAAAAAA///+//3///8AAAEAAQ]
...
<0.00%> Audio chunk [101]: [AQABAAAA//////3//v////7/AAAAAAAA///+//3///8AAAEAAQ]
At which point my failsafe interrupts the stream.
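For reference, the non-silent percentage in the logs above is computed roughly like this. The silence threshold is something I tuned by ear, and I'm assuming the chunks are 16-bit little-endian PCM (the pcm16 streaming format):

```python
import base64

import numpy as np

SILENCE_THRESHOLD = 500  # amplitude cutoff for 16-bit PCM; tuned by ear


def non_silent_pct(b64_chunk: str) -> float:
    """Percentage of samples in a base64-encoded PCM16 chunk above the silence threshold."""
    pcm = np.frombuffer(base64.b64decode(b64_chunk), dtype="<i2")  # little-endian int16 samples
    if pcm.size == 0:
        return 0.0
    loud = np.count_nonzero(np.abs(pcm.astype(np.int32)) > SILENCE_THRESHOLD)
    return 100.0 * loud / pcm.size
```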
However, this is concerning. Why does the model perform so well at the beginning and then transition into generating endless silent audio?
I will experiment with higher temperatures, as I've heard the [0.8, 1.2] range is recommended.
I want to add that the issue is not with filtering out the silent audio chunks; I am capable of doing that. It's that the request remains long-running and the LLM essentially gets stuck in a loop.
The equivalent request (when working properly) generates no more than 20-30 audio chunks. For it to reach the 100+ chunk mark shows that something is seriously wrong with it.
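For completeness, the failsafe mentioned above is just a counter of consecutive near-silent chunks; a minimal sketch, using the non_silent_pct helper above (how you extract each base64 audio delta depends on your streaming setup, and MAX_SILENT_CHUNKS is an arbitrary cutoff):

```python
MAX_SILENT_CHUNKS = 15  # consecutive near-silent chunks before giving up (arbitrary cutoff)


def collect_audio(b64_chunks):
    """Collect base64 PCM16 deltas, bailing out once the model is stuck emitting silence."""
    kept, silent_run = [], 0
    for b64_chunk in b64_chunks:  # each item: one base64-encoded audio delta from the stream
        if non_silent_pct(b64_chunk) < 1.0:
            silent_run += 1
            if silent_run >= MAX_SILENT_CHUNKS:
                break  # abort rather than let the request run forever
        else:
            silent_run = 0
            kept.append(b64_chunk)
    return kept
```

After breaking out I also close the underlying stream so the HTTP request doesn't stay open, but the point stands that the model shouldn't be producing 100+ silent chunks in the first place.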