The "Audio out" wav is less than "Generation"

I I am using realtime api with sip to communicate with openai, is this normal or a bug?

The text is an AI transcription from hearing the produced audio. It can make mistakes.

Does the audio file itself sound complete, or is it truncated mid-speech?

1 Like

the audio file only lost “Correct?” than the text output.

our prompt wants ai to say “Let me confirm your order. You ordered [[all dishes]]. The total is ${{price}}. Correct?“, so the text output is what we want, but the audio output is wrong.

1 Like

I can’t say if it is a bug, so much as a model behavior.

The first thing I would try: have that be a complete sentence: “Is your order correct?”

Then prompt up the phrase more in system message as a final output requirement after stating or restating someone’s order.

You can’t place your own “assistant” messages as audio, I suspect so that you can’t influence the speech. However, you could inject a user message early in proper context, “system reminder, after reciting an order you must employ the phrase Is that correct” - and then place that phrase as a recording of the chosen voice model’s output, it saying the message in the tone you want.

You can consider other turns of an order conversation that also need structure reinforced in similar manner with more verbose language that won’t result in truncation by the generated audio token stream trailing off or whatever is happening.

I have tried the follow prompt before:
“Let me confirm your order. You ordered [[all dishes]]. The total is ${{price}}. Is everything correct?”
then the text output is correct, but the audio output lost the “Is everything correct?” sometimes

As the same, I have another prompt to make the AI say:
“Please wait, I’ll place your order now. May I have your name?”
the text output is correct, but the audio output lost the “May I have your name?” sometimes