I contacted OpenAI support to ask whether this is intended behavior, but have received no response.
I specify audio and text as the modalities, and the input transcript is null.
The output is audio and text, while the input is text only.
Nevertheless, the number of input audio tokens keeps increasing. The input text tokens increase too, of course, which is expected.
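
One way to make this easy to show is to log the input token breakdown for every response. The following is only a minimal sketch: it assumes each response reports a usage object with an `input_token_details` field containing `text_tokens` and `audio_tokens`, and those field names should be checked against the actual payload. The helper name `input_token_breakdown` and the sample numbers are hypothetical and are not measurements.

```python
from typing import Any


def input_token_breakdown(usages: list[dict[str, Any]]) -> list[dict[str, int]]:
    """Return the text/audio input token counts for each usage payload, in order."""
    rows = []
    for turn, usage in enumerate(usages, start=1):
        # Assumed field names; adjust to whatever the API actually returns.
        details = usage.get("input_token_details") or {}
        rows.append({
            "turn": turn,
            "input_text": int(details.get("text_tokens", 0)),
            "input_audio": int(details.get("audio_tokens", 0)),
        })
    return rows


# Made-up numbers, only to show the shape of the output.
sample = [
    {"input_token_details": {"text_tokens": 20, "audio_tokens": 0}},
    {"input_token_details": {"text_tokens": 45, "audio_tokens": 110}},
]

for row in input_token_breakdown(sample):
    print(row)
```

A per-turn table like this, taken from real responses, would show whether the input audio count is nonzero even though only text was sent.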
I asked them about what appears to be an obviously improper charge, and they have no answer. Does anyone know anything about this?