I'm using TTS via the API (gpt-4o-mini-tts) with input text that contains a question for the user. Most of the time, though not always, the model answers the question instead of just reading the text aloud.
The input was:
“Le los traje de la tienda.” Say if it is correct, what it means, and fix it if needed.
The output (transcribed) was:
Le los traje de la tienda. It is a correct sentence in Spanish. It means I brought them from the store. However, if you want to be extremely clear, you could say se los traje de la tienda. That version makes it even clearer that los is the direct object and de la tienda indicates where they were brought from.
I have tried giving it instructions such as "just read the text" and "never answer any questions", but it doesn't change anything.
I have been able to reproduce the same behavior in the Playground, and switching models doesn't improve things either.
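For reference, here is a minimal sketch of the kind of request involved, assuming the standard `/v1/audio/speech` endpoint. The voice name, the exact wording of the instructions, and the helper names `build_speech_request` and `synthesize` are placeholders for illustration, not necessarily the exact setup used.

```python
import json
import urllib.request

API_URL = "https://api.openai.com/v1/audio/speech"


def build_speech_request(text, instructions):
    """Build the JSON body for a text-to-speech request."""
    return {
        "model": "gpt-4o-mini-tts",
        "voice": "alloy",  # placeholder voice, any supported voice works
        "input": text,  # the text that should be read aloud verbatim
        "instructions": instructions,  # the "just read it" guidance
    }


def synthesize(api_key, body):
    """POST the request and return the raw audio bytes."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()


# Same input as above; the instructions wording is only an example.
body = build_speech_request(
    "“Le los traje de la tienda.” Say if it is correct, "
    "what it means, and fix it if needed.",
    "Just read the text exactly as written. Never answer any questions.",
)
```

Calling `synthesize(api_key, body)` with a valid key and writing the returned bytes to a file should give audio that can be checked against the input text.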
Does anybody have any ideas?