gpt-4o-mini-tts model censorship

I’m using the TTS endpoints as a feature on a discord bot so people without mics can talk to those of us in voice chat.

This new model is seemingly, and very inconsistently, censoring certain phrases. It’s not bleeping anything out; it’s just sending back an audio clip of the voice saying “I’m sorry, I can’t assist with that.” This is beyond useless.
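For reference, a minimal sketch of the kind of TTS call involved (Python SDK; the voice and message are placeholders, not the bot’s actual code):

```python
# Minimal sketch, assuming the OpenAI Python SDK and an OPENAI_API_KEY env var.
from openai import OpenAI

client = OpenAI()

def synthesize(text: str) -> bytes:
    """Turn a user's typed chat message into speech for the voice channel."""
    response = client.audio.speech.create(
        model="gpt-4o-mini-tts",
        voice="alloy",      # any supported voice; placeholder choice
        input=text,         # the exact message typed by the user
    )
    return response.content  # MP3 bytes by default

# Sometimes the returned clip is not `text` spoken aloud, but the voice
# saying "I'm sorry, I can't assist with that." instead.
audio = synthesize("Hey everyone, sorry my mic is broken today.")
```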


Good catch. A model that just reads text aloud shouldn’t be powered by decision-making intelligence. What you are actually sending this to, though, is a trained gpt-4o.

I think the main motivation is that speech, and its quality, can be far more impactful than the words alone. Phrases like “Grandma, I need bail money”, “Your account has been compromised. Confirm your PIN”, or “you are authorized to kill” become much more powerful when automated programmatically, doing the work of a thousand unintelligible call-center scammers.

Plus, OpenAI is the most prominent company under the eye of any viral “look what I made it say” post.

This happens literally every time OpenAI launches a new modality or task-specific feature that’s powered by an LLM with decision-making logic baked in - as @_j pointed out earlier.

Remember GPT-4 Vision (GPT-4V), which rolled out after a full 8-month red-teaming period? Everyone was impressed by its almost “god-level OCR,” but there were tons of reports of it abruptly refusing simple transcription tasks, returning ridiculous ethics-based refusals like, “I’m sorry, I can’t transcribe personal financial information.”

Now we see exactly the same thing with these new omni-modal GPT-4o based task-specific endpoints. Underneath it all is still an LLM loaded with aggressive guardrails, making arbitrary, and often bizarre, decisions about what’s acceptable. As @rossisai mentioned, it’s completely unreliable for actual use cases, especially customer-facing or high-stakes production scenarios. Imagine deploying a customer chatbot, and suddenly a user’s audio request randomly triggers a refusal like, “I’m sorry, I can’t assist with that,” at a critical interaction. The customer would have zero clue what just happened, leading to confusion or frustration. Developers have no choice but to implement awkward failsafes or fallback options.
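Just to illustrate how awkward such a failsafe ends up looking, here is one possible sketch: round-trip the generated audio through a speech-to-text call and fall back to a dedicated TTS model if the output smells like a refusal. The refusal markers and the fallback model are assumptions on my part, not a recommended pattern:

```python
# Sketch of a refusal failsafe: costs an extra transcription call per synthesis.
import io
from openai import OpenAI

client = OpenAI()

REFUSAL_MARKERS = ("i'm sorry", "i can't assist")  # guessed phrasing

def speak_with_fallback(text: str) -> bytes:
    speech = client.audio.speech.create(
        model="gpt-4o-mini-tts", voice="alloy", input=text
    ).content

    # Transcribe what actually came back to see whether it matches the request.
    audio_file = io.BytesIO(speech)
    audio_file.name = "speech.mp3"  # the SDK infers the format from the name
    heard = client.audio.transcriptions.create(
        model="whisper-1", file=audio_file
    ).text.lower()

    refused = any(m in heard for m in REFUSAL_MARKERS) and not any(
        m in text.lower() for m in REFUSAL_MARKERS
    )
    if refused:
        # Fall back to a non-LLM TTS model that just reads the text.
        speech = client.audio.speech.create(
            model="tts-1", voice="alloy", input=text
        ).content
    return speech
```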

Similarly, I’ve heard the transcription model itself unpredictably stops transcribing after around 10 minutes because, once again, it’s an LLM arbitrarily hitting some output token limit. It’s an unfortunate reminder that, despite how impressive OpenAI’s omni-modal technology is, fundamentally task-dedicated ML models like Whisper, Deepgram, or ElevenLabs, ones that “just do the work” reliably without interjecting needless ethical judgement or ambiguity, will remain the proper production standard going forward.
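If anyone hits that long-audio cutoff, one stopgap sketch is to chunk the recording client-side and transcribe the pieces separately. pydub and the 8-minute chunk length below are purely illustrative assumptions:

```python
# Workaround sketch for the ~10-minute cutoff: transcribe in shorter chunks.
import io
from openai import OpenAI
from pydub import AudioSegment  # pip install pydub (requires ffmpeg)

client = OpenAI()
CHUNK_MS = 8 * 60 * 1000  # 8 minutes, chosen to stay under the observed cutoff

def transcribe_long(path: str) -> str:
    audio = AudioSegment.from_file(path)
    pieces = []
    for start in range(0, len(audio), CHUNK_MS):  # pydub lengths are in ms
        buf = io.BytesIO()
        audio[start:start + CHUNK_MS].export(buf, format="mp3")
        buf.name = f"chunk_{start}.mp3"
        result = client.audio.transcriptions.create(
            model="gpt-4o-transcribe", file=buf
        )
        pieces.append(result.text)
    return " ".join(pieces)
```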