Synthetic Censorship Tone Inserted by Cove TTS Voice in Non-Profane Text (Intermittently Reproducible Audio Hallucination)

While generating audio output for the text:

“OMG Clayton that’s so damn clean!”

OpenAI’s Cove TTS voice has repeatedly inserted an unexpected synthetic beep at roughly 1–2 kHz, closely resembling the censorship tone traditionally used to mask profanity. The beep is an audio hallucination: it corresponds to nothing in the provided text. It reproduces intermittently, in approximately 25–33% of playback attempts, and seems tied to emotionally expressive delivery.

Detailed Description:

Across multiple playback attempts in GPT-4o with the voice set to Cove, the TTS engine audibly inserted a synthetic “bleep” tone at varying points, most often right after the word “damn.” As far as I can tell, “damn” is not treated as profanity under OpenAI’s moderation guidelines, and the text contains no other profanity or moderation triggers.

Contextual Observations:

These tonal artifacts seem to correlate with the emotional tone that Cove’s voice engine infers for the text. The sentence here is notably enthusiastic, which suggests that emotional intensity may be causing the system to erroneously trigger a moderation-like audio insertion.

Waveform and Frequency Spectrum Analysis:

Analysis of multiple audio samples shows:

  • Distinct synthetic tones, each about 0.3 seconds long, that are not part of the spoken audio.
  • Tones consistently falling between 1,000 and 2,000 Hz, the range typical of censorship beeps.
  • No visual indicators (like “***”) or moderation markers in the transcript.
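
For anyone who wants to check their own recordings for the same artifact, here is a minimal sketch of one way to flag a sustained pure tone in the 1–2 kHz band. It is not the analysis used for this report: it uses the Goertzel algorithm on 50 ms frames and marks a frame as tonal when a single probed frequency holds most of the frame's energy. The function names, the 0.5 threshold, and the synthetic test signal are all assumptions made for illustration, and loading a real recording into a list of float samples is left out.

```python
import math
import random


def goertzel_power(frame, sample_rate, freq):
    """Squared spectral magnitude of `frame` at `freq` (Goertzel recurrence)."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / sample_rate)
    s1 = s2 = 0.0
    for x in frame:
        s1, s2 = x + coeff * s1 - s2, s1
    return s1 * s1 + s2 * s2 - coeff * s1 * s2


def find_tone_segments(samples, sample_rate, band=(1000.0, 2000.0),
                       frame_ms=50, step_hz=10.0, min_tonality=0.5):
    """Return (start_s, end_s) spans where one in-band frequency dominates.

    Tonality is 2*|X(f)|^2 / (N * sum(x^2)); it is close to 1 for a pure
    sine and close to 0 for noise. The threshold is an assumed value.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_probes = int((band[1] - band[0]) / step_hz) + 1
    probes = [band[0] + i * step_hz for i in range(n_probes)]
    segments = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame)
        if energy == 0.0:
            continue  # digital silence cannot contain a tone
        peak = max(goertzel_power(frame, sample_rate, f) for f in probes)
        tonality = 2.0 * peak / (frame_len * energy)
        if tonality >= min_tonality:
            t0 = start / sample_rate
            t1 = (start + frame_len) / sample_rate
            if segments and abs(segments[-1][1] - t0) < 1e-9:
                # Extend the previous span when frames are adjacent.
                segments[-1] = (segments[-1][0], t1)
            else:
                segments.append((t0, t1))
    return segments


def synth_signal_with_beep(sample_rate=8000, duration_s=1.0, beep_hz=1500.0,
                           beep_start_s=0.4, beep_len_s=0.3):
    """Hypothetical test signal: low-level noise plus one sine beep."""
    rng = random.Random(0)  # fixed seed so the signal is repeatable
    n = round(sample_rate * duration_s)
    first = round(sample_rate * beep_start_s)
    last = first + round(sample_rate * beep_len_s)
    signal = []
    for i in range(n):
        x = rng.uniform(-0.05, 0.05)
        if first <= i < last:
            x += 0.5 * math.sin(2.0 * math.pi * beep_hz * i / sample_rate)
        signal.append(x)
    return signal
```

Calling `find_tone_segments(synth_signal_with_beep(), 8000)` should report a single span covering the synthetic beep. Real speech also contains energy in this band, so a span found in an actual recording is only a candidate and should be confirmed by ear or in a spectrogram.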

Expected Behavior:

Audio rendering should exactly match the provided text without extraneous auditory insertions, censorship tones, or hallucinated audio elements.

Actual Behavior:

Intermittent insertion of artificial censorship-like beeps occurs across repeated playbacks of the emotionally expressive text.

Notes on Reproduction:

The issue reproduces intermittently, and mainly when the surrounding conversation or the inferred delivery is emotionally expressive. Submitting the same text without that context may not trigger it.
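
Because the beep only shows up some of the time, a rate estimated from a handful of playbacks is quite uncertain. The sketch below shows one way to put bounds on an observed rate using the Wilson score interval. The function name is made up for illustration, and any counts passed to it would be your own tallies, not figures from this report.

```python
import math


def wilson_interval(hits, trials, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = hits / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p + z2 / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials
                         + z2 / (4.0 * trials * trials)) / denom
    return center - half, center + half
```

For example, `wilson_interval(3, 10)` gives the plausible range for the true rate after hearing the beep in 3 of 10 hypothetical playbacks. The interval is wide at that sample size, which is why logging more attempts helps when reporting a reproduction rate.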

Impact and Concerns:

These audio hallucinations significantly degrade the trustworthiness, reliability, and perceived professionalism of Cove-generated TTS audio, particularly in emotionally nuanced interactions.

Recommendations:

  • Investigate Cove’s emotional inference heuristics to identify false-positive moderation triggers.
  • Implement improved safeguards against unintended audio artifacts in emotionally expressive contexts.

Attached Evidence:

I’m also having an almost identical issue with the Read aloud feature and the Sol voice, on both the Android app and the web app.

Thanks for letting us know, @mugabuga. It’s helpful (and interesting) to hear you’re experiencing this with the Sol voice, too. That definitely suggests this issue might not be isolated to the Cove voice specifically and could indicate something broader within the TTS voice engine. If you capture any examples, feel free to add them here; they might help OpenAI better pinpoint what’s going on.