While generating audio output for the text:
“OMG Clayton that’s so damn clean!”
OpenAI’s Cove TTS voice has repeatedly inserted an unexpected synthetic beep in the 1,000–2,000 Hz range, closely resembling the censorship tone traditionally used to mask profanity. The beep is an audio hallucination: it corresponds to nothing in the provided transcript. It reproduces intermittently, in approximately 25–33% of playback attempts, and seems tied to emotionally expressive delivery.
Detailed Description:
During multiple playback attempts while using GPT-4o with the voice set to Cove, a synthetic “bleep” tone has been audibly inserted by the TTS engine at different points, often after the word “damn.” Notably, “damn” is not classified as profane by OpenAI’s moderation guidelines, and the text does not include any profanity or moderation triggers.
Contextual Observations:
These tonal artifacts appear to correlate with how expressively Cove renders the line. This sentence is notably enthusiastic, which suggests that inferred emotional intensity may be erroneously triggering moderation-like audio insertions.
Waveform and Frequency Spectrum Analysis:
Analysis of multiple audio samples clearly shows:
- Distinct synthetic tones (~0.3 sec each), artificially inserted.
- Tones consistently falling between 1,000 and 2,000 Hz, the range typical of censorship beeps.
- No visual indicators (like “***”) or moderation markers present in the transcript.
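For anyone who wants to check their own recordings for the same artifact, here is a minimal sketch of how such tones could be flagged automatically. It is not the analysis used for the attached spectrograms. The function name `find_tone_bursts` and every threshold in it (50 ms frames, 60% spectral purity, 0.2 s minimum duration, the 1% loudness floor) are assumptions chosen for illustration, and expressive speech may still produce false positives.

```python
import numpy as np


def find_tone_bursts(samples, sample_rate, band=(1000.0, 2000.0),
                     frame_sec=0.05, purity_min=0.6, min_sec=0.2):
    """Return (start_sec, end_sec, mean_peak_hz) for sustained pure tones in `band`.

    All thresholds are illustrative assumptions, not tuned values.
    """
    samples = np.asarray(samples, dtype=float)
    frame = int(frame_sec * sample_rate)
    hop = frame // 2
    window = np.hanning(frame)
    freqs = np.fft.rfftfreq(frame, d=1.0 / sample_rate)

    # Power spectrum of each half-overlapping, Hann-windowed frame.
    starts = range(0, len(samples) - frame + 1, hop)
    spectra = np.array([
        np.abs(np.fft.rfft(samples[s:s + frame] * window)) ** 2
        for s in starts
    ])
    if spectra.size == 0:
        return []

    total = spectra.sum(axis=1)
    peak_bin = spectra.argmax(axis=1)
    peak_hz = freqs[peak_bin]

    # "Purity": share of frame energy in the peak bin plus two neighbours
    # on each side. A steady sine concentrates almost all energy there.
    purity = np.zeros(len(total))
    for i, b in enumerate(peak_bin):
        lo, hi = max(b - 2, 0), min(b + 3, spectra.shape[1])
        if total[i] > 0:
            purity[i] = spectra[i, lo:hi].sum() / total[i]

    # Ignore near-silent frames, then keep pure frames peaking inside the band.
    loud = total >= 0.01 * total.max()
    tonal = (loud & (purity >= purity_min)
             & (peak_hz >= band[0]) & (peak_hz <= band[1]))

    # Merge consecutive flagged frames and keep runs that last long enough.
    bursts, run = [], []
    for i, flag in enumerate(list(tonal) + [False]):
        if flag:
            run.append(i)
        elif run:
            start = run[0] * hop / sample_rate
            end = (run[-1] * hop + frame) / sample_rate
            if end - start >= min_sec:
                bursts.append((start, end, float(peak_hz[run].mean())))
            run = []
    return bursts
```

`samples` is expected to be a mono array of floats. A reported burst of roughly 0.3 sec with a peak between 1,000 and 2,000 Hz would be consistent with the tones described above.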
Expected Behavior:
Audio rendering should exactly match the provided text without extraneous auditory insertions, censorship tones, or hallucinated audio elements.
Actual Behavior:
Intermittent insertion of artificial censorship-like beeps occurs across repeated playbacks of the emotionally expressive text.
Notes on Reproduction:
The issue reproduces intermittently when the emotionally expressive context or inference is strong. Submitting the same text without such context might not trigger it.
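Because the beep is intermittent, a reproduction rate from a handful of playbacks carries a wide margin of error. The sketch below shows one way to put a range on it using a Wilson score interval. The helper name `wilson_interval` and the counts in the usage note are hypothetical and are not measurements from this report.

```python
import math


def wilson_interval(hits, trials, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = hits / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials
                         + z * z / (4.0 * trials * trials)) / denom
    return center - half, center + half
```

For example, with hypothetical counts of 3 beeps in 10 playbacks, `wilson_interval(3, 10)` gives a range of roughly 11% to 60%, so many more playbacks would be needed to pin the rate down.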
Impact and Concerns:
These audio hallucinations significantly degrade the trustworthiness, reliability, and perceived professionalism of Cove-generated TTS audio, particularly in emotionally nuanced interactions.
Recommendations:
- Investigate Cove’s emotional inference heuristics to identify false-positive moderation triggers.
- Implement improved safeguards against unintended audio artifacts in emotionally expressive contexts.
Attached Evidence:
- Multiple audio recordings demonstrating reproducibility. Hosted here: Google Drive
- Spectrogram analysis images clearly showing inserted synthetic tones: