'Transcription Outsourcing, LLC' repeated throughout Whisper transcript

Hi, I’ve made a little Flask route that runs a Whisper/ChatGPT workflow for audio I send for transcription. It’s been working overall, but a weird response keeps coming back for one specific audio file. It’s just a bit from a podcast on willpower, a self-help oriented clip, yet no matter how many times I put it through, it returns the same bizarre pattern over and over again. Here are the contents (I generate both a summary and the transcript):

Summary: The video highlights the services of Transcription Outsourcing, LLC. The company is repeatedly mentioned, emphasizing its role in providing transcription services.

Transcription:
Transcripts provided by Transcription Outsourcing, LLC. Transcripts provided by Transcription Outsourcing, LLC. Transcription Outsourcing, LLC. Transcription Outsourcing, LLC. Transcription Outsourcing, LLC. Transcription Outsourcing, LLC. Transcription Outsourcing, LLC. Transcription Outsourcing, LLC. Transcription Outsourcing, LLC. Transcription Outsourcing, LLC.

Obviously, this is not what’s going on in the audio at all, so it sent me looking for this company. I thought, how can this be? That was a rabbit-hole move, because it’s an actual company that does transcriptions. Who cares, right?

Then I found another post (see audio-transcription-behaves-erratic/569684) from a year ago. Similar situation, but not exactly: that poster explains they are seeing this company’s name randomly within their transcripts. WTF is actually going on here?

Now, this happened three times with this specific file using my Flask route. So I took the file, converted it to another format, and submitted it straight to Whisper for a context-less, prompt-less transcription. It did in fact return the correct transcript. I thought, “Oh, it might have been a fluke. Maybe that company uses Whisper all day as their bread and butter, so the model is loaded with their name, and if you have prompts in your Whisper sends, it convolutes them, or somehow interprets one of their rules to mark up your transcript.”

I know this sounds bizarre, but it’s the only working theory I have.

Figuring this was a Whisper hallucination (see whisper-transcription-failures-and-hallucinations/705634), just for giggles I tried one more time through my Flask route while writing this post. Keep in mind, I had already gotten the correct transcript using the straight-to-Whisper method with no prompt, but I wanted to test whether the route would now return it too. Surprisingly, no; it just varied the summary promoting this company again. This is what was returned:

Summary: The video focuses on the frequent repetition of a company's name. A TikTok video humorously emphasizes the repetitive mention of "Transcription Outsourcing, LLC" multiple times.

Transcription:
Transcripts provided by Transcription Outsourcing, LLC. Transcripts provided by Transcription Outsourcing, LLC. Transcription Outsourcing, LLC. Transcription Outsourcing, LLC. Transcription Outsourcing, LLC. Transcription Outsourcing, LLC. Transcription Outsourcing, LLC. Transcription Outsourcing, LLC. Transcription Outsourcing, LLC.

Like a mantra, or some ghost-in-the-machine stuff. What do you think? If anyone wants the file to test, I can provide it. Utterly baffled right now.

Silence leads to hallucinations.

Post-process by slicing out any transcription segments that fall between silences, using the timestamps.

Also, what’s your temperature set at? Set it to 0 if it’s not already there.
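Something like this, as a rough sketch. I’m assuming a requests-based call here, and the 0.6 no-speech cutoff is just a starting point to tune:

    import requests

    def transcribe_filtered(audio_path: str, api_key: str) -> str:
        # verbose_json returns per-segment timestamps and confidence fields.
        with open(audio_path, "rb") as f:
            resp = requests.post(
                "https://api.openai.com/v1/audio/transcriptions",
                headers={"Authorization": f"Bearer {api_key}"},
                files={"file": f},
                data={"model": "whisper-1",
                      "response_format": "verbose_json",
                      "temperature": 0},  # deterministic decoding
            )
        resp.raise_for_status()
        segments = resp.json().get("segments", [])
        # Drop segments the model itself flags as probable non-speech;
        # that's where the hallucinated filler tends to land.
        kept = [s["text"] for s in segments if s.get("no_speech_prob", 0) < 0.6]
        return " ".join(t.strip() for t in kept)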

There’s no silence and I don’t think you read the tests I did. Reread my post.

I did. What’s your temperature set at?

I’m not setting temperature.

Didn’t see that in the OpenAI docs, and a search doesn’t clearly instruct you to do that. Yes, I see a few people discussing temperature for the model on Hugging Face and on GitHub, but nothing is really clear about it. One guy asks if he should, with no responses.

If I did add a temperature line and it worked (for the Flask route bit), would that mean the raw send (just sending the audio without a prompt, which didn’t return the mantra) might “auto-set” the temperature? I’m baffled at how one query works while the other sends variations of this company’s name (for this one specific audio file).

Unsanitized data used in the reward-training of audio-to-text modalities results in activation of that training inference on similar inputs.

The highest occurrence of replaying such ancillary metadata would be in under-30-second clips that appear to wrap up and reach a conclusion, mirroring the types of data that a company like Ditto Transcripts would produce on public material that could be scraped by an AI company thirsty for labeled audio data, such as government works for the hearing-impaired.

Not setting the temperature is fine. It defaults to 0.

What is your prompt & which model are you using?

Here are the bits from my code.

For the main transcription:

    data = {
        'model': 'whisper-1',
        'response_format': 'text',
        'prompt': 'This content is sourced from a video or similar online media. It may include informal language, slang, background music, or sound effects. Please transcribe the spoken words verbatim and note any significant audio cues or multiple speakers, if present. The transcription should differentiate clearly between the primary speaker and any additional sounds or voices.'
    }

For the summary:

    gpt_summary_url = "https://api.openai.com/v1/chat/completions"
    summary_prompt = f"You are SummarizeGPT. You create a faithful and easy to read, punctuated summary of the transcription, which originated from an online video. Keep the summary as short as possible, with the first sentence being the main idea followed by a very brief summary. \n\nTranscription:\n{transcription}"
    payload = {
        'model': 'gpt-4o-2024-08-06',
        'messages': [{"role": "user", "content": summary_prompt}],
        'max_tokens': 100,  # adjust token count for summary length
    }
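For context, here’s roughly how those bits fit together in the route. This is a trimmed-down sketch, not my exact code; the function and variable names are just for illustration:

    import requests
    from flask import Flask, request

    app = Flask(__name__)
    API_KEY = "..."  # loaded from config in the real route

    # The full instruction-style prompt quoted above.
    WHISPER_PROMPT = "This content is sourced from a video or similar online media. ..."

    @app.route("/transcribe", methods=["POST"])
    def transcribe_and_summarize():
        audio = request.files["audio"]
        # Step 1: Whisper transcription, with the prompt from above.
        whisper_resp = requests.post(
            "https://api.openai.com/v1/audio/transcriptions",
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"file": (audio.filename, audio.stream)},
            data={"model": "whisper-1", "response_format": "text",
                  "prompt": WHISPER_PROMPT},
        )
        transcription = whisper_resp.text
        # Step 2: GPT summary of the transcription.
        summary_prompt = f"You are SummarizeGPT. ...\n\nTranscription:\n{transcription}"
        summary_resp = requests.post(
            "https://api.openai.com/v1/chat/completions",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={"model": "gpt-4o-2024-08-06",
                  "messages": [{"role": "user", "content": summary_prompt}],
                  "max_tokens": 100},
        )
        summary = summary_resp.json()["choices"][0]["message"]["content"]
        return {"transcription": transcription, "summary": summary}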

Interesting. Some factor in my workflow is activating a training inference that spouts out that company name. A fluke of activation, or maybe a coded way to hack Whisper into promoting a company? Shouldn’t this be classified as an ‘exploit’ in the hacker sense, a way to get Whisper to promote products? Brushing it off as a result of unsanitized training opens the door to the latter interpretation.

Or, I imagine it’s something about my file specifically. Maybe the audio file my program generated off the video was transcribed by this company at some point, so it’s glitching out Whisper? Hmmm…

There must be a considerable amount of references to that company in the training data, since others have been complaining about this specific pattern since ’23. My main takeaway from this answer, and correct me if I’m wrong, is that a number of factors could be activating this “training inference”: the filename, a sequence of words in my prompt, whether the clip was used in training, or the length of the clip. But your explanation also lends itself to the idea that Whisper can be hacked to ‘transcript-troll’ users’ transcripts.

I think you have reproduced your trigger for us.

You are simulating your own unsanitized data.

The “prompt” is not for talking to Whisper itself, giving it instructions about what it is supposed to do.

The prompt, instead, if given, should be the lead-up transcription just before the provided audio, or it can be used to simulate that prior audio, such as by introducing a speaker or their language.

When the audio is preceded by such an uncharacteristic prefix, an equally uncharacteristic suffix seems like a logical thing for the AI to conclude with.
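For example, a prompt like this (the lead-up text is invented; the point is that it reads like transcript, not like instructions):

    data = {
        'model': 'whisper-1',
        'response_format': 'text',
        # Written as if it were the transcript of the audio immediately
        # before this clip, not as directions to the model.
        'prompt': "Welcome back to the show. Today we're talking about willpower and why small habits matter.",
    }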

Yup, this is correct. You are using the prompt incorrectly.

Story of my life.

Oddly enough, ChatGPT gave me that instruction and prompt, go figure. I commented ‘prompt’ out completely and it worked. I have no idea what I could possibly put in there anyway that would help it understand the random clips I’m sending it.

Indeed I was. Still baffling how someone’s LLC can just take over a completion. Yes, removing the prompt solved the problem, but I’m no closer to understanding how to use it for my use case, or the deeper implications of ‘transcript trolling’.

The prompt can be considered the start of the transcript.

Keep in mind that these models weren’t always instruct-based; this was a feature “baked in” later. Before that, you would use them as a form of autocompletion:

“Sticks are”

" sticky."

You could also do

“Arrrrr, sticks, they”

" be sticky, arrrrr"

Same concept with Whisper. It’s not instruct; you can “prime” the response to fit a style, understand & use acronyms, include utterances, etc.


So, following that logic, it makes sense that when the start of your prompt doesn’t match the actual transcript, the model enters a state of insanity (repeating itself), often emitting whatever is prevalent in the space you’ve placed it in.

Talk about transcripts → forever banished and trapped in the weird training-data realm about transcripts.
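For example, priming rather than instructing (all of this priming text is made up):

    data = {
        'model': 'whisper-1',
        'response_format': 'text',
        # Spell acronyms the way you want them transcribed, and include an
        # utterance to signal that utterances should be kept.
        'prompt': "Umm, so today we're going to talk about RLHF and the GPT-4o API.",
    }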

Sounds like you figured out the issue. I would just add that, as a general rule, you should always use an active voice in your prompts versus a passive voice. I asked GPT to rewrite your prompt to use an active voice, and this is what it came up with:

This content comes from a video or similar online media, featuring informal language, slang, background music, or sound effects. Transcribe the spoken words exactly as they are, and make sure to note significant audio cues or multiple speakers when they appear. Clearly distinguish the primary speaker from any additional sounds or voices in the transcription.

Basically, everything needs to be a clear set of instructions to the model. Using a passive voice in prompting tends to open up the prompt to interpretation by the model. It gives the model more room to decide on its own what it wants to do, and that’s generally not what you want to happen.

The Whisper model does not work with instructions.

Since it wasn’t trained using instruction-following techniques, Whisper operates more like a base GPT model. It’s important to keep in mind that Whisper only considers the final 224 tokens of the prompt and ignores anything before them.

https://platform.openai.com/docs/guides/speech-to-text/improving-reliability
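If you do pass a long prompt, only the tail counts, so you can trim it yourself. A rough sketch, using tiktoken’s GPT-2 encoding as an approximation of Whisper’s tokenizer:

    import tiktoken

    def whisper_prompt_tail(prompt: str, limit: int = 224) -> str:
        # Whisper ignores everything before the final 224 tokens of the
        # prompt, so keep only the tail. GPT-2 BPE is an approximation here.
        enc = tiktoken.get_encoding("gpt2")
        tokens = enc.encode(prompt)
        return enc.decode(tokens[-limit:])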

I can confirm that the same thing happens for the Turkish language. It keeps repeating “Altyazı M.K”, which means something like “Subtitles by M.K”.
I researched it and found out that M.K is an author who adds subtitles to various content. OpenAI probably scraped data from there, and the AI got trained to insert different authors for the silent parts. It’s pretty annoying. I ended up deleting all case-insensitive “altyazı xxx” matches from the output, trimming the final text, and checking that the final output is not empty.
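A rough sketch of that cleanup (the exact pattern is an assumption; tune it for your outputs):

    import re

    def clean_transcript(text: str) -> str:
        # Strip hallucinated subtitle credits like "Altyazı M.K"
        # (case-insensitive), then trim whitespace. Callers should treat an
        # empty result as "nothing real was transcribed".
        cleaned = re.sub(r"altyazı\s+\S+", "", text, flags=re.IGNORECASE)
        return cleaned.strip()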

Good to know… So Whisper’s prompt works more like a true completion prompt.

I researched this a little more on Google. The problem seems to be more serious for the Turkish language. I saw that many websites have already started using Whisper as their transcription model and are indexed by Google. There are dozens of pages with the text “Altyazı M.K” in them. This subtitle author (M.K) is becoming extremely famous :sweat_smile:. Twitter and various other big platforms seem to be using the Whisper model, and random Twitter posts show up with this keyword.
