Whisper's auto-punctuation

This may or may not have to do with prompting. People have asked before about getting Whisper to transcribe certain punctuation marks reliably by reinforcing them in the prompt. My request is different.

I would like Whisper to completely, utterly, unfailingly NOT output any punctuation, and let me add the punctuation myself in post-processing. Is that possible?

One obvious method is to strip all of the punctuation from Whisper’s output and then add one’s own in post-processing. However, that does not work, because Whisper sometimes transcribes a spoken phrase like “exclamation mark” as “!” (as opposed to its verbatim counterpart), so stripping everything would also delete the marks I dictated on purpose.

I suppose a workaround would be to devise one’s own “!” as something like “my custom exclamation”, and reinforce that awkward term in the prompting, and translate the term back to “!” in post-processing.
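A minimal sketch of that workaround in Python, assuming the placeholder phrase “my custom exclamation” from the paragraph above and a small, hypothetical set of punctuation characters to strip. The function name `restore_custom_marks` is made up for illustration; it is not part of Whisper or any library.

```python
import re

# Placeholder phrase from the post; any unlikely spoken phrase would do.
PLACEHOLDER = "my custom exclamation"


def restore_custom_marks(text, placeholder=PLACEHOLDER):
    """Strip Whisper's own punctuation, then turn the spoken placeholder into '!'."""
    # Step 1: remove the punctuation Whisper added by itself (assumed character set).
    no_punct = re.sub(r"[,.!?;:]", "", text)
    # Step 2: replace the placeholder (and the space before it) with "!".
    pattern = re.compile(r"\s*" + re.escape(placeholder) + r"\b", re.IGNORECASE)
    restored = pattern.sub("!", no_punct)
    # Step 3: collapse any doubled spaces left behind.
    return re.sub(r"\s+", " ", restored).strip()


print(restore_custom_marks("That is great, my custom exclamation. I think so."))
```

The order matters: punctuation is stripped first, so a “!” that Whisper inserted on its own is removed before the dictated placeholder is converted.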

But all of that convolution is basically asinine.

Is there any way to force Whisper not to output any punctuation by default at all?

You could try a prompt, but this seems like a job for regex. Trying to steer a model away from proper syntax will be an uphill battle.


If RegEx could solve the problem, I wouldn’t be asking the question. Granted, I am not even a “computer person”, and the entirety of my knowledge of RegEx is confined to what’s on this page (Regular Expressions (RegEx) - Quick Reference | AutoHotkey v2). I test my regex on RegEx101.

There are 2 problems:

  1. Potentially solvable by “good” RegEx. Below is my RegEx for catching the various “spoken forms” of “comma” that I explicitly dictate (by the way, another disappointment is that putting the word “comma” in the prompt does not entirely stop Whisper from outputting “kama”, hence the need for regex after all):

[,.!?]*[[:blank:]]*(comma|come on[,.]*|come out[,.]*|kama)\b[\s,.!?]*

Above works okay (case-insensitive).
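For readers who want to try the pattern outside AutoHotkey, here is a rough Python port. Two things are assumptions on my part: Python's `re` module has no POSIX classes, so `[[:blank:]]` is written as `[ \t]`, and each match is replaced with a comma plus a space, since the post does not say what the replacement string is. The function name is hypothetical.

```python
import re

# Port of the pattern above. [[:blank:]] is spelled [ \t] because Python's
# re module does not support POSIX character classes.
SPOKEN_COMMA = re.compile(
    r"[,.!?]*[ \t]*(?:comma|come on[,.]*|come out[,.]*|kama)\b[\s,.!?]*",
    re.IGNORECASE,
)


def replace_spoken_commas(text):
    """Replace dictated 'comma' (and common mishearings) with a real comma."""
    # Assumed replacement: ", " for every match.
    return SPOKEN_COMMA.sub(", ", text).strip()


print(replace_spoken_commas("I went home comma and then I slept"))
print(replace_spoken_commas("Hello, kama. world"))
```

The pattern also swallows any punctuation Whisper placed directly before or after the spoken word, which is why stray marks around “kama” disappear.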

  2. RegEx doesn’t help. If you even pause slightly, or often without any pause, Whisper may output something like “If RegEx, could solve the problem” …

How would I, pray tell, get rid of the “comma” in “If RegEx, could solve the problem …” above, which is placed ungrammatically by AI? OpenAI’s recommendation is to “pipe” it to another, more capable GPT for “proof-reading”.

Sigh … if this is not asinine, I don’t know what is.

But @RonaldGRuckus: thanks for chiming in with a comment, and thanks for reading. I have minimized the first type of difficulty to an acceptable degree. The second type is infuriating. In some instances I can anticipate Whisper “auto-puncting” after a pause, and so potentially eliminate it by saying an inline command such as “no punk”, handled by RegEx, after I pause to deliberate over a sentence. But what about the punctuation Whisper adds willy-nilly without any perceptible pause by the speaker?

That’s why I am wondering why they couldn’t put out a version of the model with “no punctuation”. By the way, Dragon Naturally Speaking, the hitherto “industry standard”, doesn’t do “auto punctuation” by default.

One thing to consider here is that Whisper has the power of language models and long sequences of audio and training text. Part of the quality is that it can understand the patterns of speech and how it has been labeled, and what makes a word and sentence.

There is not going to be any easy switch to throw, because the AI is trained on producing transcripts similar to a preposterous amount of training data. That it can produce something that is immediately useful, with commas where they’d belong in natural writing, is a skill and feature.

You could have a language AI perform transformation into the form you want if simply stripping commas or semicolons isn’t enough for you.

Right. If I understand you correctly (again please accept the limitation that I am by no means steeped in programming knowledge), you are suggesting using “AI” on “AI” as a solution?

From this example here: Force no punctuation · openai/whisper · Discussion #589 · GitHub, it seems to me that those who run their own offline or hosted instances of Whisper CAN suppress a whole list of “symbols” including all common punctuations as a matter of “pre-processing”.

That would make my job of post-processing through regex considerably easier.

I am just asking why OpenAI couldn’t make this option available to users who want to approximate more of a “Dragon style” of dictation, for lack of a better term.

Thanks for chiming in!

P.S., is it possible to use --suppress_tokens=0,11,13,30 yourfile.mp3 style of flags on the Whisper hosted by OpenAI? It doesn’t seem to be documented anywhere from what I can find.
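For the self-hosted case described in the linked GitHub discussion, the same flag can be passed from Python. This is only a sketch: it assumes the open-source `openai-whisper` package is installed locally, and it takes the token IDs 0, 11, 13, 30 from the flag quoted above without verifying what they map to for every model. It says nothing about whether the hosted API accepts such an option. The helper names are hypothetical.

```python
# Token IDs copied from the flag quoted in the thread; their meaning for a
# given model's tokenizer is an assumption and is not verified here.
PUNCTUATION_TOKEN_IDS = [0, 11, 13, 30]


def suppress_tokens_arg(token_ids):
    """Format token IDs as a comma-separated string, like the CLI flag value."""
    return ",".join(str(t) for t in token_ids)


def transcribe_without_punctuation(audio_path, model_name="base"):
    """Transcribe locally while suppressing the listed tokens during decoding."""
    # Imported lazily so the helper above works without the package installed.
    import whisper  # assumption: the open-source `openai-whisper` package

    model = whisper.load_model(model_name)
    result = model.transcribe(
        audio_path,
        suppress_tokens=suppress_tokens_arg(PUNCTUATION_TOKEN_IDS),
    )
    return result["text"]


print(suppress_tokens_arg(PUNCTUATION_TOKEN_IDS))
```

As far as I know, the default for this option is "-1", which suppresses a built-in list of non-speech symbols, so passing an explicit list without -1 may drop that default behavior. Other punctuation tokens (for example ones with a leading space) may still get through, so some regex post-processing is probably still needed.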

Here is one solution for you. I request word-level timestamps in the response, one of the features of OpenAI’s hosted version of Whisper. Several people have complained that the timestamped words DON’T carry the original punctuation, just words, which is exactly what you want here.

Then assemble just the words back into a stream separated by spaces, and save them.

Python code (using requests library):

import os
import requests

audio_file_name = "joke.mp3"

# Gets the API key from environment variable
api_key = os.getenv("OPENAI_API_KEY")
headers = {"Authorization": f"Bearer {api_key}"}
url = "https://api.openai.com/v1/audio/transcriptions"

with open(audio_file_name, "rb") as audio_file:
    parameters = {
        "file": (audio_file_name, audio_file),
        "language": (None, "en"),
        "model": (None, "whisper-1"),
        "prompt": (None, "Here is the comedy show."),
        "response_format": (None, "verbose_json"),
        "temperature": (None, "0.1"),
        "timestamp_granularities[]" : (None, "word"),
    }
    response = requests.post(url, headers=headers, files=parameters)

if response.status_code != 200:
    print(f"HTTP error {response.status_code}: {response.text}")
else:
    result = response.json()  # parse the JSON body once
    transcribed_text = result['text']  # the normal text return (with punctuation)
    # The word-level entries carry no punctuation; join them with single spaces
    plain_words = " ".join(word_object["word"] for word_object in result['words'])
    # Save text or words to a file
    base_file_name = os.path.splitext(audio_file_name)[0]
    try:
        with open(f"{base_file_name}_transcription.txt", "w") as file:
            file.write(plain_words)
        print(f"Transcribed text successfully saved to '{base_file_name}_transcription.txt'.")
        print(f"Sample: {plain_words[:320]}")
    except Exception as e:
        print(f"output file error: {e}")

Result transcript:

I remind you of a joke I know you’ve heard this joke At this point where the lady is caught by the cop the cop comes up to her and says Lady you were going 60 miles an hour And she says That’s impossible sir I was only traveling for seven minutes Well of course it’s ridiculous How can you go 60 miles an hour when I wasn’t going an hour And of course the question is How would you answer her if you were the cop Well if you were really the cop then no subtleties are involved It’s very simple You say Tell that to the judge


Thanks very much! Your method works. I will either use your approach or go back to wiping out all punctuation from the get-go, prior to post-processing, sparing only the exclamation mark (!), which seems to be the only punctuation mark that Whisper converts from the spoken words to the symbol on its own.
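The second option could look roughly like this in Python. It is a sketch under assumptions: the set of characters to strip is my own guess at “all punctuation”, and apostrophes and hyphens are deliberately left alone so that words like “I’m” survive. The function name is hypothetical.

```python
import re

# Characters to drop (an assumed set). "!" is deliberately absent so it is kept,
# and apostrophes and hyphens are left alone so contractions stay intact.
_STRIP = re.compile(r"[,.?;:\"“”()\[\]]")


def strip_punctuation_except_exclamation(text):
    """Remove common punctuation from a transcript, sparing only '!'."""
    stripped = _STRIP.sub("", text)
    # Collapse whitespace left behind by removed marks.
    return re.sub(r"\s+", " ", stripped).strip()


print(strip_punctuation_except_exclamation("If RegEx, could solve the problem. Wow!"))
```

Note that this also removes the period inside numbers such as “3.5”, so dictated decimals would need separate handling.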

The goal is not perfection, but “good enough”.

@_j , I learned something useful. Thanks again for your help!
