Whisper: how do I make the model output punctuation as punctuation, rather than transcribing the words?

I’d like it to behave more similar to e.g. google VTT - if I say, e.g. “This is the list colon newline dash First item newline dash second item exclamation mark”, I want it to output:

“”"
This is the list:

  • First item
  • Second item!
    “”“”

Instead, it faithfully outputs “This is the list colon newline dash First item newline dash second item exclamation mark” :slight_smile:

Tried playing with the initial prompt with limited success, since it’s not instruction-tuned, I have a hard time giving it an example that shows it what I want it to do…

AFAIK Whisper “listens” an audio that should put in words (transcribe) if they are understandable as per its training.
So, as you said if its not instruction-tuned/trained then it can not follow specific instructions.
A workaround would be to pass the output from whisper to a LLM that can much better process a text based on instructions you give.
My case: Using Whisper I transcribed a tutorial video in which its explained how to use an ERP, and in the part where tutor mention the menu path (he just lists the menu/submenus while showing in video) whisper just transcribe is as a list of words… but when i send to GPT4 to create a FAQ from it, even without specific instructions, GPT4 “understood” that text was about explaining the menu/submenus selections and wrote adequately in its output.

Now, based on your post, I tested this on a LLM:
i have this text transcribed from audio:
‘This is the list colon newline dash First item newline dash second item exclamation mark.’
but i need it to be written using the corresponding punctuation mentioned there and with adequate format.

the output from LLM was this:
Certainly! Here’s the text transcribed from audio with the corresponding punctuation and proper formatting:
"This is the list:

  • First item
  • Second item!"

I have used a colon after the word “list” to indicate that a list is about to follow. Each item in the list is preceded by a dash, and a newline is used to separate each item. Finally, I added an exclamation mark at the end of the sentence to match the “exclamation mark” mentioned in the audio transcription.

This case is also discussed, shortly, in the docs

  1. Sometimes the model might skip punctuation in the transcript. You can avoid this by using a simple prompt that includes punctuation: “Hello, welcome to my lecture.”

Since you have a unique use case adapting this may be necessary.

https://platform.openai.com/docs/guides/speech-to-text/improving-reliability