Trying to use gpt-4 for correcting, but it summarises

I’m trying to use gpt-4 for correcting transcripts generated with whisper-1 as described by the documentation here:

However, when I try to use gpt-4 for correcting a transcript, it returns summaries.
I’ve tried various prompts to get gpt-4 to correct the text instead of summarise, but to no avail. Any ideas how to get gpt to correct?

Please share exactly what you’ve tried:


  1. System message
  2. Text to be corrected
  3. Erroneous output text
  4. Ideal, correct output text

Then we will be much better prepared to help you debug the issue.


You can try One Simple Prompt.

Increase quality: {transcript}

This should be a straightforward task.

I wrote an example in playground a while back, feel free to use it.


Here’re the params I’m using:

{ "role": "system", "content": "Correct the following transcript generated by OpenAI's Whisper-1 model" },
{"role": "user", "content": transcript}

The full input text is rather long, but can be found in pastebin dot com / 2pKXJUdR

Here’s the output I get

Very important property of our IOMonad that we need to have a escape hatch. And the escape hatch is unsafe because it’s a side-effecting method. It runs things, right? Anytime we have an IO of A or a task of A and we run it, we’re gonna get an A back, but along the way, we’re gonna perform our effects, whichever effects we’ve captured, right? And that’s a fundamental problem of programming with pure values because we know fundamentally our programs are impure, right? We’re going to have to sort of step over to the dark side at some point, right? But the hope is that we can minimize the amount of time that we’re on the dark side, right? If we’re gonna cross the border and just sort of, you know, go into Mordor, we kind of want that to happen as infrequently as possible, you know, optimally like once. And unsafe perform-sync, sort of by its naming and in other sort of best practices, certainly encourages us to do that. And if you’re in an effectful system and you’re working with an IOMonad, this is the type of construct you would want to minimize the use of and try to maximize the extent of your program that is captured within the IOMonad itself. So if we give a type signature to this, right? Like I sort of walked through it in English, but the type signature here is also extremely important, right? So

Tried prefixing the user content with "Increase quality: ", as well as setting it as the system prompt, but I still get just a summary back.

The output is not a summary, it is truncation at the maximum context length remaining.

The full input text is not just “rather long” — it is 7879 tokens!

The output you pasted is 306 tokens.

7879 tokens + 306 tokens = 8,185 tokens. The context length of GPT-4 is 8,192 tokens.
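The arithmetic is easy to verify: with an 8,192-token window, a 7,879-token input leaves almost no room for the reply, so the "summary" is just the model filling whatever budget remains.

```python
# Whole-context budget: input tokens plus output tokens must fit
# inside the model's context window.
CONTEXT_WINDOW = 8192   # GPT-4 (8k) context length
input_tokens = 7879     # the transcript linked above
output_tokens = 306     # the "summary" actually received

remaining = CONTEXT_WINDOW - input_tokens
print(remaining)                    # tokens left for the reply: 313
print(output_tokens <= remaining)   # True: the output simply filled the budget
```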

The only model that would satisfy this is gpt-3.5-turbo-16k-0613, with a 16,385-token context length and output not limited to only 4k tokens like newer models.

So we run it, and I get the exact effect seen before on such large-context tasks: nearly zero difference. When an 8k input is maxed out for rewriting, the only thing the AI does, regardless of instruction, is produce the same text right back at you.

So, just as your input contains ellipses from chunks of audio, you must process the text in chunks of tokens, more like 700 at a time, to give the AI maximum capability to improve the quality without the urge to shorten the input.
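A minimal sketch of such chunking, splitting at sentence boundaries so no chunk cuts a thought in half. The token count here is a rough 4-characters-per-token heuristic; for exact counts you could substitute OpenAI's tiktoken library (an assumption about your setup, not shown here).

```python
import re

def estimate_tokens(text):
    """Very rough token estimate (~4 characters per token for English)."""
    return max(1, len(text) // 4)

def chunk_transcript(text, max_tokens=700):
    """Pack whole sentences into chunks of roughly max_tokens each."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = (current + " " + sentence).strip()
        if current and estimate_tokens(candidate) > max_tokens:
            chunks.append(current)   # close out the full chunk
            current = sentence       # start the next one
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk then goes to its own correction call, keeping the model well under the point where it starts echoing or condensing.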


So, it’s not just the audio track that needs to be split into smaller chunks, the transcript as well in the correction phase. Got it, thank you! :+1:

For the transcript correction phase, what model would you recommend using with long inputs? Should I prefer a model that supports the longest inputs or use gpt-4 and split the transcript into smaller chunks?

Given that input and prompt share tokens, what’s the recommended use of prompt in this case to provide context from the previous input, and in particular tradeoff between prompt and output length?

How do I estimate number of tokens needed for a given prompt + output and thus determine input chunk size?

As for the speech to text conversion, when feeding the audio clips of a longer track to whisper-1, what should be used as prompt for the next clip? The whole text so far or just the previous clip? whisper-1 seems to accept prompts of at least up to 58 kB in size.

What’s the recommendation for audio clip size? I was thinking of using the maximum allowed by the API, 25 MB, but are there other considerations to take into account?

Thanks for your helpful reply!
Still have some additional questions, in particular regarding the use of chat_history in your sample code.

How should I in practice feed context to the corrections API when the full text is too long to be processed in one go?

Does the following represent valid use of system, user and assistant messages for this use case?

Chunk #1 / corrections API call

{ role: "system", content: "You are an English expert ..." },
{ role: "user", content: "<chunk #1 of transcript>"},

Chunk #2 / corrections API call

{ role: "system", content: "You are an English expert ..." },
{ role: "user", content: "<transcript chunk #1>"},
{ role: "assistant", content: "<prior context (e.g. full running summary or summary of just transcript chunk #1)>"},
{ role: "user", content: "<chunk #2 of transcript>"},

Chunk #3 / corrections API call

{ role: "system", content: "You are an English expert ..." },
{ role: "user", content: "<transcript chunk #2>"},
{ role: "assistant", content: "<prior context (e.g. full running summary or summary of just transcript chunk #2)>"},
{ role: "user", content: "<chunk #3 of transcript>"},
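For concreteness, here is a sketch (helper name and system text hypothetical) of how those per-chunk message lists could be assembled in Python, with the prior turn replayed only when you actually want to carry context forward:

```python
# Hypothetical helper: build the message list for one correction call,
# optionally replaying one prior turn as context.
SYSTEM = "You are an English expert ..."

def build_messages(chunk, prev_chunk=None, prev_context=None):
    """Assemble chat messages for a single correction API call."""
    messages = [{"role": "system", "content": SYSTEM}]
    if prev_chunk is not None and prev_context is not None:
        # Prior turn: the earlier user chunk and the assistant reply/summary.
        messages.append({"role": "user", "content": prev_chunk})
        messages.append({"role": "assistant", "content": prev_context})
    messages.append({"role": "user", "content": chunk})
    return messages

# First call carries no history; later calls replay one prior turn.
first = build_messages("<chunk #1 of transcript>")
second = build_messages(
    "<chunk #2 of transcript>",
    prev_chunk="<chunk #1 of transcript>",
    prev_context="<corrected chunk #1 or its summary>",
)
```

The system message is repeated on every call because each API call is stateless; nothing from a previous request survives unless you resend it.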

Should the same system message be repeated with each API call?

What’s the best way to provide context: running summary of all previous chunks, summary of just the previous chunk or short fragment of raw input from previous chunk?

Should context from previous calls be passed as assistant or user message?
If assistant role should be used, what would the corresponding user message then be? The full request from previous round?

I depicted the chat history from my “how to chat” template, just as a reminder of where prior user and assistant messages are placed. Here are quick chatbots using either the old or new Python library methods; they employ a chat history limited to a few past turns:

If you are using advanced techniques on the audio, such as overlaps when transcribing, there might be some clever use of showing the prior chunk so the AI can continue where it left off without duplication.

For improving the writing quality of individual chunks (that should be split at whole sentences at least), that history of prior turns should not be needed. It would just be a distracting expense.