Can't get a model to follow a specific length / word count

Hey!

I have a note-taking app that uses OpenAI to generate summaries for notes that people take of their books.

I have encountered a problem where it’s hard to get a good summary because the length of a good summary vastly depends on the length of the original text. Shorter texts should produce shorter summaries, and longer texts should make longer summaries.

But GPT struggles with this by default, and the length is often a bit random. I often end up with summaries that are so long they’re not much of a summary, or so short that they become vague and completely useless.

The perfect solution would be to customize the length of the summary in the prompt. For example, asking to summarize this text to 40% of the original length. The problem is that GPT is absolutely horrendous at doing this, even GPT-4. Instead of percentages, I also tried character counts, line counts, sentence counts, token counts, and a variety of combinations and variations. They all fail.

After a lot of frustrating trial and error, it seems that it’s an inherent limitation of these kinds of models. I’ve seen other people having the same problem. After countless hours of prompt engineering, I gave up and figured I’d try fine-tuning a model instead.

I created a dataset using my usual prompt, asking for a summary at 50% of the original length, and I made sure each assistant response actually came in close to 50% of the original text. I used a variety of texts from different kinds of books and, more importantly, of different lengths, anywhere from 100 words all the way to 500, with the assistant response always getting the output length right.
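For anyone trying the same thing, each training record looks roughly like this. This is a sketch of the chat-format JSONL that OpenAI fine-tuning uses; the exact system prompt wording here is just an illustration, not the one I used.

```python
import json

def make_example(text: str, summary: str) -> str:
    """Serialize one (text, 50%-length summary) pair as a JSONL line."""
    return json.dumps({
        "messages": [
            {"role": "system",
             "content": "Summarize the user's text to 50% of its original length."},
            {"role": "user", "content": text},
            {"role": "assistant", "content": summary},
        ]
    })

# Write one line per example to the training file.
line = make_example("A long passage about a book...", "A shorter passage...")
```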

I’ve seen people mention that 50-100 data points are a good heuristic, so I went with around 60. I thought that with so many examples, across so many different contexts, surely a fine-tuned model would at least get a bit better. It didn’t. It came out horrible and made the model worse.

Average error rates:
GPT-4: 22.35%
GPT-4o mini: 27.26%
GPT-4o mini fine-tuned: 57.35%
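For reference, one plausible way to compute such a per-summary error (an assumption on my part: relative deviation from the target word count) is:

```python
def length_error(target_words: int, actual_words: int) -> float:
    """Relative deviation of the summary length from its target word count."""
    return abs(actual_words - target_words) / target_words

# The 120-word note mentioned below: target 60 words, model produced 98.
print(f"{length_error(60, 98):.2%}")  # 63.33%
```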

For some context, 22% doesn’t look particularly bad, but that’s because it’s averaged out. In practice, some summaries end up looking quite stupid. For example, even in this small sample, one of the notes was 120 words. The summary should have been 60 words, but instead came out at 98, which is pretty close to the original and not much of a summary at all.

I’m completely lost on where to go from here. This is quite important for my app to work well. The summaries are key for people to quickly remember important information, and without the correct length, many of them end up becoming useless, either too long to be practical (takes too long to read and people get lazy), or they’re so short that you actually can’t remember what the information was in any meaningful detail.

Any insight about this is super appreciated. Thank you!


I think it is a hard challenge, because it is not only about when to stop, but about how to start and continue so the whole writing style fits the final length and it doesn’t come to a sudden stop. If it is an important challenge, I’d try to finetune a few models for different lengths. Then, when you get the original text, select the matching summarization finetune.

For example:
finetune for 200 words
finetune for 400 words
finetune for 800 words
finetune for 1600 words
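The selection step could be as simple as bucketing by input length. A sketch, assuming the four hypothetical finetunes above and placeholder model IDs you would swap for your real ones:

```python
# Hypothetical fine-tune IDs, keyed by target summary length in words.
FINETUNES = {
    200: "ft:gpt-4o-mini:summarize-200",
    400: "ft:gpt-4o-mini:summarize-400",
    800: "ft:gpt-4o-mini:summarize-800",
    1600: "ft:gpt-4o-mini:summarize-1600",
}

def pick_finetune(input_words: int, ratio: float = 0.5) -> str:
    """Pick the fine-tune whose target length is closest to ratio * input length."""
    target = input_words * ratio
    best = min(FINETUNES, key=lambda length: abs(length - target))
    return FINETUNES[best]

print(pick_finetune(900))  # 900 * 0.5 = 450 -> closest bucket is 400
```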

The question to think about is if these finetunes should have varying lengths of inputs, or if you are sure you will always use each model with a specific length of input.

Remember to keep the dataset you used to finetune. You may need to finetune a newer model soon, at least once gpt-4o-mini reaches end of life. That may come quicker in the LLM space than we expect; it has before.

As a bonus I think GPT-4o-mini finetuning might still be free for a few days, not sure.


A couple of thoughts on this as one of my solutions involves creating automated summaries from a huge variety of sources with OpenAI models.

  1. The models are not good at or capable of counting individual words - so no matter what you try in your prompt this won’t get you far. That said, what tends to work a lot better is to tell the model how many sentences or how many paragraphs to return. With that in mind, you could set up a dynamic logic that involves counting the number of sentences in your original text and then, based on that, dynamically specifying in your prompt the desired number of sentences for your summary. I have not tested it like this but it could be worth a try.

  2. I don’t know what type of content you summarize but I would be cautious about being overly fixated on the length of the summary and instead focus on the type of content that should be included in your summary. You can have a long input text but in some cases this text may include duplicative information for some reason (e.g. I deal with news that often include quotes from various individuals - often the content of those quotes overlaps with the core information). In those cases, you would expect a summary that is focused just on the core information to be much shorter. Hence, instead of defining the target length, you can define the nature of the information that should be covered in your summary.
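The sentence-counting idea in point 1 could be sketched like this (using a naive regex sentence splitter for illustration; a proper NLP library would split more reliably):

```python
import re

def build_summary_prompt(text: str, ratio: float = 0.5) -> str:
    """Count sentences naively, then ask for a proportional number in the summary."""
    # Rough splitter: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    target = max(1, round(len(sentences) * ratio))
    return f"Summarize the following text in exactly {target} sentences:\n\n{text}"

prompt = build_summary_prompt(
    "First point. Second point. Third point. Fourth point."
)
print(prompt.splitlines()[0])  # asks for exactly 2 sentences
```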

My own solution involves multiple steps. Among other things, it relies on dynamic, one-shot prompting whereby the instructions for summarization are dynamically tailored based on the topic category the news falls under and an example summary for a similar news is provided in the prompt for reference. Over time I have found that this has helped to keep the length in check.


I thought about this. Maybe worth a try, but the result with fine-tuning was so disappointing that it’s hard to find the motivation to do all that work again for even more fine-tuned models. How many examples do you think I’d need for each?

  1. I’ve tried this, it also didn’t work well unfortunately.
  2. I’m pretty happy with the summaries in terms of context honestly, it’s just the length itself that is causing issues.

I have not tried with gpt-4o-mini yet, but I think the docs say about 200, with meaningful improvement for every 2x after that.

Just a random idea that came to my mind: how about you ask it to add [1] after every sentence? It would act as a note for the model itself. Then tell it to stop after X sentences. Before showing it to the user, use a regex like `\[[0-9]+\]` or other code to remove all the count notes.
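Stripping those markers afterwards is straightforward; a sketch, assuming plain `[1]`-style counters:

```python
import re

def strip_count_notes(text: str) -> str:
    """Remove the model's [1], [2], ... sentence counters before showing the user."""
    return re.sub(r"\s*\[[0-9]+\]", "", text)

marked = "The hero sets out. [1] He faces a trial. [2] He returns changed. [3]"
print(strip_count_notes(marked))
# The hero sets out. He faces a trial. He returns changed.
```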


What is it that contributes to the length of the summary? For example, when the summary is too long, is it because there is an excessive use of filler words or certain types of unnecessary details included? Likewise, what’s missing when it is too short?

I would try to identify the common denominator across the summaries you are unhappy with and then try to play around with qualitative instructions to address these issues.


I think that sounds like a great idea. Giving more details of what a “summary” means in each case could improve it. Other ways of asking, like “synopsis”, “TL;DR”, or “key points”, could also help.

I don’t generally limit my posts by number of words, if anything I need my posts much longer. I tried an experiment that might help, but requires you doing extra work.

So I prompted:

I want you to summarize the Gettysburg address into 3 medium paragraphs.

It gave me 3 paragraphs, 263 words. I then prompted:

Analyze that content and use “‡” at the end of a sentence to indicate any sentence I could remove without affecting the overall quality. Don’t remove the sentence, just add the indicator.

I call this “Indicator Prompting”. Useful in many other ways. In this experiment, it identified four sentences I could remove. I look at it more as 4 sentences to remove or edit. I managed to get the summary down to 206 words.

So I prompted:

Using “‡” as an indicator at the end of words in the write-up below, I want you to identify 10 words I can safely remove without reducing the quality of this summary. Don’t remove the words, just indicate which words within the paragraphs could be removed.

Even though I only needed 6 removed, I wanted some additional options. With this last one, I find that in order to remove those words, you often have to remove some neighboring words with them for the sentence to make sense, but with my goal of 200 words, having the key areas to cut identified helps me get there. Again, this required 3 prompts and some editing done in Word.
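If you wanted to automate the sentence-level editing step, dropping the ‡-marked sentences can be done in code. A sketch, assuming the marker always lands at the end of a sentence:

```python
import re

def drop_marked_sentences(text: str) -> str:
    """Remove sentences the model flagged with a trailing '‡' indicator."""
    # Split after sentence-ending punctuation or the marker itself.
    sentences = re.split(r"(?<=[.!?‡])\s+", text.strip())
    kept = [s for s in sentences if not s.endswith("‡")]
    return " ".join(kept)

draft = "Keep this sentence. Drop this filler.‡ Keep this one too."
print(drop_marked_sentences(draft))
# Keep this sentence. Keep this one too.
```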

In the past, I have achieved exact word counts for GPT-generated texts by breaking the task into smaller steps. The first step was to generate the text, the second was to check the word count, and the third was to instruct the model to either continue, stop, or conclude. This was all done within a single prompt, aiming to generate exactly 125 words, for example. However, you can also split this process into multiple model calls, using a script to count the words instead of relying on the LLM.
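A sketch of that multi-call loop, with the word counting done in code rather than by the LLM. Here `call_model` is a placeholder for your actual API call, and the 10% tolerance is an arbitrary choice:

```python
def summarize_to_length(text: str, target_words: int, call_model,
                        max_rounds: int = 5) -> str:
    """Iteratively ask the model to expand or shorten a draft until it fits.

    `call_model(prompt) -> str` stands in for a real chat-completion call.
    """
    draft = call_model(f"Summarize in about {target_words} words:\n{text}")
    for _ in range(max_rounds):
        count = len(draft.split())  # count words in code, not with the LLM
        if abs(count - target_words) <= target_words * 0.1:  # within 10%
            break
        action = "expand" if count < target_words else "shorten"
        draft = call_model(
            f"The draft below is {count} words; {action} it to about "
            f"{target_words} words without losing key points:\n{draft}"
        )
    return draft
```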

I believe the concept is clear, and if this feature provides significant value, you’ll find that using multiple steps to write the final summary will help you achieve the desired result with ease.


Interesting approach, but several prompts don’t quite work for my use-case; it needs to be done at scale.

That would make it stop at a somewhat arbitrary point. It would destroy the quality of the summary.

The summaries are quite good, it’s just a length issue. There is no particular aspect that it consistently gets wrong, it’s just that the output length is inconsistent despite high quality.


If you ever figure it out, please post your results. But given the nature of an LLM like ChatGPT, it thinks on a word-by-word basis, so stopping at a specific point is like asking it to stop in the middle of a sentence. Maybe there is a way to do it. I wish you the best of luck. I know many others are also hoping for the same solution.

Exactly, that wouldn’t work at all. The goal is not to cut it, but rather to generate a smaller piece of content.

Yeah, you’re missing my point entirely.

ChatGPT generates content on a word-by-word basis. You can of course get it to be smaller, but not to a specific word amount, because it doesn’t know precisely what it will say until it says it. It has a general idea of what it will say, but it won’t know which words it will use until it has finished generating them. So your goal of generating a smaller piece of content to a specific word count is, for right now, impossible.

I used the example of stopping in the middle of a sentence because that’s the closest it could get to a specific word count. It’s a nonsensical example, though: it wouldn’t be able to stop at the 200th word, since it doesn’t count words as it generates them. It was meant as a silly example, not a serious suggestion.

It’s a token by token basis (parts of words), which is part of the problem with counting. But yeah, not being able to “count” as it’s generating also makes it difficult.

There’s a new long-output model coming out, I believe, so OpenAI is aware of the limitations and working on them.


I know it’s token-by-token, which can be whole words or parts of words known as subwords, depending on how well trained ChatGPT is on a particular word. I’ve tested that in Python. I find it so fascinating that Byte Pair encoding, which is what ChatGPT uses, was actually developed in 1994.

Given he wanted a specific word count and not a specific token count, I was trying to keep to his level of engagement.

There are actually long-output LLM models out there. I have prompts that can generate long content in ChatGPT; it still takes several prompts, but the result is a continued discussion without hallucinations.


Most of the commercial note-taking apps use multiple prompts that get appended together: one will check for who was present, one will look for key points, one for follow-ups needed, and so on, you get the idea.
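That multi-prompt pattern can be sketched like this. The aspect prompts are hypothetical examples, and `ask` stands in for a real API call:

```python
# Hypothetical aspect prompts; adapt to whatever sections your notes need.
ASPECTS = {
    "Attendees": "List who was present in this note:",
    "Key points": "List the key points of this note:",
    "Follow-ups": "List any follow-ups needed:",
}

def composite_summary(note: str, ask) -> str:
    """Run one prompt per aspect and append the sections together.

    `ask(prompt) -> str` stands in for a real chat-completion call.
    """
    sections = []
    for heading, instruction in ASPECTS.items():
        answer = ask(f"{instruction}\n\n{note}")
        sections.append(f"{heading}:\n{answer}")
    return "\n\n".join(sections)
```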
