Hey!
I have a note-taking app that uses OpenAI to generate summaries for notes that people take of their books.
I have encountered a problem where it’s hard to get a good summary because the length of a good summary vastly depends on the length of the original text. Shorter texts should produce shorter summaries, and longer texts should make longer summaries.
But GPT struggles with this by default, and it is often a bit random. I often end up with summaries that are so long they’re not much of a summary, or so vague that they become vague and completely useless.
The perfect solution would be to customize the length of the summary in the prompt. For example, asking to summarize this text to 40% of the original length. The problem is that GPT is absolutely horrendous at doing this, even GPT-4. Instead of percentages, I also tried character counts, line counts, sentence counts, token counts, and a variety of combinations and variations. They all fail.
After a lot of frustrating trial and error, it seems that it’s an inherent limitation of these kinds of models. I’ve seen other people having the same problem. After countless hours of prompt engineering, I gave up and figured I’d try fine-tuning a model instead.
I created a dataset where I gave my usual prompt to summarize it to 50%, and then I made sure that the summary actually was close to 50% of the original text. I had a variety of texts with different kinds of books, and more importantly, different kinds of lengths. Anywhere from 100 words all the way to 500 words. With the assistant response always getting the length output right.
I’ve seen people mention that 50-100 data points are a good heuristic, so I went with around 60. I thought that with so many examples, across so many different contexts, surely a fine-tuned model would at least get a bit better. It didn’t. It came out horrible and made the model worse.
Average error rates:
GPT4: 22.35%
GPT4o mini: 27.26%
GPT4o mini fine-tuned: 57.35%
For some context, 22% doesn’t look particularly bad but that’s because it’s averaged out. In practice, some end up looking quite stupid. For example, even in just this small sample, one of the notes was 120 words. It should have outputted 60 words, but instead gave 98, which is pretty close to the original and not a summary at all.
I’m completely lost on where to go from here. This is quite important for my app to work well. The summaries are key for people to quickly remember important information, and without the correct length, many of them end up becoming useless, either too long to be practical (takes too long to read and people get lazy), or they’re so short that you actually can’t remember what the information was in any meaningful detail.
Any insight about this is super appreciated. Thank you!