Hello!
I’m continuing my work on a note-taking app that uses OpenAI to generate summaries, but I’m still struggling to achieve consistent summary lengths. You may recall my previous post where I discussed my challenges with controlling summary lengths despite extensive prompt engineering (you can search for “Can’t get a model to follow a specific length / word count”).
After that, I tried fine-tuning a model on a dataset specifically designed to teach a 50% summary length, but the results were disappointing: the base model's error rate was 27%, and after fine-tuning it jumped to 57%. Many people advised me that what I'm attempting might be nearly impossible given the limitations of LLM architecture (models operate on tokens, not words, and can't reliably count either). However, one promising suggestion was to fine-tune a model on a single fixed length rather than a variety of lengths.
I followed this advice and fine-tuned a model on a dataset with notes around 80 words, aiming to produce summaries of 40 words. Unfortunately, this also backfired, with the baseline model’s error rate of 8% worsening to 54% after fine-tuning.
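(In case it's useful context: by "error rate" I mean the share of generated summaries whose word count misses the length target. I measure it with a simple word-count check along these lines; the ±10% tolerance band in this sketch is illustrative, not a fixed part of my setup.)

```python
def hits_target(note: str, summary: str,
                target_ratio: float = 0.5, tolerance: float = 0.10) -> bool:
    """True if the summary's word count lands within ±tolerance of
    target_ratio * (the note's word count). The 10% band is illustrative."""
    target = target_ratio * len(note.split())
    return abs(len(summary.split()) - target) <= tolerance * target

def error_rate(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (note, summary) pairs that miss the length band."""
    return sum(not hits_target(n, s) for n, s in pairs) / len(pairs)
```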
Now, I’m wondering if I might have made a mistake in my fine-tuning process. Here’s what I did:
- I had 134 original notes, each roughly 80 words long.
- I manually wrote a 50% summary for each note, aiming for about 40 words.
- I created a CSV with two columns: one for the user prompt and one for the assistant's reply (see the conversion sketch after this list).
- My prompt was:
"You are an agent that summarizes book passages. Your task is to condense the content to half of its original length while maintaining the core message.
Guidelines:
- Do not include any introductory phrases.
- Directly state the main ideas and arguments succinctly and accurately.
- Your summary should present the main ideas and arguments clearly, without generalizing or being vague.
- Ensure the summary is no more than 50% of the length of the passage.
This passage has around 80 words; your goal is to provide a summary with around 40 words.
The passage to be summarized is below: “XXX”
- I converted this CSV into a .jsonl file.
- I uploaded it as training data, without using any validation data.
- I used a random seed and kept all hyperparameters (batch size, learning rate multiplier, and number of epochs) at their default values (see the job-creation sketch after this list).
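In case the data formatting is the culprit, here is roughly what my CSV-to-JSONL conversion does. This is a minimal sketch: the file names and the column headers ("prompt" and "summary") are illustrative, not my exact ones.

```python
import csv
import json

# Each CSV row becomes one JSON line in the chat format that
# OpenAI fine-tuning expects: a "messages" list with a user turn
# (the full prompt, including the passage) and an assistant turn
# (the hand-written ~40-word summary).
with open("notes.csv", newline="", encoding="utf-8") as src, \
     open("train.jsonl", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):
        example = {
            "messages": [
                {"role": "user", "content": row["prompt"]},
                {"role": "assistant", "content": row["summary"]},
            ]
        }
        dst.write(json.dumps(example, ensure_ascii=False) + "\n")
```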
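And the upload plus job creation, again as a sketch: the model name and seed value are placeholders, and the hyperparameters are spelled out only to show they sit at their "auto" defaults.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the training file (I used no validation file).
training_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)

# Create the fine-tuning job with default ("auto") hyperparameters.
job = client.fine_tuning.jobs.create(
    model="gpt-4o-mini-2024-07-18",  # placeholder model name
    training_file=training_file.id,
    seed=42,  # placeholder seed value
    hyperparameters={
        "batch_size": "auto",
        "learning_rate_multiplier": "auto",
        "n_epochs": "auto",
    },
)
print(job.id, job.status)
```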
Does it sound like I might have messed something up? I would really appreciate any feedback on what I might be doing wrong.
Thanks for your help!