Fine-Tuning a Model for Specific Length Output

Hello!

I’m continuing my work on a note-taking app that uses OpenAI to generate summaries, but I’m still struggling to achieve consistent summary lengths. You may recall my previous post where I discussed my challenges with controlling summary lengths despite extensive prompt engineering (you can search for “Can’t get a model to follow a specific length / word count”).

After that, I tried fine-tuning a model on a dataset specifically designed to produce summaries at 50% of the original length, but the results were disappointing: the base model’s error rate was 27%, and after fine-tuning it jumped to 57%. Many people advised me that what I’m attempting might be nearly impossible due to the limitations of LLM architecture. However, I received a promising suggestion: fine-tune a model for one specific length instead of a variety of lengths.

I followed this advice and fine-tuned a model on a dataset with notes around 80 words, aiming to produce summaries of 40 words. Unfortunately, this also backfired, with the baseline model’s error rate of 8% worsening to 54% after fine-tuning.
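
For concreteness, by error rate I mean how far the output’s word count lands from the target, averaged over queries. A minimal sketch of one way to measure it (word counting here is naive whitespace splitting):

```python
def length_error(summary: str, target_words: int) -> float:
    """Relative deviation of the summary's word count from the target."""
    n_words = len(summary.split())  # naive whitespace word count
    return abs(n_words - target_words) / target_words

# A 52-word summary against a 40-word target is a 30% error:
# length_error(" ".join(["word"] * 52), 40) -> 0.3
```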

Now, I’m wondering if I might have made a mistake in my fine-tuning process. Here’s what I did:

  1. I had 134 original notes, each roughly 80 words long.

  2. I manually produced a 50% summary for each note, aiming for about 40 words.

  3. I created a CSV with two columns: one for the user prompt and another for the assistant’s reply.

  4. My prompt was:

    "You are an agent that summarizes book passages. Your task is to condense the content to half of its original length while maintaining the core message.

    Guidelines:

    • Do not include any introductory phrases.
    • Directly state the main ideas and arguments succinctly and accurately.
    • Your summary should present the main ideas and arguments clearly, without generalizing or being vague.
    • Ensure the summary is no more than 50% of the length of the passage.

    This passage has around 80 words; your goal is to provide a summary with around 40 words.

    The passage to be summarized is below: “XXX”

  5. I converted this CSV into a .jsonl file (steps 3–7 are sketched in code after this list).

  6. I uploaded this as training data without using any validation data.

  7. I used a random seed and kept all hyperparameters (batch size, learning rate multiplier, and number of epochs) at their default values.
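
For concreteness, the whole pipeline (steps 3–7) looks roughly like this with the openai Python SDK (v1.x); the file names and the base-model id are placeholders:

```python
import csv
import json

from openai import OpenAI

# Steps 3/5: two-column CSV (user prompt, assistant reply) -> chat-format JSONL.
# Assumes the CSV has no header row and exactly two columns per row.
with open("notes.csv", newline="", encoding="utf-8") as f_in, \
        open("train.jsonl", "w", encoding="utf-8") as f_out:
    for user_prompt, assistant_reply in csv.reader(f_in):
        example = {
            "messages": [
                {"role": "user", "content": user_prompt},
                {"role": "assistant", "content": assistant_reply},
            ]
        }
        f_out.write(json.dumps(example, ensure_ascii=False) + "\n")

# Steps 6/7: upload the training file (no validation file) and start the job,
# leaving batch size, learning rate multiplier, and epochs at their defaults.
client = OpenAI()
training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",  # placeholder: whichever base model you fine-tune
    seed=42,                # seed is optional; fixed here for illustration
)
print(job.id)
```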

Does it sound like I might have messed something up? I would really appreciate any feedback on what I might be doing wrong.

Thanks for your help!

If you were seeing a starting error rate of 8%, why not run the model two or more times and pick an output that hits the target?

If you were to fine-tune this behavior, it would make sense for errors to increase sharply at first and then come down once the model locks onto the pattern.

That said, I doubt it would ever be perfect. The model doesn’t know how many words it has already generated, nor does it plan sentences ahead to land on an exact word count. You would probably be better off targeting a number of sentences instead.

Then I would still run it N times and programmatically pick the output closest to the target (sketched below).
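
A minimal sketch of that best-of-N selection with the openai Python SDK; the model id is a placeholder, and the distance here is word count (swap in a sentence count if you target sentences instead):

```python
from openai import OpenAI

client = OpenAI()

def closest_summary(passage: str, target_words: int, n: int = 5) -> str:
    """Sample n candidates and keep the one nearest the target word count."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder: your base or fine-tuned model id
        n=n,                    # n completions from a single request
        temperature=1.0,        # keep some diversity so candidates differ
        messages=[{
            "role": "user",
            "content": f"Summarize the passage below in about {target_words} words:\n\n{passage}",
        }],
    )
    candidates = [choice.message.content for choice in resp.choices]
    return min(candidates, key=lambda s: abs(len(s.split()) - target_words))
```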


I didn’t want to get into the details to avoid making the post overly complicated, but the 8% error rate is a bit misleading and still not usable:

  1. It’s an average; individual queries can be far worse (see the sketch after this list).
  2. Running a second query usually helps, but not enough to solve the problem.
  3. The 8% error rate applies specifically to this word count, which the model handles relatively well. For the longer texts I also need, performance is usually much worse: in my earlier thread the average error rate was 27%, and in one small sample a single query hit 80%. I started with this word count as a test to see whether fine-tuning would improve the model at all, unlike my previous mixed-length approach, which was ineffective.
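
To make point 1 concrete: a mean can hide large individual misses, so a report like this (per-query errors computed as in the earlier sketch) is more informative:

```python
import statistics

def error_report(errors: list[float]) -> str:
    """Summarize per-query relative length errors beyond the mean."""
    return (f"mean={statistics.mean(errors):.0%}  "
            f"median={statistics.median(errors):.0%}  "
            f"max={max(errors):.0%}")

# A mostly-good run hiding one bad outlier:
print(error_report([0.05, 0.02, 0.08, 0.04, 0.80]))
# -> mean=20%  median=5%  max=80%
```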

Figures like these can be misleading because of nuances like the above, but the crux of the matter is that I need to make the model more accurate, and fine-tuning isn’t working as expected.