Fine-Tuning a Fine-Tuned Model Causes Weird Training Dynamics

I wanted to fine-tune a model for 20 epochs, but also to test the performance of both the final model and the 10-epoch model on a downstream task. My initial approach was to run two separate fine-tuning jobs: one to 10 epochs and one to 20 epochs, using the same data for both. But then I thought it would be more efficient (and cheaper) to instead fine-tune to 10 epochs, and then fine-tune that fine-tuned model for an additional 10 epochs, yielding two “checkpoints”, so to speak. However, the training dynamics in the second fine-tuning job were odd, as depicted below:

By contrast, when I fine-tuned the model to 20 epochs directly, training progress was smooth.

This irregular behavior makes sense if (as I’m guessing) the learning rate resets to its initial value in the second job, rather than continuing the decay it should follow over the course of training, and there is probably also warm-up happening at the beginning of the new fine-tuning job. But if that’s true, then fine-tuning a fine-tuned model is much less powerful than it could be. It would still be useful for fine-tuning a fine-tuned model on some new task, but I don’t see that as a more common use-case than the “continued fine-tuning” use-case I described above.
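To make the guess concrete, here is a pure-Python sketch of what a restarted schedule would do. OpenAI does not document its internal schedule, so the linear warm-up + linear decay here (and the step counts and peak rate) are my assumptions, not the API’s actual behavior:

```python
# Assumed schedule: linear warm-up, then linear decay to zero.
# All numbers (peak_lr, warmup_steps, step counts) are hypothetical.

def lr_at_step(step, total_steps, peak_lr=1e-4, warmup_steps=100):
    """Learning rate at a given step under warm-up-then-linear-decay."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Linear decay from peak_lr down to 0 over the remaining steps.
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)

total = 2000  # steps in a single 20-epoch job (hypothetical)

# Single 20-epoch job: halfway through, the LR is already well into its decay.
lr_midway = lr_at_step(1000, total)

# Two chained 10-epoch jobs: the second job restarts the schedule, so its
# early steps run at warm-up/peak values instead of the decayed value.
lr_restart_peak = max(lr_at_step(s, total // 2) for s in range(total // 2))

print(f"LR halfway through one 20-epoch job:  {lr_midway:.2e}")
print(f"Peak LR at start of a restarted job:  {lr_restart_peak:.2e}")
```

The restarted job briefly trains at roughly double the learning rate the single long job would be using at that point, which would explain the loss spike at the start of the second job.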

My questions are:

  • Is my explanation above correct, and have others experienced this?
  • Are there ways for me to work around this issue? I considered two things:
    • using model checkpoints, but currently only the last three checkpoints are returned at the end of a fine-tuning job (and the user has no control over which checkpoints are kept)
    • passing the downstream task in as the validation set, even though it isn’t validation data in the conventional sense (i.e. drawn from the same distribution as the fine-tuning data). Also, I’ve noticed that validation accuracy is sometimes reported at unhelpful intervals in the results CSV file, and it’s unclear whether accuracy is computed on the whole validation split or on a single batch (my observations are highly inconsistent with the description in the docs)
  • Is OpenAI working on anything that will address these issues?
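One more workaround I considered, entirely untested: the fine-tuning API does expose a `learning_rate_multiplier` hyperparameter, so in principle one could chain several short jobs and scale each job’s multiplier down to approximate the segment of a single decaying schedule it replaces. The sketch below computes those per-segment multipliers, assuming (again, a guess) a linear decay-to-zero schedule:

```python
# Hypothetical workaround: chain n short fine-tuning jobs and give each one
# a learning_rate_multiplier matching the mean of a linear decay-to-zero
# schedule over that job's slice of training. Whether this actually
# reproduces single-job dynamics is an untested assumption.

def segment_multipliers(n_segments, base_multiplier=1.0):
    """Multiplier for each chained job under an assumed linear decay."""
    mults = []
    for i in range(n_segments):
        start, end = i / n_segments, (i + 1) / n_segments
        # Mean of (1 - t) over [start, end] is 1 - (start + end) / 2.
        mean_fraction = 1.0 - (start + end) / 2
        mults.append(base_multiplier * mean_fraction)
    return mults

print(segment_multipliers(2))  # → [0.75, 0.25]
```

So for my two-job setup, the second job would run at a third of the first job’s multiplier (0.25 vs 0.75). This still can’t undo any warm-up the API applies at the start of each job, so it’s a partial fix at best.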