This tutorial / lab experiment is a continuation of an earlier tutorial, Fine-Tuning In a Nutshell with a Single Line JSONL File and n_epochs, where we focused on ensuring a single-line JSONL file would return the “expected” results when given the same completion prompt used in the training file.
During that tutorial, our machine learning experts correctly commented that by using 16 n_epochs the tutorial (me) had inadvertently committed the ML sin of overfitting. @georgei cornered me, in a good way, into testing his example prompt for overfitting, and he correctly guessed (as others did) that the fine-tuning was overfitted.

Here, I continue that discussion with a focus on model fitting, using the same prompt for fine-tuning as before, and will test the model's fit with both the fine-tuning prompt and the @georgei fitting test prompt.
Single-Line JSONL Fine-Tuning Data
{"prompt":"What is your favorite color? ++++", "completion":" My super favorite color is blue. ####"}
The Georgi Model Fitting Prompt
Tell me what is your favorite color by naming an object with that color.
In our first test, I fine-tuned the davinci base model as before but used 12 n_epochs, versus the underfitted 8 or overfitted 16 in the earlier OpenAI lab tests. That seemed like a good place to start, halfway between underfitted and overfitted.
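For completeness, here is roughly what that fine-tune job looks like in code against the legacy fine-tunes endpoint; only davinci and n_epochs=12 come from the lab itself, and the training file id is a hypothetical placeholder.

```python
# Sketch: create the fine-tune with 12 n_epochs on the davinci base model.
# Uses the legacy (pre-v1.0) openai Python library; "file-abc123" is a
# hypothetical placeholder for the uploaded training file id.
import openai

fine_tune = openai.FineTune.create(
    training_file="file-abc123",
    model="davinci",
    n_epochs=12,
)
print(fine_tune["id"], fine_tune["status"])
```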
Fine-Tuned Model, 12 n_epochs
Testing the Georgi Prompt Setup
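In code form, the setup would look something like the sketch below against the legacy Completion endpoint; the fine-tuned model name, temperature, and max_tokens are assumed placeholders rather than the exact playground settings shown in the screenshots.

```python
# Sketch: send the Georgi fitting-test prompt to the fine-tuned model via the
# legacy Completion endpoint. The model name is a hypothetical placeholder;
# temperature and max_tokens are assumed, not the exact playground settings.
import openai

response = openai.Completion.create(
    model="davinci:ft-personal-2023-04-01-00-00-00",  # hypothetical fine-tuned model
    prompt="Tell me what is your favorite color by naming an object with that color.",
    max_tokens=32,
    temperature=0.7,
    stop=[" ####"],  # stop sequence taken from the training completion
)
print(response["choices"][0]["text"])
```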
Testing the Georgi Prompt Results A (Success)
Testing the Georgi Prompt Results B (Success)
Testing the Georgi Prompt Results C (Success)
So, after around 10 completions (only 3 shown here), the 12 n_epochs fine-tuned model scored 100% on the Georgi model-fitting test prompt.
But what happens if we return to the prompt used in the fine-tuning? Care to guess?
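The call is the same sketch as above, only with the original training prompt and its "++++" separator; again, the model name and sampling settings are hypothetical.

```python
# Sketch: re-test with the original fine-tuning prompt, including the "++++"
# separator from the training data. Model name and sampling settings are
# hypothetical placeholders.
import openai

response = openai.Completion.create(
    model="davinci:ft-personal-2023-04-01-00-00-00",  # hypothetical
    prompt="What is your favorite color? ++++",
    max_tokens=32,
    temperature=0.7,
    stop=[" ####"],
)
# A close fit would come back as something like " My super favorite color is blue."
print(response["choices"][0]["text"])
```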
Testing the Fine-Tuning Prompt Results A (Close Fit)
Testing the Fine-Tuning Prompt Results B (Underfitted)
Testing the Fine-Tuning Prompt Results C (Close Fit)
Lab Results
The Georgi prompt always returned an expected reply. However, the original fine-tuning prompt had mixed results. I ran this quite a bit, and my rough guess is that the original prompt returned good results about 70 to 80% of the time.
This indicates that using 12 n_epochs may be slightly underfitted.
Next, later on in this caper, I will try 13 n_epochs to see if we can fit this puppy in a way that would make a Google or OpenAI ML expert proud.
Baking 13 now …