Update for those who read the original:
We have moved on to making the fine-tunes write fiction.
We are learning that training loss is a meaningless metric for the fiction-writing use case. We are getting great results from fine-tunes with training loss in the 1.8+ range, and we're all going "Um… what does any of this even mean?" LOL.
We are using 10-24 writing examples and finding that a fine-tune on GPT-3.5 16k can:
- write the full 4,096-token output length very easily (we get 2,000-3,000-word responses from GPT-3.5!)
- make the Temperature, Top_P, Frequency Penalty, and Presence Penalty hyperparameters VERY fragile. A slight bump in Frequency or Presence Penalty and you get gibberish. Lower Temperature works best too, around 0.3-0.7 on most fine-tunes (see the first sketch after this list)
- (a side note on process: we are using GPT-4 Turbo to analyze our human writing and surface the keywords the AI associates with it; as I write this, I realize we should probably start using 16k for this, since the goal is to meet the model halfway, where it already works. See the second sketch after this list)
- write unattributed dialogue (one of our benchmarks: a line of dialogue with no "Jane said" or "exclaimed" attached, just "What do you want to eat?")
- write hilarious character interactions (I'm starting the YT at the 42-minute mark, where you can see us compare the regular model to the fine-tune, see how fragile the settings can be, and, you know, juggling lobsters): https://www.youtube.com/live/ctNys4Jl6ME?si=dT8fOWZiB8ooe27D&t=2862
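For anyone who wants to poke at this themselves, here's a minimal sketch of the conservative settings that have been working for us (Python, OpenAI SDK v1; the model ID, system prompt, and scene brief are placeholders, not our actual setup):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical model ID: swap in the ID of your own fine-tune.
FINE_TUNE_MODEL = "ft:gpt-3.5-turbo-1106:your-org::abc123"

response = client.chat.completions.create(
    model=FINE_TUNE_MODEL,
    messages=[
        {"role": "system", "content": "You write fiction in the author's voice."},
        {"role": "user", "content": "Scene brief: two old friends argue over dinner about a long-kept secret."},
    ],
    # Conservative sampling: fine-tunes tip into gibberish fast, so keep
    # Temperature low (0.3-0.7 worked for us) and leave both penalties at 0.
    temperature=0.5,
    top_p=1.0,
    frequency_penalty=0.0,
    presence_penalty=0.0,
    max_tokens=4096,  # let it run the full output length
)

print(response.choices[0].message.content)
```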
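And a rough sketch of the GPT-4 Turbo analysis step mentioned in the list (the prompt wording is just our paraphrase of the idea, not an exact recipe):

```python
from openai import OpenAI

client = OpenAI()

def analyze_voice(sample: str) -> str:
    """Ask GPT-4 Turbo which keywords it associates with a writing sample."""
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",  # GPT-4 Turbo; adjust to whatever alias you use
        messages=[
            {"role": "system", "content": "You are a literary analyst."},
            {
                "role": "user",
                "content": (
                    "List the keywords, genre labels, and style descriptors you "
                    "would use to characterize this writing sample:\n\n" + sample
                ),
            },
        ],
        temperature=0.2,  # low temperature for a stable, repeatable read
    )
    return response.choices[0].message.content

print(analyze_voice("The tide came in sideways, like it had somewhere better to be."))
```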
We are really jazzed about some of the results: we can make fine-tunes that follow scene briefs and stick to an author's voice and style. Many authors now have fine-tunes writing responses where we can't tell whether it's human or AI.