First fine-tune was kind of disappointing?


[TL;DR] Looking for advice on improving creative writing within a given “lore” context, using prompt engineering and fine-tunes.

I’m building a simple application to write bedtime stories for my kids using OpenAI.

Over the years I have built up a pair of fictional characters who are best friends (Fifi the giraffe and Rhino the rhinoceros) and go on a lot of crazy adventures.

My app is fairly simple: using a form with 3 fields, I build two prompts that look like this:
“Write a short story for kids where Alice and Bob go to [location] for [purpose] with [companion]” for the text, and
“Kids book illustration of a giraffe and a rhino doing [purpose] in [location] with [companion]” for the image.
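
For anyone who wants to reproduce this, the prompt assembly can be sketched roughly as below. This is a minimal illustration, not my actual app code: the function name `build_prompts` and the sample field values are made up for the example.

```python
from typing import Tuple


def build_prompts(location: str, purpose: str, companion: str) -> Tuple[str, str]:
    """Return (story_prompt, image_prompt) built from the three form fields."""
    story_prompt = (
        f"Write a short story for kids where Alice and Bob go to {location} "
        f"for {purpose} with {companion}"
    )
    image_prompt = (
        f"Kids book illustration of a giraffe and a rhino doing {purpose} "
        f"in {location} with {companion}"
    )
    return story_prompt, image_prompt


# Sample values, purely illustrative.
story_prompt, image_prompt = build_prompts("the beach", "a picnic", "a friendly crab")
print(story_prompt)
print(image_prompt)
```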

I use ‘text-davinci-003’ as the model for the story text and DALL-E for the story illustration.

The results are super fun, but I want to improve a few things:

  • Illustrations are nice but:
    ** DALL-E is GOOD at giraffes but kind of sucks at rhinos??? Is there a better way to engineer the prompt to make sure there is a proper giraffe and rhino in the image?
    ** Sometimes the image doesn’t match the story because Davinci makes up an additional character or situation that doesn’t show up in the image (especially when I ramp up the temperature) → Simple enough: I should feed DALL-E the completion from Davinci rather than the original prompt. However, I’m afraid the DALL-E prompt limit will “cut” into the story and leave an unfinished prompt. Any ideas around that? Does the 400-character limit of the DALL-E web demo apply to the API as well?
  • Texts are mostly fun and well written, however:
    ** There are a lot of repetitions in some of the stories. Could this be linked to the fact that the app is in French, with French prompts and output? (BTW, can I expect better results with prompts in English, asking the model to translate its completion afterwards?)
    ** Davinci is OBVIOUSLY not familiar with years’ worth of Fifi and Rhino adventures and, for instance, “wastes tokens” telling me what great friends they are and where they live (duh, we all know that already :wink: )
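
On the truncation worry, one workaround I can think of is trimming the story at a sentence boundary before sending it to the image model, so the prompt never ends mid-sentence. Here is a minimal sketch. The 400-character default is just the web demo figure mentioned above (I don't know whether the API uses the same limit), and `trim_to_limit` is a hypothetical helper name.

```python
import re


def trim_to_limit(text: str, max_chars: int = 400) -> str:
    """Keep whole sentences from the start of `text` without exceeding max_chars."""
    # Split after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = []
    length = 0
    for sentence in sentences:
        # Account for the single space used to re-join sentences.
        extra = len(sentence) + (1 if kept else 0)
        if length + extra > max_chars:
            break
        kept.append(sentence)
        length += extra
    if not kept:
        # The first sentence alone is too long: fall back to a hard cut.
        return text.strip()[:max_chars]
    return " ".join(kept)


story = "Fifi went to the beach. Rhino built a sandcastle! They had fun."
print(trim_to_limit(story, max_chars=30))
```

This only keeps the opening of the story, which may not be the most "illustratable" part, so it is a stopgap rather than a real answer to the question.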

To improve on that last point, I have built an “admin page” where I can manually edit the best stories (proofreading, repetition removal, context rewriting), save them to a JSONL file, upload it, and use it to launch a fine-tune.
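
For context, the JSONL export could look roughly like the sketch below. It assumes the prompt/completion record shape used for fine-tuning completion models, with a fixed separator at the end of every prompt and a fixed end marker at the end of every completion. The separator, the “FIN” marker, and the helper names are assumptions for illustration, not necessarily what my file contains.

```python
import json

SEPARATOR = "\n\n###\n\n"  # assumed marker appended to every prompt
END_MARKER = "\nFIN"       # assumed marker appended to every completion


def to_jsonl_line(prompt: str, story: str) -> str:
    """Serialize one prompt / edited-story pair as a single JSONL line."""
    record = {
        "prompt": prompt.strip() + SEPARATOR,
        # Leading space before the completion text, end marker after it.
        "completion": " " + story.strip() + END_MARKER,
    }
    # ensure_ascii=False keeps French accents readable in the file.
    return json.dumps(record, ensure_ascii=False)


def write_dataset(pairs, path: str) -> None:
    """Write an iterable of (prompt, story) pairs to a JSONL file."""
    with open(path, "w", encoding="utf-8") as f:
        for prompt, story in pairs:
            f.write(to_jsonl_line(prompt, story) + "\n")


line = to_jsonl_line("Fifi et Rhino vont à la plage", "Il était une fois.")
print(line)
```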

My first fine-tuned model came out of the oven this morning and I tested it right away, but WEIRD things happened :crazy_face:

  • The completion doesn’t seem to stop until the token limit is reached, whereas the original model writes a story and ends the completion with it. Here the model writes the story, finishes it with “FIN” and then carries on with the same story until it runs out of tokens… I tried adding stop=“FIN” to my calls, but it just removed the “FIN” and kept going… → What could cause this behavior, and should I call the fine-tuned model differently from the original?
  • Context is now taken into account (no more setting up the characters’ relationship or history), so at least that is a success, but repetition got worse (this is a “feeling”, not yet backed by enough data).
  • The lengths of the fine-tuned stories are more varied than with the original model (which is a good thing, I guess), but the longer stories are really repetitive, even though all the examples in the training data were short and about the same length → again, not really sure what would cause that.
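
To make the question concrete, here is a sketch of the call parameters I am experimenting with, plus a client-side safety net that cuts the text at the end marker. The parameter names are those of the completions endpoint. The numeric values are guesses rather than recommended settings, the separator assumes the training prompts ended with a fixed marker, and the model id is a placeholder.

```python
def build_request(model: str, prompt: str, separator: str = "\n\n###\n\n",
                  stop: str = "FIN") -> dict:
    """Build keyword arguments for a completions call (values are guesses)."""
    return {
        "model": model,
        # Assumption: the prompt ends with the same separator as in training.
        "prompt": prompt.rstrip() + separator,
        "max_tokens": 400,
        "temperature": 0.8,
        "stop": [stop],
        # Penalties meant to discourage repeated tokens; untuned values.
        "frequency_penalty": 0.5,
        "presence_penalty": 0.3,
    }


def cut_at_marker(text: str, marker: str = "FIN") -> str:
    """Drop everything from the first end marker onwards."""
    head, _, _ = text.partition(marker)
    return head.rstrip()


request = build_request("my-fine-tuned-model", "Fifi et Rhino vont à la plage")
print(request["stop"])
print(cut_at_marker("Il était une fois. FIN Il était une fois."))
```

`cut_at_marker` would also cut at an uppercase “FIN” inside a longer word, so a more distinctive marker might be safer.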

The dataset consisted of only 20 prompt / edited-completion pairs, so I guess that could explain the somewhat disappointing result.

What are your thoughts?

  • Bigger dataset: how big are we talking? 100 stories? 1000?
  • Fine-tune job: could you suggest some parameters to play with to improve the results?

Edit: one last question: could I “save” on prompt tokens by just providing the triplet (location, purpose, companion) as structured data and still get the same completion?
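
What I have in mind is something like the sketch below: the three fields joined with a delimiter and followed by a fixed separator. The “ | ” delimiter, the separator, and the name `compact_prompt` are arbitrary choices for illustration. My assumption is that this would only work if the fine-tuning examples used prompts of exactly the same shape.

```python
def compact_prompt(location: str, purpose: str, companion: str,
                   separator: str = "\n\n###\n\n") -> str:
    """Render the form triplet as a short structured prompt."""
    return " | ".join([location, purpose, companion]) + separator


prompt = compact_prompt("the beach", "a picnic", "a friendly crab")
print(repr(prompt))
```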

Anyway, thank you so much to the whole OpenAI community, this has been an incredibly fun and fruitful journey so far!! :heart_eyes: