GPT-3 Fine-Tuning Data

I’m using short stories I wrote to bias the “voice” of what GPT-3 generates. I’ve tried formatting the data in multiple ways, but before I spend another chunk of money, maybe someone who has already been through this has tips to share.

So far I’ve tried a few strategies.
All of them use the general empty-prompt format:
{"prompt": "", "completion": ""}

  1. I’ve tried taking several paragraphs (dialogue and all), replacing every newline with \n, and using each block as a single training example. Each example was roughly a page of text, so the number of samples was rather low, ~100-200.

  2. Using each paragraph as a separate sample, which brought the number of samples to ~1000. (So far this has actually been the most successful; a rough sketch of it is below the list.)

  3. Separating dialogue and exposition into two training sets that I switch between depending on what part I want to write. For some reason this didn’t work as expected.

  4. I haven’t yet tried using each sentence as an individual training sample.

  5. Does randomizing the order of training samples make sense?
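For concreteness, here’s roughly what strategy 2 looks like in code. The file name, the shuffle, and the leading space on each completion are my own placeholders/guesses, not anything official:

```python
# Sketch of strategy 2: one paragraph per training sample, empty prompts.
# "my_stories.txt" is a placeholder for wherever the stories live.
import json
import random

with open("my_stories.txt", encoding="utf-8") as f:
    text = f.read()

# Split on blank lines so each paragraph becomes one sample.
paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

random.shuffle(paragraphs)  # re: point 5, shuffling is cheap, so I do it just in case

with open("train.jsonl", "w", encoding="utf-8") as out:
    for p in paragraphs:
        # json.dumps handles the \n escaping mentioned in strategy 1.
        out.write(json.dumps({"prompt": "", "completion": " " + p}) + "\n")
```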

Would appreciate any input.

2 Likes

@daveshapautomator should be able to help. The excellent videos on his channel over at YouTube are instructive. Search for “David Shapiro”

3 Likes

Achieving coherent fiction as we imagine it is perhaps one of the most elusive tasks yet. I would recommend following the documented fine-tuning example where you leave the prompt empty and put the whole story (or a big chunk of it) as the completion. This requires more samples, but it can fine-tune the model on your style and tone.
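As a rough sketch (the chunk size and file names are arbitrary, just to show the shape of the data), that empty-prompt format looks something like this:

```python
# Whole stories, or big consecutive chunks of them, as completions with empty prompts.
import json

def chunk_paragraphs(paragraphs, max_chars=4000):
    """Greedily pack consecutive paragraphs into chunks of roughly max_chars."""
    chunks, current, size = [], [], 0
    for p in paragraphs:
        if current and size + len(p) > max_chars:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(p)
        size += len(p)
    if current:
        chunks.append("\n\n".join(current))
    return chunks

with open("story.txt", encoding="utf-8") as f:
    paragraphs = [p.strip() for p in f.read().split("\n\n") if p.strip()]

with open("style_train.jsonl", "w", encoding="utf-8") as out:
    for chunk in chunk_paragraphs(paragraphs):
        out.write(json.dumps({"prompt": "", "completion": " " + chunk}) + "\n")
```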

Alternatively, you might try having one paragraph as the prompt and the subsequent paragraph as the completion. Apparently AI Dungeon uses a “lore book” mechanism so that the major details of the story can be referenced at all times, for each completion. With this, you might include a lore-book section in the prompt along with the previous paragraph(s), and then have the next paragraph as the completion.
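A minimal sketch of that pairing, with a placeholder lore section; the separator, the end-of-completion marker, and the paragraph window are illustrative assumptions, not tested values:

```python
# Previous paragraph(s) plus a "lore book" header as the prompt, next paragraph as the completion.
import json

lore = "LORE: <the major characters, places, and facts the model should always know>"
window = 2  # how many previous paragraphs to include in each prompt

with open("story.txt", encoding="utf-8") as f:
    paragraphs = [p.strip() for p in f.read().split("\n\n") if p.strip()]

with open("continuation_train.jsonl", "w", encoding="utf-8") as out:
    for i in range(window, len(paragraphs)):
        prompt = lore + "\n\n" + "\n\n".join(paragraphs[i - window:i]) + "\n\n###\n\n"
        completion = " " + paragraphs[i] + " END"  # " END" can serve as a stop sequence at inference time
        out.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")
```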

Hmm… This gives me some ideas. I might resurrect my AutoMuse project and create some training data based on fiction from Gutenberg. This could be fascinating. Stand by, this will take a few days but I’ll post the code and video within a week.

7 Likes

There you have it; your solution (possibly) and another fantastic video for us to watch. :slightly_smiling_face:

j/k @daveshapautomator, I find your new channel quite good and have it configured to send notifications as soon as you post…

3 Likes

Thank you David! And thank you @vaibhav.garg for making me aware of David’s YouTube channel! So much fascinating information there.

2 Likes

OMG, David. This is just beyond what I expected. This clearly answers my question about how to break up the data, but more importantly it completely changed my thinking about how to go about writing with GPT-3. I’m really grateful for this video!

One question I have: how exactly does fine-tuning change the model? Does it even make sense to train the model with empty prompts? (I was planning this mainly in order to retain “writing style”.) I hadn’t even thought in the direction of using it to write a full story; I mainly want help when I get stuck, while retaining style.

1 Like

Fine-tuning is basically transfer learning, as far as I know, so if you want to know how fine-tuning works under the hood, you might like to look up transfer learning. I would post more, but I’m on mobile; the sketch below gives the rough idea.
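This is an open-source analogy only (GPT-2 with Hugging Face), not OpenAI’s actual pipeline, which isn’t public; it just shows what “continuing training on your own text” means mechanically:

```python
# Fine-tuning as transfer learning, in miniature: start from pretrained weights and
# keep doing gradient descent on your own text with a small learning rate.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")        # weights already trained on web text
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # small lr: nudge the style, keep the knowledge

text = open("my_stories.txt", encoding="utf-8").read()      # placeholder file
enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

model.train()
for step in range(3):  # real runs iterate over many batches; this just shows the loop
    outputs = model(**enc, labels=enc["input_ids"])    # labels = inputs gives the language-model loss
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```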

2 Likes

Yes, it makes sense to fine-tune with empty prompts for voice. Very effective.

1 Like

I’ve been doing a lot of work on this too. I treat GPT-3 as a writing partner, and what you’d never do with a writing partner is type “once upon a time” and then publish the first thing that comes out of their head.

Instead, separating out the various areas of fiction writing (story, plot, atmosphere, character, voice, etc.), working on them independently, then bringing them back together (with your own imaginative work) and working through paragraph by paragraph, idea by idea, sentence by sentence is a much more useful way to think about GPT-3.

I’m not expecting writing with AI to be any easier than writing without it. What I’m looking for is quality: ideas that neither I nor the machine could have come up with on our own.

As for training, I’ve found (so far) that text-davinci-002 is so far ahead of davinci that it’s better at working with voice from a standing start than anything I can train on raw davinci. Has anyone had a different experience?

2 Likes

This is excellent stuff.
For when you come to do part 2, here are a couple of suggestions:

You probably don’t need to prompt it with the whole detailed “story so far”. Most stories concentrate on what’s happening right now and only bring in details from three chapters ago if they’re vital, so why not give it only the last (say) 5-10 chunk summaries? The prompt already has the basic story.

If you wanted to go deeper, you could list the characters appearing in each chunk and then give the prompt a collection of chunk summaries based on the characters in the scene (so it would have detail on the things those characters had just done, but not so much on the scenes they weren’t in). Might be a little ambitious!
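Something like this is what I mean by both suggestions; the data shapes and numbers are just illustrative:

```python
# Assemble a prompt from the premise, the most recent chunk summaries, and a few older
# summaries that feature the characters currently in the scene.
def build_prompt(premise, summaries, scene_characters, recent_n=5):
    """summaries: list of dicts like {"text": "...", "characters": ["Anna", "Ben"]}."""
    recent = summaries[-recent_n:]                        # what's happening right now
    older = summaries[:-recent_n]
    relevant = [s for s in older if set(s["characters"]) & set(scene_characters)]
    parts = [f"Premise: {premise}", "Story so far:"]
    parts += [s["text"] for s in relevant[-3:] + recent]  # cap the older material
    return "\n".join(parts) + "\nContinue the scene:\n"
```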

When it comes to making story premise/scene breakdowns, yours come out great. What I’ve done in addition is get GPT-3 to pull out a cast of characters from my premise and list them along with character conflicts/flaws/contradictions.

This gives you a little more depth and opportunities for subplots, which in turn gives davinci a little more to work with when doing “save the cat”, “three-act structure”, and “hero’s journey” breakdowns and scene lists.

By character contradictions, I’m talking about internal dilemmas. So a story about a slave trader might generate a main villain who’s putting all their money into animal charities, or a hero with a drug addiction. Doing this just gives GPT-3 something extra to play with for colour and detail.
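If it helps, here’s roughly how I’d ask for that cast list with the Completion API that’s current as of this thread; the prompt wording and parameters are only my own choices, nothing official:

```python
# Ask GPT-3 to pull a cast of characters with goals, flaws, and internal contradictions
# out of a premise. Assumes OPENAI_API_KEY is set in the environment.
import openai

premise = open("premise.txt", encoding="utf-8").read()  # placeholder file

prompt = (
    f"Premise:\n{premise}\n\n"
    "List the main characters. For each, give one goal, one flaw, and one internal "
    "contradiction.\n\nCharacters:\n"
)

response = openai.Completion.create(
    engine="text-davinci-002",
    prompt=prompt,
    max_tokens=400,
    temperature=0.8,
)
print(response["choices"][0]["text"])
```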

1 Like

@daveshapautomator
Patrick Rothfuss (author of The Kingkiller Chronicle) offered to release the third book of the trilogy as a series of bulleted lists.

Maybe someone here can put him in contact with you to generate the third book.

2 Likes

Yes please! This would be an amazing project.

I’ve got the first two books in txt format. Using code very similar to yours, I couldn’t get anything useful out of the cheaper base models, and training against davinci would have been too expensive for me. If you’d like those files, I’d be glad to share them with you.

1 Like

That’s a good start but I think we should get the author’s permission first.

1 Like

I’ve done this on various books from the Bible, all of a journalist friend’s articles, the Big Book of AA, and Moby Dick. I’ve done it line by line so far, both using one line as the prompt and the next as the completion, and by leaving prompts blank. Both seem to work well. Glad I found this thread, as I’m intensely interested in this subject. Following!

Great video, thanks for showing your process and code!

Would you mind sharing whether you have special permission from OpenAI to make YouTube videos that share GPT-3 output?

The video is private; I wish I could watch it.