Fine tune model with empty prompts

I have thousands of Word documents at my disposal and I would like to train GPT on them in order to create a fine-tuned model.
I was planning to extract the text from these files so that GPT can offer completions in the “style” of those documents.
I do not have a simple way to associate or automate prompts for the extracted text chunks, so I was thinking of feeding the model paragraphs extracted from the text with an empty prompt.
I lack experience in training AI models, so I was wondering if this makes any sense at all :slight_smile:
Any help will be appreciated!!

1 Like

Welcome to the community.

Yeah, that’s the most recommended way - empty prompt field and 1k to 1.5k tokens in the completion field…
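For example, a minimal sketch of what each training record could look like in the prompt/completion JSONL format (the record text and the `train.jsonl` file name are just placeholders):

```python
import json

# Sketch of the prompt/completion fine-tuning format: the prompt
# field is left empty and the extracted paragraph goes in the
# completion field. These records are placeholder examples.
records = [
    {"prompt": "", "completion": " First extracted paragraph."},
    {"prompt": "", "completion": " Second extracted paragraph."},
]

# One JSON object per line (JSONL).
with open("train.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```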

Hope this helps.

Thanks for your fast response!!
What are your thoughts on the strategy for getting the chunks of text for training?
One simple strategy would be to accumulate paragraphs until they reach a limit of around 800 words (from what I have read, that should fall within the token range you propose). Another would be to cluster the text by headings in the document so the content is “more coherent”… but I do not know whether that would actually help GPT “understand the text better”, or whether this is just a misconception about how transformers are trained.
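To make the first strategy concrete, a rough sketch of greedily grouping consecutive paragraphs up to a word budget (the function name and `max_words=800` default are just my own placeholders, and word count is only a crude stand-in for tokens):

```python
def chunk_paragraphs(paragraphs, max_words=800):
    """Greedily group consecutive paragraphs into chunks of at most
    max_words words each (a single oversized paragraph still becomes
    its own chunk)."""
    chunks, current, count = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Flush the current chunk if adding this paragraph would
        # exceed the word budget.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```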
As you can see I am a newbie in the field :slight_smile:

1 Like

I’m by no means an expert either, so no worries.

Cleaning the dataset is a difficult but important step of the process. What you might do is automate it as much as you can (strip 1,000 tokens at a time and write them to the JSONL…) and then manually look over the dataset once it’s done.
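A rough sketch of that automation, assuming a crude word-based token estimate (a real pipeline would use the actual tokenizer) and a hypothetical `write_dataset` helper that sets anything over budget aside for manual review:

```python
import json

def rough_token_count(text):
    # Very rough heuristic: for English text there are roughly
    # 0.75 words per token, so tokens ~ words / 0.75. Placeholder
    # for a real tokenizer-based count.
    return int(len(text.split()) / 0.75)

def write_dataset(chunks, path, max_tokens=1000):
    """Write chunks that fit the token budget to a JSONL file with
    empty prompts; return oversized chunks for manual review."""
    too_long = []
    with open(path, "w", encoding="utf-8") as f:
        for chunk in chunks:
            if rough_token_count(chunk) > max_tokens:
                too_long.append(chunk)
                continue
            record = {"prompt": "", "completion": " " + chunk}
            f.write(json.dumps(record) + "\n")
    return too_long
```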