Fine-tune model with empty prompts

I have at my disposal thousands of Word documents and I would like to train GPT with them in order to create a fine-tuned model.
I was planning to extract the text of these files so that GPT can offer completions in the “style” of those documents.
I do not have a simple way to associate or automate prompts for the extracted text chunks, so I was thinking of feeding the model paragraphs extracted from the text with an empty prompt.
I lack experience in training AI models, so I was wondering if this makes any sense at all :slight_smile:
Any help will be appreciated!!


Welcome to the community.

Yeah, that’s the most recommended way - empty prompt field and 1k to 1.5k tokens in the completion field…
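For reference, each line of the training file would be a prompt/completion pair with the prompt left empty, something like the sample below. (The leading space in the completion follows the usual data-preparation guidance, but it’s worth double-checking against the current docs.)

```
{"prompt": "", "completion": " One paragraph or section of text taken from your documents..."}
```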

Hope this helps.

Thanks for your fast response!!
What are your thoughts on the best strategy for getting the chunks of text for training?
One simple strategy would be to gather paragraphs until they reach an 800-word limit or so (from what I have read, that should land roughly in the token range you propose). Another would be to cluster the text by headings in the document so the content is “more coherent”, but I do not know whether that would actually help GPT “understand the text better” or whether that is just a misconception about how transformers are trained.
As you can see, I am a newbie in the field :slight_smile:


I’m by no means an expert either, so no worries.

Cleaning the dataset is a difficult but important step of the process. What you might do is automate it as much as you can (strip roughly 1,000 tokens at a time and write each chunk to the JSONL file…) and then manually look over the dataset once it’s done.
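Here’s a rough, untested sketch of what that automation could look like, assuming you’ve already extracted the Word documents to plain text and want chunks of roughly 800 words split at paragraph boundaries. The file names and the word limit are just placeholders.

```python
import json

MAX_WORDS = 800  # roughly in the 1k-1.5k token range mentioned above

def chunk_paragraphs(text, max_words=MAX_WORDS):
    """Group consecutive paragraphs until the word budget is reached."""
    chunks, current, count = [], [], 0
    for para in text.split("\n\n"):
        words = len(para.split())
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# Placeholder input: one big plain-text dump of the extracted documents
with open("extracted_text.txt", encoding="utf-8") as f:
    text = f.read()

with open("training_data.jsonl", "w", encoding="utf-8") as out:
    for chunk in chunk_paragraphs(text):
        # Empty prompt; completion starts with a space per the usual data-prep convention
        out.write(json.dumps({"prompt": "", "completion": " " + chunk}) + "\n")
```

You could swap the paragraph grouping for a split on document headings if you want the more “coherent” chunks you mentioned; the JSONL writing part stays the same either way.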
