How do I continue pretraining using the finetune API?

I wish to continue language modeling (the next-word prediction task) on my own corpus. All the examples I see in the fine-tune API show how to perform tasks such as classification, summarization, etc.

There, you need to format the data as pairs of prompts and completions.

In my case, it would just be domain adaptation to the unannotated, domain-specific corpus I have.

Would one simply leave the prompt field as an empty string and put the sentences/paragraphs in the completion field? I did try this approach, and the fine-tuned model became VERY repetitive and seemed to overfit to my 1000 unlabeled text documents.
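For reference, the empty-prompt approach I tried looks roughly like this when building the training file. This is just a sketch of the JSONL formatting, not an official recipe; the leading space on the completion follows the usual prompt/completion formatting advice:

```python
import json

def to_finetune_jsonl(documents):
    """Format unlabeled documents as prompt/completion JSONL lines,
    with an empty prompt and the raw text as the completion."""
    lines = []
    for doc in documents:
        record = {"prompt": "", "completion": " " + doc.strip()}
        lines.append(json.dumps(record))
    return "\n".join(lines)

docs = ["First unlabeled paragraph.", "Second unlabeled paragraph."]
print(to_finetune_jsonl(docs))
```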

(I am also wondering how the fine-tuning is done behind the scenes. Is it actually changing the weights of GPT-3, or is it doing p-tuning on top of the frozen GPT-3 weights by adding an extra linear layer?)


It sounds like you have a large corpus and want GPT-3 to produce responses based on the information in that corpus. I don’t think it is necessary to fine-tune the model.
You can break the text of your corpus into paragraphs and convert them into embeddings.
For any given question, you can use semantic search to find the most relevant embeddings, use that text as the context in your prompt, and ask GPT-3 your question.
That is probably the most efficient and cost-effective solution.
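A minimal sketch of that retrieval idea, assuming you have some way to embed text (the toy `embed` function here is purely illustrative; in practice you would call an embeddings API and store the vectors):

```python
import math

def embed(text):
    # Toy bag-of-letters "embedding" as a stand-in for a real
    # embeddings API call; only for illustration.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    # Standard cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def most_relevant(question, paragraphs):
    # Semantic search: return the paragraph whose embedding is
    # closest to the question's embedding.
    q = embed(question)
    return max(paragraphs, key=lambda p: cosine(q, embed(p)))

corpus = ["Invoices are due within 30 days.", "Refunds require a receipt."]
context = most_relevant("When are invoices due?", corpus)
prompt = f"Context: {context}\n\nQuestion: When are invoices due?\nAnswer:"
```

The final `prompt` string is what you would send to the completion endpoint, so the model answers from the retrieved context rather than from fine-tuned weights.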

From Superinsight

I believe breaking each chunk of text into two equal parts, assigning the first half to the prompt and the second half to the completion, should do what you’re trying to achieve.
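That split could be sketched like this, dividing each chunk roughly in half at a word boundary (the leading space on the completion again follows the usual formatting advice):

```python
def split_chunk(text):
    """Split a text chunk in half at a word boundary, returning a
    prompt/completion record for fine-tuning."""
    words = text.split()
    mid = len(words) // 2
    prompt = " ".join(words[:mid])
    completion = " " + " ".join(words[mid:])
    return {"prompt": prompt, "completion": completion}

example = split_chunk("the quick brown fox jumps over the lazy dog again")
# 10 words: the first 5 become the prompt, the last 5 the completion
```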