I wish to continue language modeling (the next-word prediction task) on my own corpus. All the examples I see for the fine-tune API show how to perform tasks like classification, summarization, etc.
There you need to format the data as pairs of prompts and completions.
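For reference, this is the JSONL format I mean (the prompt/completion field names are from the fine-tuning docs; the example text itself is just made up):

```
{"prompt": "Sentiment of: I loved this movie ->", "completion": " positive"}
{"prompt": "Sentiment of: terrible customer service ->", "completion": " negative"}
```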
In my case, it would simply be domain adaptation to the unannotated, domain-specific corpus I have.
Would one simply leave the prompt field as an empty string and put the sentences / paragraphs in the completion field? I did try this approach, and the fine-tuned model became VERY repetitive and seemed to overfit to my 1000 unlabeled text documents.
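For concreteness, this is roughly how I prepared the training file (the paths and file names are just illustrative, and I did no chunking beyond one document per example):

```python
import json
from pathlib import Path

# Each unlabeled document becomes one training example with an empty prompt,
# so the model is effectively asked to generate the document from nothing.
with open("domain_corpus.jsonl", "w", encoding="utf-8") as out:
    for doc_path in Path("my_corpus").glob("*.txt"):  # ~1000 plain-text documents
        text = doc_path.read_text(encoding="utf-8").strip()
        if not text:
            continue
        # Leading space on the completion, as the data-preparation guidelines suggest.
        out.write(json.dumps({"prompt": "", "completion": " " + text}) + "\n")
```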
(I am also wondering how the fine-tuning is done behind the scenes. Does it actually change the weights of GPT-3, or does it do p-tuning on top of the frozen GPT-3 weights by adding an extra linear layer?)
Thanks!