How do I continue pretraining using the finetune API?

I wish to continue the language modeling (the next-word prediction task) on my own corpus. All the examples I see for the fine-tune API show how to perform tasks such as classification, summarization, etc.

There, you need to format the data as pairs of prompts and completions.

In my case, it would just be domain adaptation to the unannotated, domain-specific corpus I have.

Would one simply leave the prompt field as an empty string and put the sentences / paragraphs in the completion field? I did try this approach, and the fine-tuned model became VERY repetitive and seemed to overfit to my 1000 unlabeled text documents.
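For reference, this is roughly what that empty-prompt approach looks like as a JSONL training file in the prompt/completion fine-tune format. The corpus directory, output filename, and the trailing stop sequence are illustrative assumptions, not anything the API requires:

```python
import json
from pathlib import Path

CORPUS_DIR = Path("corpus/")   # hypothetical folder of plain-text documents
OUTPUT_FILE = "train.jsonl"

with open(OUTPUT_FILE, "w", encoding="utf-8") as out:
    for doc_path in sorted(CORPUS_DIR.glob("*.txt")):
        text = doc_path.read_text(encoding="utf-8").strip()
        if not text:
            continue
        # Empty prompt; the whole document goes into the completion.
        # Starting the completion with a space and ending with a fixed stop
        # sequence follows the usual prompt/completion formatting advice.
        record = {"prompt": "", "completion": " " + text + "\n###\n"}
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
```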

(I am also wondering how the fine-tuning is done behind the scenes. Is it actually changing the weights of GPT-3, or is it doing p-tuning on top of the frozen GPT-3 weights by adding an extra linear layer?)

Thanks!

Sounds like you have a large corpus and you want GPT-3 to answer using information from that corpus. I don't think it is necessary to fine-tune the model.
You can break the text of your corpus into paragraphs and convert them into embeddings.
For any given question, you can use semantic search to find the most relevant paragraphs, use that text as the context in your prompt, and ask GPT-3 your question.
That is probably the most efficient and cost-effective solution.
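A rough sketch of that retrieval flow, assuming the pre-1.0 openai Python library and illustrative model names (swap in whichever embedding and completion models you actually use; it also assumes your API key is set via the OPENAI_API_KEY environment variable):

```python
import numpy as np
import openai  # pre-1.0 openai-python interface assumed

EMBED_MODEL = "text-embedding-ada-002"   # illustrative model choice
COMPLETION_MODEL = "text-davinci-003"    # illustrative model choice

def embed(texts):
    """Embed a list of paragraphs, returning one vector per paragraph."""
    response = openai.Embedding.create(model=EMBED_MODEL, input=texts)
    return np.array([item["embedding"] for item in response["data"]])

def answer(question, paragraphs, paragraph_vectors, top_k=3):
    """Retrieve the most similar paragraphs and ask the model with them as context."""
    q_vec = embed([question])[0]
    # Cosine similarity between the question and every paragraph.
    sims = paragraph_vectors @ q_vec / (
        np.linalg.norm(paragraph_vectors, axis=1) * np.linalg.norm(q_vec)
    )
    context = "\n\n".join(paragraphs[i] for i in np.argsort(sims)[::-1][:top_k])
    prompt = (
        "Answer the question using the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    completion = openai.Completion.create(
        model=COMPLETION_MODEL, prompt=prompt, max_tokens=256, temperature=0
    )
    return completion["choices"][0]["text"].strip()

# paragraphs = [...]                    # your corpus, split into paragraphs
# vectors = embed(paragraphs)           # precompute once and cache
# print(answer("Your question here", paragraphs, vectors))
```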

Nelson
From Superinsight

I believe breaking each chunk of text into two equal parts and assigning the first half to the prompt and the second half to the completion should do what you're trying to achieve.
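If you want to try that, a minimal sketch of preparing such a training file might look like this (the corpus directory, output filename, and the word-level midpoint split are illustrative assumptions):

```python
import json
from pathlib import Path

CORPUS_DIR = Path("corpus/")   # hypothetical folder of plain-text documents
OUTPUT_FILE = "train_split.jsonl"

with open(OUTPUT_FILE, "w", encoding="utf-8") as out:
    for doc_path in sorted(CORPUS_DIR.glob("*.txt")):
        words = doc_path.read_text(encoding="utf-8").split()
        if len(words) < 2:
            continue
        # Split each document roughly in half: the first half becomes the
        # prompt, the second half becomes the completion.
        mid = len(words) // 2
        record = {
            "prompt": " ".join(words[:mid]),
            "completion": " " + " ".join(words[mid:]),
        }
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
```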