Fine-tuning completion

https://beta.openai.com/docs/guides/fine-tuning

Before posting, I did read the guide above.
My goal is to present about two pages of data with roughly a two-paragraph summary. I have a collection of these, and I want to train a completion model to read the input data and summarize it in my example format. Do I need to make a fine-tuning set, then create a Davinci instance trained on that data, then run a completion against that specifically trained Davinci? Am I going down the right path?

It seems that the dataset I am using is pretty large, so from the docs above, if I understand correctly, I add the raw data as the prompt and my summarized version of the data as the completion, correct?
Does anyone know the limits on these prompts?

I'm hoping the fine-tuning can also make Davinci respond in a particular way/format.

Thank you for any help!

Update: I was able to make a little progress in getting my fine-tuned AI generated. I am reviewing the docs, and my API call does not play nice when targeting my newly trained AI in the model param of the completion call. I think I may have uploaded my training file (about 400 KB) a few times, because I got charged $9 with Davinci. I may have done it only two or three times, so I'll have to watch how often I click that button :wink:

Update 2: I see two fine-tunes now, and each one cost 330,104 trained tokens ($9.90).

Is there a way to calculate tokens before executing it? Just to ballpark the cost?
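One rough way to ballpark it (an approximation, not an official counter; an exact count would mean running the GPT-2 tokenizer that the base models use) is the common "~4 characters per token" rule of thumb, combined with the ~$0.03 per 1k trained tokens implied by the $9.90 / 330k figure above:

```python
# Ballpark token count and training cost before uploading a fine-tuning file.
# The 4-chars-per-token heuristic and the $0.03/1k price are rough
# assumptions derived from the figures in this thread, not official numbers.

def estimate_tokens(text: str) -> int:
    """Crude token estimate for English text: ~4 characters per token."""
    return max(1, len(text) // 4)

def estimate_training_cost(texts, price_per_1k=0.03, n_epochs=4):
    """Ballpark training cost in dollars.

    The bill counts *trained* tokens, which includes every epoch,
    so raw tokens are multiplied by n_epochs (the default was 4).
    """
    raw = sum(estimate_tokens(t) for t in texts)
    return raw * n_epochs * price_per_1k / 1000

est = estimate_training_cost(["x" * 4000] * 100)  # 100 one-page docs
print(f"~${est:.2f}")  # prints ~$12.00
```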


For summarization you don’t need DAVINCI. Start with CURIE. Your training samples should look something like this:

{'prompt': '<<two pages here>> \n\nSummary:',
'completion': ' <<two paragraph summary>>'}

I’ve started experimenting with putting short instructions at the beginning just for clarity. I don’t know if it has much impact but it doesn’t seem to hurt.

{'prompt': 'Summarize the following: \n\n <<two pages here>> \n\nSummary:',
'completion': ' <<two paragraph summary>>'}

With summarization, it really helps if you add some adjectives around the summary as well. Words like:

  1. Concise
  2. Detailed
  3. Thorough
  4. Brief
  5. Etc
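Putting those pieces together, here's a minimal sketch of writing a training file in that shape (the file name and exact prompt wording are placeholders, so adjust to taste):

```python
# Build a JSONL fine-tuning file in the prompt/completion shape shown above.
# "train.jsonl" and the prompt template are placeholders, not anything official.
import json

def make_sample(document: str, summary: str) -> dict:
    return {
        # An adjective ("concise") plus a fixed "Summary:" separator at the end.
        "prompt": f"Write a concise summary of the following:\n\n{document}\n\nSummary:",
        # Completions start with a leading space, matching the examples above.
        "completion": " " + summary,
    }

pairs = [
    ("<<two pages here>>", "<<two paragraph summary>>"),
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for doc, summ in pairs:
        f.write(json.dumps(make_sample(doc, summ)) + "\n")
```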

That is very interesting. I'll get some sample data together and post here. Thank you for the tip!

So I created about 13 samples and fine-tuned a Curie. I know 13 is a small amount, but I replayed one of the training files in the playground with the newly trained Curie and the format wasn't correct. If I need hundreds if not thousands of examples, then I'll have to code a web spider to format this data out, as it will be pretty rough doing it by hand =\

I make great use of synthetic data (creating samples with GPT3).
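For context, a sketch of what that can look like: ask a large model to summarize each raw document, then save the pairs as new training data. The engine name, prompt wording, and file name are all assumptions here, and the output needs human review before you train on it.

```python
# Sketch: generate synthetic training pairs by asking a big model to
# summarize raw documents, then (after review) fine-tune on the result.
# Engine name, prompt wording, and file names are placeholders.
import json

def build_generation_prompt(document: str) -> str:
    """Prompt used to ask the large model for a synthetic summary."""
    return f"Write a concise two-paragraph summary of the following:\n\n{document}\n\nSummary:"

def to_training_sample(document: str, summary: str) -> dict:
    """Pack a (document, synthetic summary) pair into fine-tune format."""
    return {
        "prompt": f"{document}\n\nSummary:",
        "completion": " " + summary.strip(),
    }

def generate_synthetic_pairs(docs, out_path="synthetic.jsonl"):
    """Call the Completion API for each doc and write training pairs.

    Needs the openai package and an API key, so it is defined here
    but not invoked.
    """
    import openai
    with open(out_path, "w", encoding="utf-8") as f:
        for doc in docs:
            resp = openai.Completion.create(
                engine="davinci",
                prompt=build_generation_prompt(doc),
                max_tokens=300,
                temperature=0.7,
            )
            summary = resp["choices"][0]["text"]
            f.write(json.dumps(to_training_sample(doc, summary)) + "\n")
```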

Got it, so I take it you review and tweak the data, then retrain on it, correct?
To confirm: is every retraining done from fresh data, or is there an append type of training?

Also, do I need an endoftext tag anywhere in my generated JSONL file? I'm using the prompt/completion JSON key pair at the moment without them.

Here’s my latest big synthetic data project. This might give you more information than I can convey in a forum post :slight_smile:


Thanks. The last couple of days I've been on YouTube falling asleep to GPT-3 fine-tuning searches, and I did not see this at all in the results I got. I'll give this a go! The videos on YouTube are helpful, but they mostly stick to Linux/Python. I am writing my own .NET library to work with OpenAI (for the experience), so getting info at that level is pretty rough when it comes to video sources, so I'll take any help I can get.

Currently, I am spidering the dataset I'm going to use to train Curie, and I'm pushing around 700 examples at the moment. I'm trying to find a way to generate more. Anyway, I'll review this content and come back with other questions I have.


@Agent are you using your fine-tuned model to create the embeddings too?