Fine-tuning dataset using Python

I am having trouble converting my prompt-completion dataset into a JSONL file. I keep the dataset in a text file and then transform it into a JSONL file.

My dataset is below:
{"prompt":"Property Usage=Principal Residence, Transaction Type=Purchase, Number of Units=1 Unit", "completion":"The Maximum LTV,CLTV,HCLTV for FRM is 95% and ARM is 95% for property usage Principal residence, transaction type purchase and number of units is one"}
{"prompt":"Property Usage=Principal Residence, Transaction Type=Purchase, Number of Units=2 Unit", "completion":"The Maximum LTV,CLTV,HCLTV for FRM is 85% and ARM is 85% for property usage Principal residence, transaction type purchase and number of units is one"}

After conversion to JSONL it looks like this:
{“prompt”:“”,“completion”:“{"prompt":"Property Usage=Principal Residence, Transaction Type=Purchase, Number of Units=1 Unit", "completion":"The Maximum LTV,CLTV,HCLTV for FRM is 95% and ARM is 95% for property usage Principal residence, transaction type purchase and number of units is one"}”}
{“prompt”:“”,“completion”:“{"prompt":"Property Usage=Principal Residence, Transaction Type=Purchase, Number of Units=2 Unit", "completion":"The Maximum LTV,CLTV,HCLTV for FRM is 85% and ARM is 85% for property usage Principal residence, transaction type purchase and number of units is one"}”}

After conversion, the prompt is empty and the whole original line ends up inside the completion, which is not the format the fine-tuning model expects. Could you please help me do the conversion properly?

Hey @ezhilv17, first of all you need to put a whitespace in front of the first character of your completion. You should also put a fixed separator at the end of your prompt, marking where the prompt ends and the completion begins, and an "end of the sequence" identifier at the end of your completion.
Applying those changes, a single prompt-completion pair of your training data would look as follows:
{"prompt":"Property Usage=Principal Residence, Transaction Type=Purchase, Number of Units=1 Unit ++++", "completion":" The Maximum LTV,CLTV,HCLTV for FRM is 95% and ARM is 95% for property usage Principal residence, transaction type purchase and number of units is one ####"}

Note that in this case "++++" is the separator, while "####" works as the "end of the sequence" identifier. Both have to sit inside the quoted strings, otherwise the line is not valid JSON.
Also note that every line of your training data has to follow the format of the example above: one JSON object per line with a "prompt" and a "completion" key. (It might even be accepted without the separator and the stop symbol, but the model generally performs better when you use them.)
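As a rough sketch of the formatting described above, a small Python helper could add the separator, the leading whitespace and the stop marker to each pair. The function name `format_pair` and the constants are only illustrative, the markers are the ones used in the example above, and the completion text is shortened here for readability:

```python
import json

SEPARATOR = " ++++"  # assumed separator, taken from the example above
STOP = " ####"       # assumed end-of-sequence marker, taken from the example above


def format_pair(prompt: str, completion: str) -> dict:
    """Build one training record (hypothetical helper, not part of any library)."""
    return {
        # The separator goes at the end of the prompt string.
        "prompt": prompt.strip() + SEPARATOR,
        # The completion starts with one space and ends with the stop marker.
        "completion": " " + completion.strip() + STOP,
    }


record = format_pair(
    "Property Usage=Principal Residence, Transaction Type=Purchase, Number of Units=1 Unit",
    "The Maximum LTV,CLTV,HCLTV for FRM is 95% and ARM is 95%",
)

# json.dumps takes care of quoting, so the result is one valid JSONL line.
print(json.dumps(record))
```

Whatever markers you pick, they should be strings that never occur inside your actual prompts or completions.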
Solutions to the problem of not being able to produce correct JSONL training data have already been given in other threads. (Unfortunately I can't post a link to another thread here, but you will find them when searching through this community.)
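In the meantime, here is a minimal sketch of one way to do the conversion yourself in Python. It assumes that every non-empty line of the text file is already a JSON object with a "prompt" and a "completion" key, as in the question. The idea is to parse each line with `json.loads` instead of treating it as a plain string, which seems to be what produced the empty prompt and the nested completion. The function names and the file names `dataset.txt` and `dataset.jsonl` are placeholders:

```python
import json


def text_lines_to_jsonl(lines):
    """Parse each non-empty line as JSON and return the records as JSONL text."""
    out = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumption: curly quotes may have crept in from an editor or a forum
        # post. This simple replacement breaks if the text itself needs quotes.
        line = line.replace("\u201c", '"').replace("\u201d", '"')
        record = json.loads(line)  # raises ValueError if the line is malformed
        if set(record) != {"prompt", "completion"}:
            raise ValueError(f"line {number}: expected 'prompt' and 'completion' keys")
        out.append(json.dumps(record, ensure_ascii=False))
    return "".join(item + "\n" for item in out)


def convert_file(src_path, dst_path):
    """Read the text file and write the JSONL file (paths are placeholders)."""
    with open(src_path, encoding="utf-8") as src:
        jsonl_text = text_lines_to_jsonl(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(jsonl_text)


# In-memory demonstration with two short, made-up records.
sample = [
    '{"prompt": "Number of Units=1 Unit ++++", "completion": " 95% ####"}',
    "",
    '{"prompt": "Number of Units=2 Unit ++++", "completion": " 85% ####"}',
]
print(text_lines_to_jsonl(sample), end="")
```

For a real file you would call something like `convert_file("dataset.txt", "dataset.jsonl")` and then check a few lines of the output by eye before uploading it.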