Hi! I have prepared my dataset (JSONL) to fine-tune gpt-3.5-turbo-0613. My dataset satisfies all the criteria posted on the OpenAI website and passes every validation step in the Chat_finetuning_data_prep.ipynb notebook on GitHub.
However, I get this error:

```
error: {
  message: 'invalid training_file: FineTuneSet',
  type: 'invalid_request_error',
  param: 'training_file',
  code: null
},
```
I don't understand what could be wrong with my JSONL dataset. Here, for example, is its first line:
```
{"messages":[{"role":"system","content":"TEXT.\n TEXT.\n TEXT."},{"role":"user","content":""""TEXT🚀.\r\nTEXT.\r\nTEXT""" \n\n###\n\n"},{"role":"assistant","content":"TEXT"}]}
```
Does anyone know what is going on, and what a working dataset must look like?
It looks like you've got a mish-mash of old and new instructions there. Stop sequences are not required for chat models, and training on them would actually cause big problems: the user would have to type ### after every prompt to match the training data.
I would send only line feeds (\n), never \r. Carriage returns are not normal keyboard input.
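For example, a quick normalization pass in Python (a minimal sketch; the content string is just a stand-in for your own text):

```python
# Replace Windows-style CRLF (and any stray CR) with plain line feeds
# before the text goes into a training example.
content = "TEXT.\r\nTEXT.\r\nTEXT"
content = content.replace("\r\n", "\n").replace("\r", "\n")
print(repr(content))  # 'TEXT.\nTEXT.\nTEXT'
```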
Encode the emoji escaped: 🚀 is U+1F680, which in a JSON string must be written as the surrogate pair \ud83d\ude80 (a \u escape takes exactly four hex digits, so code points above U+FFFF need two of them).
Is the user really going to type emoji anyway?
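If you do keep the emoji, you don't have to escape it by hand: json.dumps with its default ensure_ascii=True does it for you (a minimal sketch):

```python
import json

# Non-ASCII characters are escaped by default (ensure_ascii=True),
# so the rocket emoji comes out as its surrogate pair.
line = json.dumps({"role": "user", "content": "TEXT 🚀"})
print(line)  # {"role": "user", "content": "TEXT \ud83d\ude80"}
```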
Your triple quotes have broken the input. Use single quotes inside the text, or escape any embedded double quotes as \".
Is the user really going to type triple quotes anyway?
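The same serializer also fixes the quoting problem, since embedded double quotes come out escaped instead of terminating the string (a minimal sketch):

```python
import json

# Double quotes inside the content are escaped as \" automatically,
# so the surrounding JSON string stays intact.
line = json.dumps({"role": "user", "content": 'He typed """TEXT""" here'})
print(line)  # {"role": "user", "content": "He typed \"\"\"TEXT\"\"\" here"}
```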
The system prompt should be the one your bot will actually use in production.
Use plain text without extra weird training symbols, unless those symbols really are part of the input and output you want at inference time.
Every full conversation example must be proper JSON on a single line.
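Putting it all together, here is one way to write a clean file (a minimal Python sketch; the conversation texts and the fine_tune.jsonl filename are placeholders):

```python
import json

# Placeholder conversation; substitute your real data.
conversations = [
    {
        "messages": [
            {"role": "system", "content": "TEXT.\nTEXT.\nTEXT."},
            {"role": "user", "content": "TEXT 🚀.\nTEXT.\nTEXT"},
            {"role": "assistant", "content": "TEXT"},
        ]
    },
]

# One complete conversation per line; json.dumps handles the escaping
# of quotes and non-ASCII characters.
with open("fine_tune.jsonl", "w", encoding="utf-8") as f:
    for convo in conversations:
        f.write(json.dumps(convo) + "\n")
```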