I have a training jsonl file ready to be passed to fine tune a model based on gpt-3.5. My file has passed the validation tests as recommended in the API documentation. However I get an invalid training_file when I try to commence a fine-tune. What is the next step in validation. The file contains over 1000 lines.
Every full JSON of messages must be its own line without line breaks. No special characters like tabs. All characters within strings fully-escaped so as not to break out.
You should be able to write a file parser that extracts individual lines by the linefeed character alone, and each passes as a valid JSON. Extract the individual messages from within and validate, Write a jsonschema that enforces rules. You can write a stripper that detects or cleans bad characters like tabs from the entire file.
Max example conversation is around 4000 tokens total.
I have reduced it to 4 lines and it still has the error so it is something obvious here they are:
{"messages": [{"role": "system", "content": "Plani is a factual chatbot that is mildly funny and a little sarcastic."}, {"role": "assistant", "content": "Yep, thats me!"}]}
{"messages": [{"role": "system", "content": "Plani is an export in the Planimate Simulation Software"}, {"role": "assistant", "content": "11000111101001!"}]}
{"messages": [{"role": "user", "content": "Here are some sample files that contain code from a random selection of code blocks"}, {"role": "assistant", "content": "Thank you for providing the sample files. I'll start by examining the content of these files to understand the Planimate code structure and context. After that, I'll be better equipped to assist you."}]}
{"messages": [{"role": "user", "content": "== Table Cell Click (TableClick) =="}, {"role": "assistant", "content": "Information about clicks and some keypresses on table cells This is sent by a broadcast chosen by the modeller. These item attributes will be set: *_click_type *_view_row *_view_col *_from_row *_from_col *_to_row *_to_col *_previous *_current *_row *_col *_contextclick *_buttons *_keycode *_dataid *_beforeedit "}]}
The minimum is 10, 4 is âinvalid training fileâ.
You should also watch to see the status of the training file, that you are actually trying to use it only after it has been processed, which can take hours.
ahhh⌠thanks for the tip. I tried again with 11 and recieved the same error. Can you see any issues with the formatting?.. It must have something to do with that. Not every line requires a user and system role I assume? One or the other will suffice?
{"messages": [{"role": "system", "content": "Plani is a factual chatbot that is mildly funny and a little sarcastic."}, {"role": "assistant", "content": "Yep, that\u00e2\u20ac\u2122s me!"}]}
{"messages": [{"role": "system", "content": "Plani is an export in the Planimate Simulation Software"}, {"role": "assistant", "content": "11000111101001!"}]}
{"messages": [{"role": "user", "content": "Here are some sample files that contain code from a random selection of code blocks"}, {"role": "assistant", "content": "Thank you for providing the sample files. I'll start by examining the content of these files to understand the Planimate code structure and context. After that, I'll be better equipped to assist you."}]}
{"messages": [{"role": "user", "content": "== Table Cell Click (TableClick) =="}, {"role": "assistant", "content": "Information about clicks and some keypresses on table cells This is sent by a broadcast chosen by the modeller. These item attributes will be set: *_click_type *_view_row *_view_col *_from_row *_from_col *_to_row *_to_col *_previous *_current *_row *_col *_contextclick *_buttons *_keycode *_dataid *_beforeedit "}]}
{"messages": [{"role": "user", "content": "== Table Scroll Change (TableScroll) =="}, {"role": "assistant", "content": "Notification of a change in a table view's scroll This is sent by a broadcast chosen by the modeller. These item attributes will be set: *_rowscroll *_colscroll "}]}
{"messages": [{"role": "user", "content": "== Train Graph Click (TrainClick) =="}, {"role": "assistant", "content": "Information about clicks in train graphs This is sent by a broadcast chosen by the modeller. These item attributes will be set: *_command *_row1 *_row2 *_time *_y "}]}
{"messages": [{"role": "user", "content": "== Gantt Graph Click (GanttClick) =="}, {"role": "assistant", "content": "Information about clicks in gantt graphs This is sent by a broadcast chosen by the modeller. These item attributes will be set: *_command *_row1 *_row2 *_time *_y *_drag_type *_buttons "}]}
{"messages": [{"role": "user", "content": "== Log Driven Graph Click (GraphClick) =="}, {"role": "assistant", "content": "Information about clicks in log driven graphs This is sent by a broadcast chosen by the modeller. These item attributes will be set: *_command *_time *_y *_buttons "}]}
{"messages": [{"role": "user", "content": "== Grid View Click (GridClick) =="}, {"role": "assistant", "content": "Information about clicks in grid views This is sent by a broadcast chosen by the modeller. These item attributes will be set: *_row *_col "}]}
{"messages": [{"role": "user", "content": "== Column Overlay Graph Click (OvGraphClick) =="}, {"role": "assistant", "content": "Information about clicks in column overlay graphs This is sent by a broadcast chosen by the modeller. These item attributes will be set: *_command *_row *_col *_x *_y "}]}
{"messages": [{"role": "user", "content": "== Page Printed Broadcast (PagePrint) =="}, {"role": "assistant", "content": "Notification that a page was printed so another can be scheduled This is sent when the engine generates broadcasts _Page Printed. These item attributes will be set: *_page *_panel "}]}
Well, itâs a good thing you got stopped, because youâre going to get two different AIs where the skills of one isnât accessible by the other with that training.
All examples should be how you would use the AI. The system prompt you would use. The user input your user or your software would actually use.
No OpenAI examples have shown omission of a system message.
It looks like youâre trying to use the fine-tune as a knowledge retrieval system. Where only that exact â== message ==â could retrieve that output, and there is no inference to do in between. Not what fine-tune is best at. That doesnât use machine learning to better inform how to weight word sequences into an answer pattern.
I have a schema message validator that I am using on my own OpenAI chat list â custom format, but it is on a different system, shouldnât be too much to adapt it.
Not sure I completely follow you but it I think you are suggesting:
a) Have a system message in EACH of the records and having the same content could do.
b) the user content should have a message like âWhat would a table cell click returnâ and then the assistant would contain the current answer.
Do you think that would get closer to my goal?
== Log Driven Graph Click (GraphClick) ==
You mean thatâs not what a user is going to type in as input to the AI?
It seems like that should be part of the answer for the user question about what does what in software.
Appreciate the help, not sure I follow your question (sarcasm?). No its not and the exaples I have sent are a coding error and will correct it to the question not the statement of with the =='s. The above should read more like âWhat gets returned when you click on a Log Drive Graph?â
But I think this will get me going, no comment on the system addition so I assume I have that correct.
Yes, I have not seen anyone say âhereâs the success I had by NOT including system messagesâ, so assume they are mandatory like the OpenAI examples â unless you really want to experiment with using your own AI with no system message either.
Donât train the AI on inputs the AI will never see. Train it on the inputs it will receive and the outputs it should produce in response to them.
Welcome to the community @JayOz
Please make sure your data is utf-8
encoded.
You can validate your JSONL file before you upload it using this.
Yeh did that, thank you. It gave a no error report and said that 2 had no user role and âmanyâ had no system role so I had assumed they were not errors but statements of information about my data.
It was very handy though as my first pass did have 3 (of >1000) errors with zero content.
Here is my latest try, still get an invalid file, argh!
{"messages": [{"role": "system", "content": "Plani is a factual chatbot that is mildly funny and a little sarcastic."}, {"role": "user", "content": "What are you?"}, {"role": "assistant", "content": "I am a trained AI that should be able to assist you with your Planimte enquires."}]}
{"messages": [{"role": "system", "content": "Plani is a factual chatbot that is mildly funny and a little sarcastic."}, {"role": "user", "content": "Here are some sample files that contain code from a random selection of code blocks"}, {"role": "assistant", "content": "Thank you for providing the sample files. I'll start by examining the content of these files to understand the Planimate code structure and context. After that, I'll be better equipped to assist you."}]}
{"messages": [{"role": "system", "content": "Plani is a factual chatbot that is mildly funny and a little sarcastic."}, {"role": "user", "content": "What happens when I click on the Table Cell (Table)?"}, {"role": "assistant", "content": "Information about clicks and some keypresses on table cells This is sent by a broadcast chosen by the modeller. These item attributes will be set: *_click_type *_view_row *_view_col *_from_row *_from_col *_to_row *_to_col *_previous *_current *_row *_col *_contextclick *_buttons *_keycode *_dataid *_beforeedit "}]}
{"messages": [{"role": "system", "content": "Plani is a factual chatbot that is mildly funny and a little sarcastic."}, {"role": "user", "content": "What happens when I click on the Table Scroll Change (TableScroll)?"}, {"role": "assistant", "content": "Notification of a change in a table view's scroll This is sent by a broadcast chosen by the modeller. These item attributes will be set: *_rowscroll *_colscroll "}]}
{"messages": [{"role": "system", "content": "Plani is a factual chatbot that is mildly funny and a little sarcastic."}, {"role": "user", "content": "What happens when I click on the Train Graph (Train)?"}, {"role": "assistant", "content": "Information about clicks in train graphs This is sent by a broadcast chosen by the modeller. These item attributes will be set: *_command *_row1 *_row2 *_time *_y "}]}
{"messages": [{"role": "system", "content": "Plani is a factual chatbot that is mildly funny and a little sarcastic."}, {"role": "user", "content": "What happens when I click on the Gantt Graph (Gantt)?"}, {"role": "assistant", "content": "Information about clicks in gantt graphs This is sent by a broadcast chosen by the modeller. These item attributes will be set: *_command *_row1 *_row2 *_time *_y *_drag_type *_buttons "}]}
{"messages": [{"role": "system", "content": "Plani is a factual chatbot that is mildly funny and a little sarcastic."}, {"role": "user", "content": "What happens when I click on the Log Driven Graph (Graph)?"}, {"role": "assistant", "content": "Information about clicks in log driven graphs This is sent by a broadcast chosen by the modeller. These item attributes will be set: *_command *_time *_y *_buttons "}]}
{"messages": [{"role": "system", "content": "Plani is a factual chatbot that is mildly funny and a little sarcastic."}, {"role": "user", "content": "What happens when I click on the Grid View (Grid)?"}, {"role": "assistant", "content": "Information about clicks in grid views This is sent by a broadcast chosen by the modeller. These item attributes will be set: *_row *_col "}]}
{"messages": [{"role": "system", "content": "Plani is a factual chatbot that is mildly funny and a little sarcastic."}, {"role": "user", "content": "What happens when I click on the Column Overlay Graph (OvGraph)?"}, {"role": "assistant", "content": "Information about clicks in column overlay graphs This is sent by a broadcast chosen by the modeller. These item attributes will be set: *_command *_row *_col *_x *_y "}]}
{"messages": [{"role": "system", "content": "Plani is a factual chatbot that is mildly funny and a little sarcastic."}, {"role": "user", "content": "What happens when I click on the Page Printed Broadcast (PagePrint)?"}, {"role": "assistant", "content": "Notification that a page was printed so another can be scheduled This is sent when the engine generates broadcasts _Page Printed. These item attributes will be set: *_page *_panel "}]}
{"messages": [{"role": "system", "content": "Plani is a factual chatbot that is mildly funny and a little sarcastic."}, {"role": "user", "content": "What happens when I click on the Attribute View (AttributeEdit)?"}, {"role": "assistant", "content": "Notification that an attribute has been edited This is sent by a broadcast chosen by the modeller. These item attributes will be set: *_previous *_current "}]}
Validation:
Num examples: 11
First example:
{'role': 'system', 'content': 'Plani is a factual chatbot that is mildly funny and a little sarcastic.'}
{'role': 'user', 'content': 'What are you?'}
{'role': 'assistant', 'content': 'I am a trained AI that should be able to assist you with your Planimte enquires.'}
-----------------
NO ERRORS FOUND
-----------------
Num examples missing system message: 0
Num examples missing user message: 0
#### Distribution of num_messages_per_example:
min / max: 3, 3
mean / median: 3.0, 3.0
p5 / p95: 3.0, 3.0
#### Distribution of num_total_tokens_per_example:
min / max: 56, 121
mean / median: 85.0909090909091, 84.0
p5 / p95: 77.0, 93.0
#### Distribution of num_assistant_tokens_per_example:
min / max: 20, 76
mean / median: 39.81818181818182, 38.0
p5 / p95: 32.0, 46.0
0 examples may be over the 4096 token limit, they will be truncated during fine-tuning
Dataset has ~936 tokens that will be charged for during training
By default, you'll train for 9 epochs on this dataset
By default, you'll be charged for ~8424 tokens
You are trying to turn over these examples to the forum a lot faster than the API and queue would typically process files.
Check the status of your file uploads with the files endpoint object results:
The process is rejected and doesnât even start as a result of the error so there are no jobs to report on. To be sure I followed your advice, and nothing was returned.
I take your 11 example. Name it xxxxx.jsonl
Uploaded fine.
/listfiles.py ==================
{
âobjectâ: âlistâ,
âdataâ: [
{
âobjectâ: âfileâ,
âidâ: âfile-99999999â,
âpurposeâ: âfine-tuneâ,
âfilenameâ: âfileâ,
âbytesâ: 4684,
âcreated_atâ: 1695889490,
âstatusâ: âprocessedâ,
âstatus_detailsâ: null
}
]
}
runtune.py ===================
> print(created)
{
âobjectâ: âfine_tuning.jobâ,
âidâ: âftjob-NMAhOkwFOCCkFviSAFeLMtyLâ,
âmodelâ: âgpt-3.5-turbo-0613â,
âcreated_atâ: 1695890098,
âfinished_atâ: null,
âfine_tuned_modelâ: null,
âorganization_idâ: âorg-111112313141214â,
âresult_filesâ: ,
âstatusâ: âvalidating_filesâ,
âvalidation_fileâ: null,
âtraining_fileâ: âfile-99999999â,
âhyperparametersâ: {
ân_epochsâ: 1
},
âtrained_tokensâ: null,
âerrorâ: null
}
joblist.py ===================
{
âobjectâ: âlistâ,
âdataâ: [
{
âobjectâ: âfine_tuning.jobâ,
âidâ: âftjob-NMAhOkwFOCCkFviSAFeLMtyLâ,
âmodelâ: âgpt-3.5-turbo-0613â,
âcreated_atâ: 1695890098,
âfinished_atâ: null,
âfine_tuned_modelâ: null,
âorganization_idâ: âorg-ZHb7VWRPmJiJA45Tdxu3TeT8â,
âresult_filesâ: ,
âstatusâ: ârunningâ,
âvalidation_fileâ: null,
âtraining_fileâ: âfile-90N1hCWp2Qd26qCS5CmSTwdzâ,
âhyperparametersâ: {
ân_epochsâ: 1
},
âtrained_tokensâ: null,
âerrorâ: null
}
],
âhas_moreâ: false
}
joblist.py ===================
{
âobjectâ: âlistâ,
âdataâ: [
{
âobjectâ: âfine_tuning.jobâ,
âidâ: âftjob-NMAhOkwFOCCkFviSAFeLMtyLâ,
âmodelâ: âgpt-3.5-turbo-0613â,
âcreated_atâ: 1695890098,
âfinished_atâ: 1695890276,
âfine_tuned_modelâ: âft:gpt-3.5-turbo-0613:orgname::8888888â,
âorganization_idâ: âorg-8888888â,
âresult_filesâ: [
âfile-axRASp9MYjcYP8DWLlWQi9gDâ
],
âstatusâ: âsucceededâ,
âvalidation_fileâ: null,
âtraining_fileâ: âfile-9999999â,
âhyperparametersâ: {
ân_epochsâ: 1
},
âtrained_tokensâ: 914,
âerrorâ: null
}
],
âhas_moreâ: false
}
use
A penny to make a silly model. With system prompt, it answers âwhat are youâ differently than gpt-3.5-turbo, but not much.
Same problem here. It passes from the official validation scripts but fails during fine-tuning creation.
openai.FineTuningJob.create(training_file="file-abc123", model="gpt-3.5-turbo")
openai.error.InvalidRequestError: invalid training_file: file-abc123
This is the code I am using for fine-tuning.
import os
import openai
openai.api_key = os.getenv("OPENAI_API_KEY")
openai.File.create(
file=open("mydata.jsonl", "rb"),
purpose='fine-tune'
)
openai.FineTuningJob.create(training_file="file-abc123", model="gpt-3.5-turbo")
This is what data looks like. It repeats at every line 100 times.
{"messages": [{"role": "system", "content": "be nice"}, {"role": "user", "content": "testing"}, {"role": "assistant", "content": "ok"}]}
Are you checking the file upload and processing status through the files endpoint to see that your file is even ready to tune on? Thereâs been hour+ delays today.
Ohhh it works now. Thank you!
Apparently, I was misunderstanding what training_file
should be. I didnât know the system was creating a random file name for the uploaded files and we have to check that name using
openai.File.list()
I think the documentation could be more clear about it.
openai.FineTuningJob.create(training_file="file-abc123", model="gpt-3.5-turbo")
sounds like we can give a random file name ourselves at this step even if we didnât give any user-defined file name in openai.File.create
.
Glad to see you got through an unexpected step.
I was thinking of writing a python script that does the whole process from file, to status monitoring, to training, to a model check when it shows up through the models endpoint, but you reminded that it would have to figure out the right upload file name also.