Invalid training_file, but passes Validation tests

I have a training JSONL file ready to fine-tune a model based on gpt-3.5. The file has passed the validation tests recommended in the API documentation. However, I get an “invalid training_file” error when I try to start a fine-tune. What is the next step in validation? The file contains over 1000 lines.

Each complete JSON object of messages must sit on its own line, with no line breaks inside it. No raw special characters such as tabs. All characters within strings must be fully escaped so they cannot break out of the string.

You should be able to write a file parser that splits the file on the linefeed character alone, with each resulting line parsing as valid JSON. Extract the individual messages from each line and validate them; you can write a JSON Schema that enforces the rules. You can also write a stripper that detects or cleans bad characters such as tabs from the entire file.

The maximum length of one example conversation is around 4000 tokens in total.
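
A minimal sketch of the line-by-line check described above, for anyone who wants a starting point. The function name `check_jsonl_text` and the set of allowed roles are assumptions made for illustration, not an official validator, and it does not count tokens.

```python
import json

# Assumption: only these three roles appear in chat-format training data.
ALLOWED_ROLES = {"system", "user", "assistant"}


def check_jsonl_text(text):
    """Return a list of (line_number, problem) tuples for chat-format JSONL."""
    problems = []
    # Split on the linefeed character alone, as suggested in the post.
    for number, line in enumerate(text.split("\n"), start=1):
        if line == "":
            continue  # tolerate a trailing newline or a blank line
        if "\t" in line or "\r" in line:
            problems.append((number, "raw tab or carriage return in line"))
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"not valid JSON: {exc.msg}"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((number, "missing or empty 'messages' list"))
            continue
        for message in messages:
            if not isinstance(message, dict):
                problems.append((number, "message is not an object"))
                continue
            if message.get("role") not in ALLOWED_ROLES:
                problems.append((number, f"unexpected role: {message.get('role')!r}"))
            content = message.get("content")
            if not isinstance(content, str) or not content.strip():
                problems.append((number, "empty or non-string content"))
    return problems
```

Reading the file with `open(path, encoding="utf-8", newline="")` and passing the text to the function keeps any stray carriage returns visible to the check.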

I have reduced it to 4 lines and it still gives the error, so it must be something obvious. Here they are:

{"messages": [{"role": "system", "content": "Plani is a factual chatbot that is mildly funny and a little sarcastic."}, {"role": "assistant", "content": "Yep, thats me!"}]}
{"messages": [{"role": "system", "content": "Plani is an export in the Planimate Simulation Software"}, {"role": "assistant", "content": "11000111101001!"}]}
{"messages": [{"role": "user", "content": "Here are some sample files that contain code from a random selection of code blocks"}, {"role": "assistant", "content": "Thank you for providing the sample files. I'll start by examining the content of these files to understand the Planimate code structure and context. After that, I'll be better equipped to assist you."}]}
{"messages": [{"role": "user", "content": "== Table Cell Click (TableClick) =="}, {"role": "assistant", "content": "Information about clicks and some keypresses on table cells  This is sent by a broadcast chosen by the modeller. These item attributes will be set:  *_click_type *_view_row *_view_col *_from_row *_from_col *_to_row *_to_col *_previous *_current *_row *_col *_contextclick *_buttons *_keycode *_dataid *_beforeedit  "}]}

The minimum is 10 examples; a file with only 4 gets “invalid training file”.

You should also watch the status of the training file, so that you only try to use it after it has been processed, which can take hours.

Ahhh… thanks for the tip. I tried again with 11 and received the same error. Can you see any issues with the formatting? It must have something to do with that. Not every line requires both a user and a system role, I assume? One or the other will suffice?

{"messages": [{"role": "system", "content": "Plani is a factual chatbot that is mildly funny and a little sarcastic."}, {"role": "assistant", "content": "Yep, that\u00e2\u20ac\u2122s me!"}]}
{"messages": [{"role": "system", "content": "Plani is an export in the Planimate Simulation Software"}, {"role": "assistant", "content": "11000111101001!"}]}
{"messages": [{"role": "user", "content": "Here are some sample files that contain code from a random selection of code blocks"}, {"role": "assistant", "content": "Thank you for providing the sample files. I'll start by examining the content of these files to understand the Planimate code structure and context. After that, I'll be better equipped to assist you."}]}
{"messages": [{"role": "user", "content": "== Table Cell Click (TableClick) =="}, {"role": "assistant", "content": "Information about clicks and some keypresses on table cells  This is sent by a broadcast chosen by the modeller. These item attributes will be set:  *_click_type *_view_row *_view_col *_from_row *_from_col *_to_row *_to_col *_previous *_current *_row *_col *_contextclick *_buttons *_keycode *_dataid *_beforeedit  "}]}
{"messages": [{"role": "user", "content": "== Table Scroll Change (TableScroll) =="}, {"role": "assistant", "content": "Notification of a change in a table view's scroll  This is sent by a broadcast chosen by the modeller. These item attributes will be set:  *_rowscroll *_colscroll  "}]}
{"messages": [{"role": "user", "content": "== Train Graph Click (TrainClick) =="}, {"role": "assistant", "content": "Information about clicks in train graphs  This is sent by a broadcast chosen by the modeller. These item attributes will be set:  *_command *_row1 *_row2 *_time *_y  "}]}
{"messages": [{"role": "user", "content": "== Gantt Graph Click (GanttClick) =="}, {"role": "assistant", "content": "Information about clicks in gantt graphs  This is sent by a broadcast chosen by the modeller. These item attributes will be set:  *_command *_row1 *_row2 *_time *_y *_drag_type *_buttons  "}]}
{"messages": [{"role": "user", "content": "== Log Driven Graph Click (GraphClick) =="}, {"role": "assistant", "content": "Information about clicks in log driven graphs  This is sent by a broadcast chosen by the modeller. These item attributes will be set:  *_command *_time *_y *_buttons  "}]}
{"messages": [{"role": "user", "content": "== Grid View Click (GridClick) =="}, {"role": "assistant", "content": "Information about clicks in grid views  This is sent by a broadcast chosen by the modeller. These item attributes will be set:  *_row *_col  "}]}
{"messages": [{"role": "user", "content": "== Column Overlay Graph Click (OvGraphClick) =="}, {"role": "assistant", "content": "Information about clicks in column overlay graphs  This is sent by a broadcast chosen by the modeller. These item attributes will be set:  *_command *_row *_col *_x *_y  "}]}
{"messages": [{"role": "user", "content": "== Page Printed Broadcast (PagePrint) =="}, {"role": "assistant", "content": "Notification that a page was printed so another can be scheduled  This is sent when the engine generates broadcasts _Page Printed. These item attributes will be set:  *_page *_panel  "}]}

Well, it’s a good thing you got stopped, because with that training you are going to get two different AIs, where the skills of one aren’t accessible to the other.

All examples should reflect how you would actually use the AI: the system prompt you would use, and the user input your user or your software would actually send.

No OpenAI examples have shown omission of a system message.

It looks like you’re trying to use the fine-tune as a knowledge retrieval system, where only that exact “== message ==” could retrieve that output and there is no inference to do in between. That is not what fine-tuning is best at, because it doesn’t use machine learning to better inform how to weight word sequences into an answer pattern.

I have a schema-based message validator that I use for my own conversion from an OpenAI chat list to a custom format. It is on a different system, but it shouldn’t take much to adapt it.

Not sure I completely follow you, but I think you are suggesting:

a) Have a system message in EACH of the records; having the same content in each would do.
b) The user content should hold a question like “What would a table cell click return?” and the assistant content would then hold the current answer.

Do you think that would get closer to my goal?

== Log Driven Graph Click (GraphClick) ==

You mean that’s not what a user is going to type in as input to the AI? :stuck_out_tongue:

It seems like that should be part of the answer to a user question about what does what in the software.

Appreciate the help, though I’m not sure I follow your question (sarcasm?). No, it’s not. The examples I sent are the result of a coding error, and I will correct them to use the question rather than the statement with the =='s. The above should read more like “What gets returned when you click on a Log Driven Graph?”

But I think this will get me going. There was no comment on the system addition, so I assume I have that correct.

Yes, I have not seen anyone say “here’s the success I had by NOT including system messages”, so assume they are mandatory like the OpenAI examples – unless you really want to experiment with using your own AI with no system message either.

Don’t train the AI on inputs the AI will never see. Train it on the inputs it will receive and the outputs it should produce in response to them.
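
A rough sketch of building records that match real usage: the same system prompt in every example, a user question as it would actually be asked, and the existing documentation text as the assistant answer. The helper names `make_example` and `to_jsonl` are hypothetical.

```python
import json

# The system prompt taken from the sample data in this thread.
SYSTEM_PROMPT = "Plani is a factual chatbot that is mildly funny and a little sarcastic."


def make_example(question, answer, system_prompt=SYSTEM_PROMPT):
    """Build one training record with the same system prompt every time."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer.strip()},
        ]
    }


def to_jsonl(examples):
    """Serialize one record per line; json.dumps escapes quotes, tabs and newlines."""
    return "\n".join(json.dumps(example) for example in examples) + "\n"
```

For example, `make_example("What gets returned when you click on a Log Driven Graph?", "Information about clicks in log driven graphs ...")` gives one record, and `to_jsonl` turns a list of such records into file content.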

Welcome to the community @JayOz

Please make sure your data is utf-8 encoded.

You can validate your JSONL file before you upload it using this.
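
On the encoding point: the sequence `\u00e2\u20ac\u2122` in one of the samples earlier in the thread is what a UTF-8 right single quote looks like after being decoded as Windows-1252. A small sketch of checking for valid UTF-8 and undoing that particular kind of damage follows; the function names are made up for illustration, and the repair only covers this one mis-decoding.

```python
def is_valid_utf8(raw_bytes):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw_bytes.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def repair_mojibake(text):
    """Undo UTF-8 text that was wrongly decoded as Windows-1252, if possible."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not this kind of damage; leave the text untouched
```

Note that a file can be valid UTF-8 and still contain mojibake, since the damaged characters are themselves legal Unicode, so both checks are useful.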

Yeah, did that, thank you. It reported no errors and said that 2 examples had no user role and ‘many’ had no system role, so I had assumed these were not errors but informational statements about my data.

It was very handy though, as my first pass did have 3 errors (out of >1000 lines) where the content was empty.

:+1:

Here is my latest try, still get an invalid file, argh!

{"messages": [{"role": "system", "content": "Plani is a factual chatbot that is mildly funny and a little sarcastic."}, {"role": "user", "content": "What are you?"}, {"role": "assistant", "content": "I am a trained AI that should be able to assist you with your Planimte enquires."}]}
{"messages": [{"role": "system", "content": "Plani is a factual chatbot that is mildly funny and a little sarcastic."}, {"role": "user", "content": "Here are some sample files that contain code from a random selection of code blocks"}, {"role": "assistant", "content": "Thank you for providing the sample files. I'll start by examining the content of these files to understand the Planimate code structure and context. After that, I'll be better equipped to assist you."}]}
{"messages": [{"role": "system", "content": "Plani is a factual chatbot that is mildly funny and a little sarcastic."}, {"role": "user", "content": "What happens when I click on the Table Cell  (Table)?"}, {"role": "assistant", "content": "Information about clicks and some keypresses on table cells  This is sent by a broadcast chosen by the modeller. These item attributes will be set:  *_click_type *_view_row *_view_col *_from_row *_from_col *_to_row *_to_col *_previous *_current *_row *_col *_contextclick *_buttons *_keycode *_dataid *_beforeedit  "}]}
{"messages": [{"role": "system", "content": "Plani is a factual chatbot that is mildly funny and a little sarcastic."}, {"role": "user", "content": "What happens when I click on the Table Scroll Change (TableScroll)?"}, {"role": "assistant", "content": "Notification of a change in a table view's scroll  This is sent by a broadcast chosen by the modeller. These item attributes will be set:  *_rowscroll *_colscroll  "}]}
{"messages": [{"role": "system", "content": "Plani is a factual chatbot that is mildly funny and a little sarcastic."}, {"role": "user", "content": "What happens when I click on the Train Graph  (Train)?"}, {"role": "assistant", "content": "Information about clicks in train graphs  This is sent by a broadcast chosen by the modeller. These item attributes will be set:  *_command *_row1 *_row2 *_time *_y  "}]}
{"messages": [{"role": "system", "content": "Plani is a factual chatbot that is mildly funny and a little sarcastic."}, {"role": "user", "content": "What happens when I click on the Gantt Graph  (Gantt)?"}, {"role": "assistant", "content": "Information about clicks in gantt graphs  This is sent by a broadcast chosen by the modeller. These item attributes will be set:  *_command *_row1 *_row2 *_time *_y *_drag_type *_buttons  "}]}
{"messages": [{"role": "system", "content": "Plani is a factual chatbot that is mildly funny and a little sarcastic."}, {"role": "user", "content": "What happens when I click on the Log Driven Graph  (Graph)?"}, {"role": "assistant", "content": "Information about clicks in log driven graphs  This is sent by a broadcast chosen by the modeller. These item attributes will be set:  *_command *_time *_y *_buttons  "}]}
{"messages": [{"role": "system", "content": "Plani is a factual chatbot that is mildly funny and a little sarcastic."}, {"role": "user", "content": "What happens when I click on the Grid View  (Grid)?"}, {"role": "assistant", "content": "Information about clicks in grid views  This is sent by a broadcast chosen by the modeller. These item attributes will be set:  *_row *_col  "}]}
{"messages": [{"role": "system", "content": "Plani is a factual chatbot that is mildly funny and a little sarcastic."}, {"role": "user", "content": "What happens when I click on the Column Overlay Graph  (OvGraph)?"}, {"role": "assistant", "content": "Information about clicks in column overlay graphs  This is sent by a broadcast chosen by the modeller. These item attributes will be set:  *_command *_row *_col *_x *_y  "}]}
{"messages": [{"role": "system", "content": "Plani is a factual chatbot that is mildly funny and a little sarcastic."}, {"role": "user", "content": "What happens when I click on the Page Printed Broadcast (PagePrint)?"}, {"role": "assistant", "content": "Notification that a page was printed so another can be scheduled  This is sent when the engine generates broadcasts _Page Printed. These item attributes will be set:  *_page *_panel  "}]}
{"messages": [{"role": "system", "content": "Plani is a factual chatbot that is mildly funny and a little sarcastic."}, {"role": "user", "content": "What happens when I click on the Attribute View  (AttributeEdit)?"}, {"role": "assistant", "content": "Notification that an attribute has been edited  This is sent by a broadcast chosen by the modeller. These item attributes will be set:  *_previous *_current  "}]}

Validation:

Num examples: 11
First example:
{'role': 'system', 'content': 'Plani is a factual chatbot that is mildly funny and a little sarcastic.'}
{'role': 'user', 'content': 'What are you?'}
{'role': 'assistant', 'content': 'I am a trained AI that should be able to assist you with your Planimte enquires.'}
-----------------
NO ERRORS FOUND
-----------------
Num examples missing system message: 0
Num examples missing user message: 0

#### Distribution of num_messages_per_example:
min / max: 3, 3
mean / median: 3.0, 3.0
p5 / p95: 3.0, 3.0

#### Distribution of num_total_tokens_per_example:
min / max: 56, 121
mean / median: 85.0909090909091, 84.0
p5 / p95: 77.0, 93.0

#### Distribution of num_assistant_tokens_per_example:
min / max: 20, 76
mean / median: 39.81818181818182, 38.0
p5 / p95: 32.0, 46.0

0 examples may be over the 4096 token limit, they will be truncated during fine-tuning
Dataset has ~936 tokens that will be charged for during training
By default, you'll train for 9 epochs on this dataset
By default, you'll be charged for ~8424 tokens
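
The charged figure in the report above appears to be simply the dataset token count multiplied by the number of epochs. A tiny sketch of that arithmetic, where the function name is hypothetical:

```python
def billed_training_tokens(dataset_tokens, n_epochs):
    """Estimate billed tokens as tokens in the dataset times training epochs."""
    return dataset_tokens * n_epochs


print(billed_training_tokens(936, 9))  # 8424, matching the report
print(billed_training_tokens(936, 1))  # 936 if the job is run for one epoch
```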

You are trying to turn over these examples to the forum a lot faster than the API and queue would typically process files.

Check the status of your file uploads in the objects returned by the files endpoint:

https://platform.openai.com/docs/api-reference/files/list
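
A minimal polling sketch for waiting until an uploaded file is ready. It takes the retrieval function as a parameter, so it makes no claim about any particular SDK version; `wait_until_processed` is a hypothetical helper, and treating `"error"` as a failure status is an assumption.

```python
import time


def wait_until_processed(retrieve_file, file_id, timeout_s=3600, interval_s=15, sleep=time.sleep):
    """Poll until the file's status is 'processed'; raise on error or timeout."""
    waited = 0
    while True:
        info = retrieve_file(file_id)
        status = info["status"]
        if status == "processed":
            return info
        if status == "error":  # assumption: the API reports failures this way
            raise RuntimeError(f"file {file_id} failed: {info.get('status_details')}")
        if waited >= timeout_s:
            raise TimeoutError(f"file {file_id} still '{status}' after {waited}s")
        sleep(interval_s)
        waited += interval_s
```

With the pre-1.0 `openai` Python package, something like `openai.File.retrieve` could presumably be passed as `retrieve_file`; check that against the SDK version you have installed.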

The request is rejected with the error and the job never starts, so there are no jobs to report on. To be sure, I followed your advice, and nothing was returned.

I took your 11-example file and named it xxxxx.jsonl

Uploaded fine.

/listfiles.py ==================

{
  "object": "list",
  "data": [
    {
      "object": "file",
      "id": "file-99999999",
      "purpose": "fine-tune",
      "filename": "file",
      "bytes": 4684,
      "created_at": 1695889490,
      "status": "processed",
      "status_details": null
    }
  ]
}

runtune.py ===================

> print(created)

{
  "object": "fine_tuning.job",
  "id": "ftjob-NMAhOkwFOCCkFviSAFeLMtyL",
  "model": "gpt-3.5-turbo-0613",
  "created_at": 1695890098,
  "finished_at": null,
  "fine_tuned_model": null,
  "organization_id": "org-111112313141214",
  "result_files": [],
  "status": "validating_files",
  "validation_file": null,
  "training_file": "file-99999999",
  "hyperparameters": {
    "n_epochs": 1
  },
  "trained_tokens": null,
  "error": null
}

joblist.py ===================

{
  "object": "list",
  "data": [
    {
      "object": "fine_tuning.job",
      "id": "ftjob-NMAhOkwFOCCkFviSAFeLMtyL",
      "model": "gpt-3.5-turbo-0613",
      "created_at": 1695890098,
      "finished_at": null,
      "fine_tuned_model": null,
      "organization_id": "org-ZHb7VWRPmJiJA45Tdxu3TeT8",
      "result_files": [],
      "status": "running",
      "validation_file": null,
      "training_file": "file-90N1hCWp2Qd26qCS5CmSTwdz",
      "hyperparameters": {
        "n_epochs": 1
      },
      "trained_tokens": null,
      "error": null
    }
  ],
  "has_more": false
}

joblist.py ===================

{
  "object": "list",
  "data": [
    {
      "object": "fine_tuning.job",
      "id": "ftjob-NMAhOkwFOCCkFviSAFeLMtyL",
      "model": "gpt-3.5-turbo-0613",
      "created_at": 1695890098,
      "finished_at": 1695890276,
      "fine_tuned_model": "ft:gpt-3.5-turbo-0613:orgname::8888888",
      "organization_id": "org-8888888",
      "result_files": [
        "file-axRASp9MYjcYP8DWLlWQi9gD"
      ],
      "status": "succeeded",
      "validation_file": null,
      "training_file": "file-9999999",
      "hyperparameters": {
        "n_epochs": 1
      },
      "trained_tokens": 914,
      "error": null
    }
  ],
  "has_more": false
}

use

A penny to make a silly model. With system prompt, it answers “what are you” differently than gpt-3.5-turbo, but not much.

Same problem here. My file passes the official validation script, but creating the fine-tuning job fails.

openai.FineTuningJob.create(training_file="file-abc123", model="gpt-3.5-turbo")
openai.error.InvalidRequestError: invalid training_file: file-abc123

This is the code I am using for fine-tuning.

import os
import openai
openai.api_key = os.getenv("OPENAI_API_KEY")
openai.File.create(
  file=open("mydata.jsonl", "rb"),
  purpose='fine-tune'
)
openai.FineTuningJob.create(training_file="file-abc123", model="gpt-3.5-turbo")

This is what the data looks like. The same line is repeated 100 times.

{"messages": [{"role": "system", "content": "be nice"}, {"role": "user", "content": "testing"}, {"role": "assistant", "content": "ok"}]}

Are you checking the file upload and processing status through the files endpoint to see that your file is even ready to tune on? There have been hour-plus delays today.

Ohhh it works now. Thank you!
Apparently, I was misunderstanding what training_file should be. I didn’t know the system generates a random file ID for each uploaded file, and that we have to look that ID up using

openai.File.list()

I think the documentation could be clearer about it.

openai.FineTuningJob.create(training_file="file-abc123", model="gpt-3.5-turbo")

makes it sound as if we can supply an arbitrary file name ourselves at this step, even though we didn’t give any user-defined file name in openai.File.create.
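
For later readers, a sketch of avoiding the hard-coded "file-abc123": keep the object returned by the upload call and pass its `id` on to the job creation call. The helper `start_fine_tune` is hypothetical and takes the two API calls as parameters so it does not depend on a specific SDK version.

```python
def start_fine_tune(upload_file, create_job, training_path, model="gpt-3.5-turbo"):
    """Upload the data, then start a job using the id the API assigned."""
    with open(training_path, "rb") as handle:
        uploaded = upload_file(file=handle, purpose="fine-tune")
    file_id = uploaded["id"]  # server-generated id, not a name you choose
    return create_job(training_file=file_id, model=model)


# With the calls quoted in this thread, usage would look roughly like:
#   import openai
#   job = start_fine_tune(openai.File.create, openai.FineTuningJob.create, "mydata.jsonl")
```

As noted earlier in the thread, the file may still need time to be processed after upload, so a status check between the two calls is probably worth adding.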

Glad to see you got through an unexpected step.

I was thinking of writing a Python script that does the whole process, from file upload, to status monitoring, to training, to a model check when it shows up through the models endpoint, but you reminded me that it would also have to figure out the right uploaded file name.