Fine tuning error: The job failed due to an invalid training file. Unexpected file format, expected either prompt/completion pairs or chat messages

I’m currently trying to fine-tune a GPT-3.5-Turbo model (0125) using the fine tuning API. I’ve read through the documentation provided by OpenAI and have formatted my code as a JSONL file but every time I try to fine tune my model it fails during the validation stage and returns the error message:

The job failed due to an invalid training file. Unexpected file format, expected either prompt/completion pairs or chat messages.

My json file is in the following format:

Which then looks like this after being converted to JSON Line format:
{“messages”:[{“role”:“user”,“content”:“Expected user input”},{“role”:“assistant”,“content”:“Expected outcome of the AI”}]}

I’m not sure where I’m going wrong as it all looks correct and used GPT-4 to help debug and find any errors with my JSONL file. Any help / advice is appreciated

Welcome to the Community!

One full training example encompassing both the user and assistant message must be in a single line in the JSONL file.

Based on what you shared above, your user and assistant messages seem to be in two separate lines, which would explain the error.

2 Likes

That’s just with how it was copied across. ChatGPT displays it as a singular line but when I copy it, it spreads across two lines.

1 Like

Hm. the error is not obvious to the eye based on what you posted.

But if you can code you can use the following cookbook which includes a script for fine-tuning data format validation:

2 Likes

I’ve ran the suggested code with my JSONL file and it outputs this to the console:

Screenshot 2024-04-24 at 15.39.33

I’m not sure if the messages need a system message as I was under the impression that fine-tuning a model looks at the writing style / format of outputs rather then the personality of the prompt. I’d provide a link to my replit but I’m unable to share links as a new user

1 Like

To my understanding it is not a requirement. I’ve personally only fine-tuned with a system message, so don’t have a direct data point whether this could be linked to the error. I can briefly check later tonight though if it is not resolved by then.

2 Likes

If you are fine-tuning, I think you need to first describe the system in the role.

Like this:

{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, a

I’m sorry, the information that I provided was incorrect.

3 Likes

Yeah, I also just tested it to rule it out as a possibility but my files without system messages were accepted.

3 Likes

I’ve tried a few new things today such as creating a CSV file which I convert into JSONL.:

{"messages":[{"role":"system","content":"You are a helpful assistant"}]}
{"messages":[{"role":"user","content":"What's the capital of France?"}]}
{"messages":[{"role":"assistant","content":"Paris"}]}
{"messages":[{"role":"system","content":"You are a helpful assistant"}]}
{"messages":[{"role":"user","content":"Who's the current King of England?"}]}
{"messages":[{"role":"assistant","content":"Charles"}]}
{"messages":[{"role":"system","content":"You are a helpful assistant"}]}
{"messages":[{"role":"user","content":"What's 1+1?"}]}
{"messages":[{"role":"assistant","content":"2"}]}
{"messages":[{"role":"system","content":"You are a helpful assistant"}]}
{"messages":[{"role":"user","content":"What colour is the sky?"}]}
{"messages":[{"role":"assistant","content":"Blue"}]}
{"messages":[{"role":"system","content":"You are a helpful assistant"}]}
{"messages":[{"role":"user","content":"How many fingers on a hand?"}]}
{"messages":[{"role":"assistant","content":"5"}]}
{"messages":[{"role":"system","content":"You are a helpful assistant"}]}
{"messages":[{"role":"user","content":"How many days in a week?"}]}
{"messages":[{"role":"assistant","content":"7"}]}
{"messages":[{"role":"system","content":"You are a helpful assistant"}]}
{"messages":[{"role":"user","content":"What is the last month of the year?"}]}
{"messages":[{"role":"assistant","content":"December"}]}
{"messages":[{"role":"system","content":"You are a helpful assistant"}]}
{"messages":[{"role":"user","content":"Who works in a school?"}]}
{"messages":[{"role":"assistant","content":"Teachers"}]}
{"messages":[{"role":"system","content":"You are a helpful assistant"}]}
{"messages":[{"role":"user","content":"How many sides does a square have?"}]}
{"messages":[{"role":"assistant","content":"4"}]}
{"messages":[{"role":"system","content":"You are a helpful assistant"}]}
{"messages":[{"role":"user","content":"Hello AI"}]}
{"messages":[{"role":"assistant","content":"Hello User"}]}
{"messages":[{"role":"system","content":"You are a helpful assistant"}]}
{"messages":[{"role":"user","content":"What is in the ocean?"}]}
{"messages":[{"role":"assistant","content":"Water"}]}

When I run this through the fine-tuner, I now get the following message instead:

The job failed due to an invalid training file. Invalid file format. Line 1, key “messages”: At least one message must be from the assistant

Still not ideal but a different error message is at least some form of progress

As indicated earlier, one full example must be in one line. Assuming your first three lines constitute one full example, you need to re-format them as follows in order to be considered valid:

{"messages": [{"role": "system", "content": "You are a helpful assistant"}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris"}]}

You can leave out the system message if that is not required. Regardless, all messages (i.e. system, user and assistant messages) must be in the same line, following the format shown here.

1 Like

Apologies, I’ve put it into the format and now it’s returning this error messgae:

The job failed due to an invalid training file. Expected file to have JSONL format, where every line is a valid JSON dictionary. Line 1 is not a dictionary.

Below you’ll find my full JSONL file which I’m using for the project

{"messages": [{"role": "user", "content": "Generate a report for a student who has done the following: Performed well this academic year both in lesson and exams but could improve further by revising questions before the final exams"},{"role": "assistant", "content": "In {SUBJECT} throughout the year, {STUDENT} has made steady and consistent progress with both in class and independent work completed. {HIS/HER/THEIR} performance in the recent computer systems end-of-topic test was particularly pleasing, especially given the technical nature of the material. I would encourage {HE/SHE/THEY} to ensure {HIS/HER/THEIR} preparation for the Summer Examinations includes answering plenty of past examination questions and carefully noting significant keywords used in the mark schemes."}]}
{"messages": [{"role": "user", "content": "Generate a report for a student who has done the following: Shown exceptional talent in the subject area but has some gaps in knowledge due to absences and could do with some revision"},{"role": "assistant", "content": "{STUDENT} is a highly capable student of {SUBJECT} who has great potential in this subject. {HE/SHE/THEY} has a sound prior understanding of {SUBJECT} knowledge and will actively participate in class activities. However, {STUDENT} has some gaps in {HIS/HER/THEIR} theoretical knowledge of topics covered so far due to absences. To help {HIS/HER/THEIR} achieve moving forward, {HE/SHE/THEY} needs to focus on {HIS/HER/THEIR} approach to revision, which is going to be required for the forthcoming Year {YEAR} Examinations. {HE/SHE/THEY} should ensure {HE/SHE/THEY} learns keyword answers and bullet points for written answers, as well as practising lots of examination-style questions including carefully noting details expected in the mark scheme."}]}
{"messages": [{"role": "user", "content": "Generate a report for a student who has done the following: Performed well in lesson by participating but could be pushed further with more challenging work"},{"role": "assistant", "content": "From my observation, {STUDENT} has a very good understanding of the {SUBJECT} concepts covered so far. {HE/SHE/THEY} has impressed by actively participating in class and asking questions when necessary. To further develop {HIS/HER/THEIR} {SUBJECT} skills, I plan to introduce more challenging concepts and projects to push {STUDENT}. Approaching all tasks in the same methodical way {HE/SHE/THEY} has developed {HIS/HER/THEIR} course notes will certainly aid his revision for the upcoming Year {YEAR} Examinations."}]}
{"messages": [{"role": "user", "content": "Generate a report for a student who has done the following: performed exceptionally well this year with clear improvements in the subject but could develop a better revision strategy to improve further"},{"role": "assistant", "content": "{STUDENT} has really impressed me this academic year, making excellent progress with clear improvements with every piece of work {HE/SHE/THEY} undertakes. {HIS/HER/THEIR} topic assessments have demonstrated how {HE/SHE/THEY} understands how to answer examination questions to access the full range of the marks available. By developing {HIS/HER/THEIR} revision strategy, {HE/SHE/THEY} will reduce the challenge of recalling the full course when {HE/SHE/THEY} sits {HIS/HERS/THEIR} Mock Examinations and final GCSE Examinations in Summer."}]}
{"messages": [{"role": "user", "content": "Generate a report for a student who has done the following: impressive subject knowledge and high quality work but needs to revise ahead of the upcoming Summer Internal Examination"},{"role": "assistant", "content": "I have been thoroughly impressed with {STUDENT}’s conscientious and hardworking approach this academic year, reflected in the standard of work that {HE/SHE/THEY} has produced. {HIS/HER/THEIR} responses to assessment questions indicate that {HE/SHE/THEY} must continue with a diligent approach to revision and ensure {HIS/HER/THEIR} notes are checked for accuracy. Focus must be on learning specific definitions which will certainly enable {HIM/HER/THEM} to demonstrate {HIS/HER/THEIR} technical understanding with greater precision in the upcoming Summer Internal Examinations."}]}
{"messages": [{"role": "user", "content": "Generate a report for a student who has done the following: enthusiastic in lessons and shows lots of potential in the subject but needs to focus on revision due to missing some lessons"},{"role": "assistant", "content": "{STUDENT} is an enthusiastic student who has potential in {SUBJECT}. {HE/SHE/THEY} understands the fundamentals of {SUBJECT} topics covered in class and will actively participate in lessons. However, {STUDENT} has some gaps in {HIS/HER/THEIR} theoretical knowledge of topics covered so far due to absences. To help {HIM/HER/THEM} achieve moving forward, {HE/SHE/THEY} needs to focus on {HIS/HER/THEIR} approach to revision, which is going to be required for the forthcoming Year {YEAR} Examinations. {HE/SHE/THEY} should ensure {HE/SHE/THEY} learns keyword answers and bullet points for written answers, as well as practising lots of examination-style questions including carefully noting details expected in the mark scheme."}]}
{"messages": [{"role": "user", "content": "Generate a report for a student who has done the following: continues to gain independence in lesson by focusing on questions during lesson and is exploring newer ways of solving problems but needs to work on revision ahead of any examinations by learning key terminology"},{"role": "assistant", "content": "During each lesson, {STUDENT} has continued to apply {HIMSELF/HERSELF/THEMSELF} toward solving each problem with focus and determination. I commend {HIS/HER/THEIR} application while exploring technical concepts and {HIS/HER/THEIR} improved independence. Though {HE/SHE/THEY} must develop revision practices to enable {HIMSELF/HERSELF/THEMSELF} to maintain the high standards set in topic assessments by ensuring {HE/SHE/THEY} learns key technical terminology and practices examination-style questions including carefully noting details expected in the mark scheme. I am hopeful that {STUDENT} will then demonstrate {HIS/HER/THEIR} progress with a positive performance in his Year {YEAR} Examination"}]}
{"messages": [{"role": "user", "content": "Generate a report for a student who has done the following: performed much better in a recent assignment but still needs to improve by focusing on completing work take time when answering questions. Student didn't read the homework correctly which resulted in the work being incorrect"},{"role": "assistant", "content": "Having completed a recent assessment on a crucial skill to {SUBJECT}, {STUDENT} has achieved an exceptional result. It is a pleasure to report {HIS/HER/THEIR} improved attainment grade since {HIS/HER/THEIR} more closely matched {HIS/HER/THEIR} capability. However, {HE/SHE/THEY} must focus on {HIS/HER/THEIR} consistency in {HIS/HER/THEIR} application for all types of assignments. {HE/SHE/THEY} must always use the range of resources which are at {HIS/HER/THEIR} disposal and avoid rushing {HIS/HER/THEIR} answers. Furthermore, by not reading a homework task correctly, {STUDENT} missed the entire point of the task. Therefore, greater care and attention to detail will enable {HIM/HER/THEM} to continue to progress."}]}
{"messages": [{"role": "user", "content": "Generate a report for a student who has done the following: performed better in a recent exam and has improved at technical aspects of the subject but doesn't use all of the resources available to help support better homework. Also needs to ask for support in lesson to achieve grades 8 & 9 at GCSE"},{"role": "assistant", "content": "Having recently improved {HIS/HER/THEIR} responses to assessment questions, {STUDENT} has clearly developed {HIS/HER/THEIR} revision techniques. {HE/SHE/THEY} has become more proficient in {HIS/HER/THEIR} technical skills which has allowed {STUDENT} to solve problems and is an active member of the class. However, disappointingly, a recent homework, provided him with some challenges which could have been rectified by more efficient use of additional resources. {HE/SHE/THEY} could also ask for support to clarify {HIS/HER/THEIR} understanding. {STUDENT} is able to achieve higher results in the grade ‘8’ and ‘9’ range, if {HE/SHE/THEY} applies a little more consistency in {HIS/HER/THEIR} study techniques and I encourage {HIM/HER/THEM} to read more widely"}]}
{"messages": [{"role": "user", "content": "Generate a report for a student who has done the following: always attempts challenging work and works well independently but could improve further by reading questions before answering them and work on the technical skills of the subject"},{"role": "assistant", "content": "A hard-working and conscientious learner, {STUDENT} has responded to each challenge during lessons with superb methodology. Furthermore, I have observed {HIM/HER/THEM} testing out {HIM/HER/THEIR} theories with greater confidence and willingness to make mistakes. {HE/SHE/THEY} has continually reacted to my feedback which has been to read assignments more carefully before responding to questions and complete further research of technical detail where necessary. Therefore, while {STUDENT}’s application is superb, {HE/SHE/THEY} must continually revise of concepts in order to perfect the technical detail in {HIS/HER/THEIR} responses while under test conditions."}]}

I have just tried to upload your file via the fine-tuning UI after I could not spot any formatting issues. It was validated and the fine-tuning job started as expected. Same outcome when initiating the fine-tuning via python script.

Is your file properly saved as JSONL file?

1 Like

Here’s the file I’m uploading to the OpenAI fine-tuning section of the site

And here’s my configuration settings of the fine-tuning model on OpenAI’s API site:

I’m really not sure anymore what it could be given it worked on my end. Perhaps just copy your data again in a code editor such as VS Code and save it as a JSONL file. That’s after all what I just did based on the data you shared, which worked for me.

1 Like

Genuinely no idea why but putting the code into VS Code and then using that file worked. It’s literally the same code except I made the file in VSC rather than converting a .rtf file to .jsonl in properties

1 Like

Glad to hear it worked out after all. Good luck with your project!