Hi all. We have a training dataset that we have successfully used for fine-tuning on Azure; however, it fails on the OpenAI platform.
Error: The job failed due to an invalid training file. Invalid file format. Line 1, message 3, key “content.str”: Input should be a valid string
The line:
{"messages": [{"role": "system", "content": "<role>You are a classification model designed to classify user requests' intent.</role> <task>Your task is to classify if the user's message includes an intent to call a system command, and if so, extract the required entities.</task> <context>The WhereIsUser intent is triggered when a user asks you where another user is. This intent includes one entity: User. **BEGINING EXAMPLE QUERY** 'Where is AdamMiltonBarker?' AdamMiltonBarker is a User entity. **END EXAMPLE QUERY**</context>"}, {"role": "user", "content": "Can you help me locate AdamMiltonBarker?"}, {"role": "assistant", "content": {"intent":"WhereIsUser","intent_action":"IsaTmsUsers|WhereIsUser","entities":[{"entity":"User","value":"AdamMiltonBarker"}],"thinking":"The user is requesting the location of AdamMiltonBarker, which matches the WhereIsUser intent."}}]}
This dataset works exactly as is on Azure. We have also tried wrapping the JSON in double quotes, and the value is cast as a string, yet it fails no matter how we modify the file. It is also valid JSON as is.
The link you provided does not help: it says the file is invalid because it is JSONL, and "fixes" it by wrapping it in an array and adding commas, which is not valid JSONL.
The line that OpenAI says has an issue does not have an issue. I don’t think you read the original post or the error message correctly: it is saying that the JSON output for the assistant response is not a string, but it is cast to a string when creating the dataset.
The issue likely arises from the way you currently use quotes in the assistant’s response. You are missing the regular double quotes at the beginning and end of content. Additionally, you should replace the double quotes inside the JSON with single quotes.
{"role": "assistant", "content": "{'intent':'WhereIsUser','intent_action':'IsaTmsUsers|WhereIsUser','entities':[{'entity':'User','value':'AdamMiltonBarker'}],'thinking':'The user is requesting the location of AdamMiltonBarker, which matches the WhereIsUser intent.'}"}
Hi, as I have mentioned, this is valid JSON, and the same dataset has been used to train on Azure; there is no reason for this dataset to fail. Your suggestion has already been tried, as stated.
The error says it is not a valid string; however, the value is cast as a string when generating the dataset: (string)$value
Using the format below for a training example, I had no problem getting a sample file validated when I uploaded it via the OpenAI fine-tuning UI.
{"messages": [{"role": "system", "content": "You are a classification model designed to classify user requests' intent. Your task is to classify if the user's message includes an intent to call a system command, and if so, extract the required entities. The WhereIsUser intent is triggered when a user asks you where another user is. This intent includes one entity: User. BEGINING EXAMPLE QUERY 'Where is AdamMiltonBarker?' AdamMiltonBarker is a User entity. END EXAMPLE QUERY"}, {"role": "user", "content": "Can you help me locate AdamMiltonBarker?"},{"role": "assistant", "content": "{'intent':'WhereIsUser','intent_action':'IsaTmsUsers|WhereIsUser','entities':[{'entity':'User','value':'AdamMiltonBarker'}],'thinking':'The user is requesting the location of AdamMiltonBarker, which matches the WhereIsUser intent.'}"}]}
Good luck!
P.S.: While this may not be relevant to you, bear in mind that using Azure for fine-tuning carries significant extra cost due to the hourly hosting fee for each model, which you do not incur when fine-tuning directly with OpenAI.
Hi, we are level 4 Microsoft For Startup Founders with credit sponsorship from OpenAI through that program. We integrate multiple platforms into our platform.
I understand what you are saying about the data, but the fact is that the value is a valid string and the JSON is valid; it should be accepted.
It is not about whether it is valid JSON but whether it meets the requirements for fine-tuning training data as per OpenAI’s guidelines (a quick local check is sketched after the examples below):
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}
For me, the model was giving unpredictable replies. I was asking it to read the data and format it as JSON. Not to mention that the data was being read wrong, with dates and other details getting mixed up. Sometimes it replied with plain text unpredictably: the same document could be interpreted as JSON one time and as text another.