Issues with JSON assistant message in fine-tuning

Hi all. We have a training dataset that we have successfully fine-tuned with on Azure, however it fails on OpenAI platform.

Error: The job failed due to an invalid training file. Invalid file format. Line 1, message 3, key “content.str”: Input should be a valid string

The line:

{"messages": [{"role": "system", "content": "<role>You are a classification model designed to classify user requests' intent.</role> <task>Your task is to classify if the user's message includes an intent to call a system command, and if so, extract the required entities.</task> <context>The WhereIsUser intent is triggered when a user asks you where another user is. This intent includes one entity: User. **BEGINING EXAMPLE QUERY** 'Where is AdamMiltonBarker?' AdamMiltonBarker is a User entity. **END EXAMPLE QUERY**</context>"}, {"role": "user", "content": "Can you help me locate AdamMiltonBarker?"}, {"role": "assistant", "content": {"intent":"WhereIsUser","intent_action":"IsaTmsUsers|WhereIsUser","entities":[{"entity":"User","value":"AdamMiltonBarker"}],"thinking":"The user is requesting the location of AdamMiltonBarker, which matches the WhereIsUser intent."}}]}

This dataset works exactly as is on Azure, we have also tried wrapping the JSON in ", and it is cast as a string yet it fails no matter which way we modify the file. It is also valid JSON as is.

Any suggestions? TIA.

You can use this service for repair format text, uploading your archive and program shows you the fails:

It works for me

Hi, I mentioned in the first message that the JSON is valid, I have already checked it, equally the same dataset works correctly on Azure.

The link you provided does not help, it says it is invalid because it is JSONL, and fixes it by wrapping it in an array and adding commas, which is not valid JSONL.

The line that OpenAI says has an issue does not have an issue. I don’t think you read the original post nor the error message correctly, it is saying that the JSON output for the assistant response is not a string, it is cast to a string when creating the dataset.

The exact same dataset has just been successfully uploaded and processed on Azure:

The model is now fine-tuning on Azure, it appears there is a bug with OpenAI.

The issue likely arises from the way you currently use quotes in the assistant’s response. You are missing the regular double quotes at the beginning and end of content. Additionally, you should replace the double quotes inside the JSON with single quotes.

{"role": "assistant", "content": "{'intent':'WhereIsUser','intent_action':'IsaTmsUsers|WhereIsUser','entities':[{'entity':'User','value':'AdamMiltonBarker'}],'thinking':'The user is requesting the location of AdamMiltonBarker, which matches the WhereIsUser intent.'}"}

Hi I have mentioned, this is valid JSON, and the same dataset has been used to train o n Azure there is no reason for this dataset to fail. Your suggestion has already been tried as stated.

The error mentions it is not a valid string however the value is cast as a string when generating the dataset (string)$value

Using the below as the format for a training example, I had no problem getting a sample file which I uploaded via the OpenAI fine-tuning UI validated.

{"messages": [{"role": "system", "content": "You are a classification model designed to classify user requests' intent. Your task is to classify if the user's message includes an intent to call a system command, and if so, extract the required entities. The WhereIsUser intent is triggered when a user asks you where another user is. This intent includes one entity: User. BEGINING EXAMPLE QUERY 'Where is AdamMiltonBarker?' AdamMiltonBarker is a User entity. END EXAMPLE QUERY"}, {"role": "user", "content": "Can you help me locate AdamMiltonBarker?"},{"role": "assistant", "content": "{'intent':'WhereIsUser','intent_action':'IsaTmsUsers|WhereIsUser','entities':[{'entity':'User','value':'AdamMiltonBarker'}],'thinking':'The user is requesting the location of AdamMiltonBarker, which matches the WhereIsUser intent.'}"}]}

Good luck!

P.S.: While this may not be relevant to you, bear in mind that using Azure for fine-tuning bears significant extra cost due to the hourly hosting fee for each model which you do not incur when using OpenAI directly for fine-tuning.

Hi, we are level 4 Microsoft For Startup Founders with credit sponsorship from OpenAI through that program. We integrate multiple platforms into our platform.

I understand what you are saying about the data, but the fact is the value is a valid string and the JSON is valid, it should be accepted.

It is not about whether it is a valid JSON or not but whether it meets the requirements for fine-tuning training data as per OpenAI’s guidelines:

{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}

Source: https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset

If in doubt, OpenAI also offers a Python script to check the validity of your training data set: Data preparation and analysis for chat model fine-tuning | OpenAI Cookbook

1 Like

for me model was giving unpredictable replies. I was asking to read the data and format it in JSON. Don’t even mentioning that data was been read wrong,
messing up dates or whatsoever. Some times it was replying with text data unpredictably. The same doc could be interpreted as JSON one time, and text another.

1 Like

same issue is with me even i have tried your sample data which is given in your documentation still i am getting this issue

The job failed due to an invalid training file. Invalid file format. Line 1, message 3, key “content.str”: Input should be a valid string