Hi all. We have a training dataset that we have successfully used for fine-tuning on Azure; however, it fails on the OpenAI platform.
Error: The job failed due to an invalid training file. Invalid file format. Line 1, message 3, key “content.str”: Input should be a valid string
The line:
{"messages": [{"role": "system", "content": "<role>You are a classification model designed to classify user requests' intent.</role> <task>Your task is to classify if the user's message includes an intent to call a system command, and if so, extract the required entities.</task> <context>The WhereIsUser intent is triggered when a user asks you where another user is. This intent includes one entity: User. **BEGINING EXAMPLE QUERY** 'Where is AdamMiltonBarker?' AdamMiltonBarker is a User entity. **END EXAMPLE QUERY**</context>"}, {"role": "user", "content": "Can you help me locate AdamMiltonBarker?"}, {"role": "assistant", "content": {"intent":"WhereIsUser","intent_action":"IsaTmsUsers|WhereIsUser","entities":[{"entity":"User","value":"AdamMiltonBarker"}],"thinking":"The user is requesting the location of AdamMiltonBarker, which matches the WhereIsUser intent."}}]}
This dataset works exactly as is on Azure. We have also tried wrapping the JSON in double quotes, and the value is cast as a string, yet it fails no matter how we modify the file. It is also valid JSON as is.
The link you provided does not help: it says the file is invalid because it is JSONL, and "fixes" it by wrapping it in an array and adding commas, which is not valid JSONL.
The line that OpenAI says has an issue does not have an issue. I don’t think you read the original post or the error message correctly: it is saying that the JSON output for the assistant response is not a string, but it is cast to a string when creating the dataset.
The issue likely arises from the way you currently use quotes in the assistant’s response. You are missing the regular double quotes at the beginning and end of content. Additionally, you should replace the double quotes inside the JSON with single quotes.
{"role": "assistant", "content": "{'intent':'WhereIsUser','intent_action':'IsaTmsUsers|WhereIsUser','entities':[{'entity':'User','value':'AdamMiltonBarker'}],'thinking':'The user is requesting the location of AdamMiltonBarker, which matches the WhereIsUser intent.'}"}
Hi, as I have mentioned, this is valid JSON, and the same dataset has been used to train on Azure; there is no reason for this dataset to fail. Your suggestion has already been tried, as stated.
The error says it is not a valid string; however, the value is cast as a string when generating the dataset: (string)$value
Using the format below for a training example, I had no problem getting a sample file validated when I uploaded it via the OpenAI fine-tuning UI.
{"messages": [{"role": "system", "content": "You are a classification model designed to classify user requests' intent. Your task is to classify if the user's message includes an intent to call a system command, and if so, extract the required entities. The WhereIsUser intent is triggered when a user asks you where another user is. This intent includes one entity: User. BEGINING EXAMPLE QUERY 'Where is AdamMiltonBarker?' AdamMiltonBarker is a User entity. END EXAMPLE QUERY"}, {"role": "user", "content": "Can you help me locate AdamMiltonBarker?"},{"role": "assistant", "content": "{'intent':'WhereIsUser','intent_action':'IsaTmsUsers|WhereIsUser','entities':[{'entity':'User','value':'AdamMiltonBarker'}],'thinking':'The user is requesting the location of AdamMiltonBarker, which matches the WhereIsUser intent.'}"}]}
Good luck!
P.S.: While this may not be relevant to you, bear in mind that using Azure for fine-tuning carries significant extra cost due to the hourly hosting fee for each model, which you do not incur when fine-tuning directly with OpenAI.
Hi, we are level 4 Microsoft For Startup Founders with credit sponsorship from OpenAI through that program. We integrate multiple platforms into our platform.
I understand what you are saying about the data, but the fact is that the value is a valid string and the JSON is valid; it should be accepted.
It is not about whether it is valid JSON but whether it meets the requirements for fine-tuning training data as per OpenAI’s guidelines (a quick local check is sketched after the examples below):
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}
For me, the model was giving unpredictable replies. I was asking it to read the data and format it as JSON. Not to mention that the data was being read wrong, with dates and other details getting mixed up. Sometimes it replied with plain text unpredictably: the same document could be interpreted as JSON one time and as text another.