Failing to fine-tune a model, invalid jsonl format no matter what

Hi,

I’m trying to fine-tune a model, but for some reason it always fails and tells me that my file does not appear to be in valid JSON format.

I have tried both .jsonl and .json, but both fail. I am running the following:

openai tools fine_tunes.prepare_data -f my_file.json

Here are a few examples that I tried. The only difference between them is formatting; hopefully that will be reflected here:

{
    "prompt": "Hi, I can't connect to the board. Can you please help? I'm in SS4. Thank you, Adam", 
    "completion": "Hello Adam, We will try to get someone to see you as quickly as possible, but in the mean time, please try the following: Make sure that the docking station is receiving power. You can tell by the small light visible on the docking station after you connect your device to it."
}

and

{
    "prompt": "Hi, I can't connect to the board. Can you please help? I'm in SS4. Thank you, Adam", 
    "completion": "Hello Adam, We will try to get someone to see you as quickly as possible, but in the mean time, please try the following: Make sure that the docking station is receiving power. You can tell by the small light visible on the docking station after you connect your device to it."
}

and

{"prompt": "Hi, I can't connect to the board. Can you please help? I'm in SS4. Thank you, Adam", "completion": "Hello Adam, We will try to get someone to see you as quickly as possible, but in the mean time, please try the following: Make sure that the docking station is receiving power. You can tell by the small light visible on the docking station after you connect your device to it."}

Any ideas what’s wrong with the file?

You must use .jsonl extension.

According to my “under construction” validator, two of your three JSONL files are not valid.



Also, the JSONL spec is for each entry to be on a single line, so that makes sense. Two of your attempts span four lines, so those lines will not validate.
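The single-line rule can be checked with a few lines of Ruby. This is only a minimal sketch, not the validator mentioned above: `jsonl_errors` is a hypothetical helper name, and skipping blank lines is an assumption that a stricter checker may not share.

```ruby
require "json"

# Returns [line_number, message] pairs for every line that is not a
# standalone JSON object. An empty array means every line parsed.
def jsonl_errors(text)
  errors = []
  text.each_line.with_index(1) do |line, number|
    stripped = line.strip
    next if stripped.empty? # assumption: blank lines are tolerated
    begin
      value = JSON.parse(stripped)
      errors << [number, "not a JSON object"] unless value.is_a?(Hash)
    rescue JSON::ParserError
      errors << [number, "does not parse as JSON on its own"]
    end
  end
  errors
end
```

Run against a pretty-printed object that spans four lines, every one of those lines is reported, because none of them is a complete JSON value by itself.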

Maybe try the validated (single line) JSONL file and use the correct .jsonl extension and see how it goes?
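If the data already exists as a pretty-printed JSON object, it can be collapsed to one line by parsing and re-serialising it. A small sketch, where `to_jsonl_line` is a hypothetical helper name:

```ruby
require "json"

# Re-serialise one pretty-printed JSON object as a single compact line.
def to_jsonl_line(pretty_json)
  JSON.generate(JSON.parse(pretty_json))
end
```

The result contains no literal newlines, so it can be written as one line of a .jsonl file.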

HTH

Thank you for the reply

Please see the following example of the file, this time saved as .jsonl and I made sure that it only spans a single line. The image also shows the output of the terminal:

Here is the prompt copy-pasted from the file as is:

{"prompt": "Hi, I can't connect to the board. Can you please help? I'm in SS4. Thank you, Adam", "completion": "Hello Adam, We will try to get someone to see you as quickly as possible, but in the mean time, please try the following: Make sure that the docking station is receiving power. You can tell by the small light visible on the docking station after you connect your device to it."}

I wonder if it perhaps dislikes my environment. I’m on an older version of Win 10, creating the file via VSCode and running the fine-tuning via Git Bash.
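One way to narrow down whether the environment matters is to look at the raw bytes of the file. The sketch below is only a diagnostic: `byte_report` is a hypothetical helper, and whether a byte-order mark, Windows line endings, or a non-UTF-8 encoding actually upsets the CLI is an assumption to test, not an established cause.

```ruby
# Report byte-level properties of a file's raw contents that can differ
# between editors and platforms.
def byte_report(bytes)
  raw = bytes.dup.force_encoding(Encoding::BINARY)
  {
    utf8_bom: raw.start_with?("\xEF\xBB\xBF".b),          # file begins with a UTF-8 BOM
    crlf_line_endings: raw.include?("\r\n".b),            # at least one Windows line ending
    valid_utf8: raw.dup.force_encoding(Encoding::UTF_8).valid_encoding?
  }
end
```

It could be called as `byte_report(File.binread("my_file.jsonl"))`, with the file name replaced by the real one.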


In my experience, it is best (fewer errors and headaches) to use the full path to the file in your File API call.

/the/full/path/to/your/file/data_train.jsonl

The API error messages are beta quality, hit or miss (mostly miss), so I discovered this “the hard way”.
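A small guard can rule out path problems before anything is sent. This is a sketch under the assumption that a clear local error is preferable to a vague remote one; `resolve_data_file` is a hypothetical helper name.

```ruby
# Expand a possibly relative path and confirm it names an existing file
# before handing it to an upload call or the CLI.
def resolve_data_file(path)
  full_path = File.expand_path(path)
  raise ArgumentError, "no such file: #{full_path}" unless File.file?(full_path)
  full_path
end
```

The returned string is the absolute path, so it can be passed along unchanged.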

Here is how I do it…as an example:

require 'json'

module Files
    # Configure the OpenAI gem from the environment and return a client.
    def self.get_client
        Ruby::OpenAI.configure do |config|
            config.access_token = ENV.fetch('OPENAI_API_KEY')
        end
        OpenAI::Client.new
    end

    # Upload a JSONL file (given by its full path) and return the file id
    # found in the API response.
    def self.upload(filename = "#{Rails.root}/app/assets/files/fine-tune.jsonl", purpose = 'fine-tune')
        client = get_client
        response = client.files.upload(
            parameters: {
                file: filename,
                purpose: purpose
            }
        )
        JSON.parse(response.body)["id"]
    end
end

Footnote

Please note @Arivald, even if your data passes JSONL validation, you must use a specific format in your text for fine-tuning. Your current JSONL text will pass JSONL validation, but it does not “pass” the API requirements (there is no validator for this, but I have one).

Most Developers Seem to Miss this Requirement

Reference:

Preparing Your Dataset

Here is an example of a fully “API Validated” entry. Note I use “PROMPT_SEPARATOR” and “STOPSTOP” in this example. You can choose whatever you like.
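As a rough sketch of that layout in Ruby: the prompt ends with the separator, and the completion starts with a space and ends with the stop marker. The two constants are the illustrative markers named above, and `fine_tune_line` is a hypothetical helper, not part of any library.

```ruby
require "json"

# Illustrative markers; any strings that do not occur in the data would do.
PROMPT_SEPARATOR = " PROMPT_SEPARATOR"
STOP_SEQUENCE = " STOPSTOP"

# Build one JSONL line: separator appended to the prompt, leading space
# and stop marker added to the completion.
def fine_tune_line(prompt, completion)
  JSON.generate(
    "prompt" => "#{prompt.strip}#{PROMPT_SEPARATOR}",
    "completion" => " #{completion.strip}#{STOP_SEQUENCE}"
  )
end
```

Each call returns one line, so writing the results to a file one per line gives a .jsonl file in this shape.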


Hi @ruby_coder, could you also look at my dataset? I was looking around to find out why my JSONL file is not being accepted as valid JSONL, and you seem to be answering all the questions.

OK. Please post here using Markdown triple backticks like this:

```
# your json data here
```

I will try to test for you in between sanding sessions, hand sanding my teak wood floor (almost finished after a year, yay!) today.

Coding is fun. Sanding a super hard teak-wood floor with a 5" random orbital sander, by hand over a year, that is not fun.

HTH

:slight_smile:

I replaced some part of the text with “…” because it is very long.

{"prompt": "You are a water expert and rewrite the product description to state what it is, what does it do and how it works. Keep it technical and factual, don’t use sales language or too many adjectives. Write in third person view.\nProduct name: SigaPlatform\nProduct description: SigaPlatform is an AI-driven predictive maintenance and cyber security platform, designed to protect critical industrial assets (pumps, valves, etc) at the operational technology (OT) level, ...  to equipment, people or the environment.\n\n###\n\n", "completion": "SigaPlatform is an AI-driven cyber security  ... and enabling full regulatory compliance.\n####"},
{"prompt": "You are a water expert and rewrite the product description to state what it is, what does it do and how it works. Keep it technical and factual, don’t use sales language or too many adjectives. Write in third person view.\nProduct name: Mobile Organic Biofilm (MOB) Process\nProduct description: The Mobile Organic Biofilm, or the MOBTM, is Nuvoda's proprietary ... time needed for retrofit.\n\n###\n\n", "completion": "The Mobile Organic Biofilm (MOB) process increases ... aerobic granular sludge (AGS).\n####"}

Do I have to put everything inside “”? Is there supposed to be a comma after every fine-tune entry?
Sorry, I’m a new programmer and was thrown into learning AI.

No. JSONL does not have commas between the hashes and there are no brackets required to designate an array.
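To make the difference concrete, here is a minimal sketch that turns an array of hashes into JSONL text: one object per line, with no separating commas and no enclosing brackets. `to_jsonl` is a hypothetical helper name.

```ruby
require "json"

# Serialise each record on its own line, each line ending with a newline.
def to_jsonl(records)
  records.map { |record| JSON.generate(record) + "\n" }.join
end
```

Calling `JSON.generate` on the whole array instead would produce a single JSON array, which is valid JSON but not JSONL.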

Checking now before I get back to sanding…

Hold on.


Hi @dc.vistro

Your JSONL data will not validate because you have a comma at the end of your line(s).

There are no commas at the end of a JSONL line.

JSONL validation is different than JSON validation.

Hope this helps

:slight_smile:

Note: if I remove the errant JSONL EOL comma, your data validates JSONL-wise; and if I add a space at the beginning of your completion, it also passes OpenAI fine-tuning validation:
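Those two fixes can be sketched in Ruby as follows. `repair_line` is a hypothetical helper; it assumes every line has a string "completion" key and that the only defect is a single trailing comma.

```ruby
require "json"

# Drop one trailing comma from a line and make sure the completion
# starts with a space. Returns the repaired line as compact JSON.
def repair_line(line)
  record = JSON.parse(line.strip.sub(/,\z/, ""))
  completion = record.fetch("completion")
  record["completion"] = " #{completion}" unless completion.start_with?(" ")
  JSON.generate(record)
end
```

A line that is already correct passes through unchanged apart from compact spacing.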


Thank you very much @ruby_coder. I will try your suggestion.

You are welcome, @dc.vistro

I will not be around for most of the rest of the day (heading to gym), so if you run into more issues, please review this post here in our community:

:slight_smile:

I’m using:

openai tools fine_tunes.prepare_data -f (file location)

I edited the JSONL file as you suggested and I’m still getting:

ERROR in read_any_format validator: Your file (file location) does not appear to be in valid JSONL format. Please ensure your file is formatted as a valid JSONL file.

Sorry, @dc.vistro, I deleted the openai CLI tools less than an hour after installing and testing it.

So I cannot help you WRT the CLI.

I am sure others can help you use the CLI much better than me!

:slight_smile: :slight_smile:

It’s alright. Thank you for your help.

You are welcome.

You will get a similar error as you have experienced if your code does not point to a valid file location.

The OpenAI error messages are often obscure and can be misleading due to the “beta” nature of the API release.

:slight_smile:

OBTW, the correct CLI syntax is:

openai tools fine_tunes.prepare_data -f <LOCAL_FILE>

So you should consider ensuring that the path to your <LOCAL_FILE> is correct.

See:

CLI data preparation tool

Sorry for bothering you, @ruby_coder. I am having trouble with an incorrect JSONL format. Here are my prompts (descriptions) and completions (classifications).

{"prompt":"生物教育研究应从客观存在的生物教育对象的事实和现象出发,分析和解决生物教育理论和实践中存在的问题。具体的问题来源有:教育教学的疑难、学校或学科发展、具体的教育教学场景、阅读交流。根据我们自身的情况,个人认为大家可以思考分享在微格课中遇到的教学难点、或在实际教学场景中的一些启发。这样选出的课题比较小,便于我们实施,可行性高;且对我们的微格课训练有帮助,有价值性;且关于微格课的文献较少,选题有创新性;研究生物教学难点也是促进生物教育理论与实践发展的重要一环,内容丰富,从中找出有创新性的选题也相对方便简单。 PROMPT_SEPARATOR", "completion":" C. STOPSTOP"}
{"prompt":"查阅资料,小课题是从教情、学情、校情出发,由教师个人或科组教师共同确立、研究的直接服务于教育教学实践的应用性课题。它比大课题更有实践性与应用性,更方便开展。另一方面,由于小课题的应用性,它的研究问题是以教学问题为中心的学校涉及学生素质全面发展的所有问题。解决小课题,有利于整个校园学习氛围的提升。所以“如何做好小课题研究”也是值得探讨的。参考文献:吴全华,教师小课题研究的特点与基本条件,广东教育:综合版, 2006 (7) :35-38. PROMPT_SEPARATOR", "completion":" A. STOPSTOP"}
{"prompt":"信息化环境下教学使用的交互式智能平板是否会在某一天替代黑板,两者如何合理使用达到更好地提高学生学习效率的目的?可以交流一下“信息化教学和传统教学如何结合利用”。 PROMPT_SEPARATOR", "completion": " I. STOPSTOP"}

I have read the solution you suggested above and tried several times, but it still tells me:

ERROR in read_any_format validator: Your file `data.jsonl` does not appear to be in valid JSONL format. Please ensure your file is formatted as a valid JSONL file.

I was wondering whether it is a formatting problem or whether I am in the same situation as @dc.vistro.