Hello,
I'm attempting to fine-tune some models to have some fun with various tasks. I'm a newbie here, so as you'd expect, I ran into some surprising edges that took me a while to figure out. I thought others (and Google) might benefit if I shared.
Following the Fine-tuning guide, I got to the `openai tools fine_tunes.prepare_data` command. I quickly threw together a sample dataset to test it out and hit the dreaded `ERROR in read_any_format validator` error:
❯ openai tools fine_tunes.prepare_data -f ./canadian-weather.jsonl
Analyzing...
ERROR in read_any_format validator: Your file `./canadian-weather.jsonl` does not appear to be in valid JSONL format. Please ensure your file is formatted as a valid JSONL file.
Aborting...
Clearly I have a JSONL formatting issue and this will be trivial to fix, right? Yup! That's what I said too, 90 minutes ago.
Here's what the contents of my `canadian-weather.jsonl` file looked like:
{"prompt": "How is the weather today? PROMPT_SEPARATOR", "completion": " I stuck my head out the window and froze my ears off. It's -20 degrees celcius, or -45 with the windchill! What did you expect in a Canadian winter, eh? STOPSTOP"}
I followed all the advice I could find on this forum, including this excellent post from @ruby_coder. I ran the file through every JSONL validator I could get my hands on, triple-checked that the encoding was UTF-8, tried full file paths, and kept trimming the file down further and further until there was nothing left to remove.
None of this resolved the validation issue.
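For anyone doing the same dance, this is roughly the kind of check the online validators were doing, as a minimal Python sketch (the filename is just my example). It reported nothing wrong, which should have been a clue:

```python
import json
import sys

# Minimal local JSONL check: parse each line as JSON and confirm the
# expected "prompt"/"completion" keys are present. The path defaults to
# my example file; pass a different one as the first argument.
path = sys.argv[1] if len(sys.argv) > 1 else "./canadian-weather.jsonl"

problems = 0
with open(path, encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            print(f"line {lineno}: not valid JSON ({err})")
            problems += 1
            continue
        if not isinstance(record, dict) or {"prompt", "completion"} - record.keys():
            print(f"line {lineno}: missing 'prompt' or 'completion' key")
            problems += 1

print("looks fine" if problems == 0 else f"{problems} problem line(s)")
```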
Did you spot the issue yet?
If you guessed the file was too short, you are an excellent debugger! For those who just want to enjoy the show, I tried adding a second line to the file as follows:
{"prompt": "How is the weather today? PROMPT_STOP", "completion": " I stuck my head out the window and froze my ears off. It's -20 degrees celcius, or -45 with the windchill! What did you expect in a Canadian winter, eh?"}
{"prompt": "How's today's weather looking? PROMPT_STOP", "completion": " Whew, I wish you waited until it was warmer before asking. I bundled up in my mittens and touque and darm near didn't make it back alive. It's -25 degrees celcius now, or -50 with the windchill! Even worse than yesterday. We should visit California today, eh?"}
And now `openai tools fine_tunes.prepare_data -f ./canadian-weather-multiline.jsonl` was happy and provided some good feedback on the file.
❯ openai tools fine_tunes.prepare_data -f ./canadian-weather-multiline.jsonl
Analyzing...
- Your file contains 2 prompt-completion pairs. In general, we recommend having at least a few hundred examples. We've found that performance tends to linearly increase for every doubling of the number of examples
- More than a third of your `prompt` column/key is uppercase. Uppercase prompts tends to perform worse than a mixture of case encountered in normal language. We recommend to lower case the data if that makes sense in your domain. See https://beta.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more details
- All prompts end with suffix `? PROMPT_STOP`. This suffix seems very long. Consider replacing with a shorter suffix, such as ` ->`
- All prompts start with prefix `How`
- All completions end with suffix `, eh?`
Based on the analysis we will perform the following actions:
- [Recommended] Lowercase all your data in column/key `prompt` [Y/n]:
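I kept the capitalization (lowercasing didn't make sense for these prompts) but took the suffix suggestion. The tool offers to apply its suggestions for you, but if you'd rather edit the file in code, a rough Python sketch like this does the suffix swap (filenames are just my examples):

```python
import json

# Swap the long "? PROMPT_STOP" marker for the short "? ->" suffix the
# tool suggests. Filenames below are just my examples.
src = "./canadian-weather-multiline.jsonl"
dst = "./canadian-weather-multiline-arrow.jsonl"

with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
    for line in fin:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        record["prompt"] = record["prompt"].replace("? PROMPT_STOP", "? ->")
        fout.write(json.dumps(record) + "\n")
```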
After making those changes I ended up with this dataset file:
{"prompt": "How is the weather today? ->", "completion": " I stuck my head out the window and froze my ears off. It's -20 degrees celcius, or -45 with the windchill! What did you expect in a Canadian winter, eh?"}
{"prompt": "How's today's weather looking? ->", "completion": " Whew, I wish you waited until it was warmer before asking. I bundled up in my mittens and touque and darm near didn't make it back alive. It's -25 degrees celcius now, or -50 with the windchill! Even worse than yesterday. We should visit California today, eh friend?"}
This passed all the checks, so I've proceeded to the `fine_tunes.create` step for my Bob & Doug McKenzie weather service.
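For the record (and for future me), if I'm reading the guide right, that step is just `openai api fine_tunes.create -t <your prepared .jsonl file> -m <base model>`, so the hard part really was getting the dataset file past the validator.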
Hopefully this saves others some trouble and gives the validation team some additional perspective. Starting with a single prompt seemed like a sensible way to begin before scaling things up. It wasn't a bad idea, but for now at least, start with two entries.