Generating files for fine-tuning

Hi there!

I am currently trying to programmatically generate a dataset file (.jsonl) that I want to use for fine-tuning a GPT-3 model.

The output currently being generated looks like this:

[{"prompt":"Some input text", "completion":"Some completion text"}, {"prompt":"Another input text", "completion":"Another completion text"}]

In the documentation, I see that none of the examples have the surrounding brackets [].

E.g.:

{"prompt":"Company: BHFF insurance\nProduct: allround insurance\nAd:One stop shop for all your insurance needs!\nSupported:", "completion":" yes"}
{"prompt":"Company: Loft conversion specialists\nProduct: -\nAd:Straight teeth in weeks!\nSupported:", "completion":" no"}

Now my question is: Do I need to remove the surrounding array brackets ([]) from my dataset.jsonl file before using it to fine-tune?

I just dump them one dict at a time. You can see one of my scripts here: AutoMuse2/format_jsonl.py in the daveshap/AutoMuse2 repository on GitHub.
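A minimal Python sketch of that "one dict at a time" approach. This is not the actual AutoMuse2 script: `write_jsonl` and the sample records are made up for illustration, and an in-memory buffer stands in for the real output file.

```python
import io
import json


def write_jsonl(records, fh):
    """Write each record as its own JSON object on its own line (no enclosing array)."""
    for record in records:
        # json.dumps escapes newlines inside strings, so each record stays on one line.
        fh.write(json.dumps(record) + "\n")


# Hypothetical sample records, taken from the question above.
records = [
    {"prompt": "Some input text", "completion": "Some completion text"},
    {"prompt": "Another input text", "completion": "Another completion text"},
]

# StringIO is a stand-in for open("dataset.jsonl", "w", encoding="utf-8").
buffer = io.StringIO()
write_jsonl(records, buffer)
print(buffer.getvalue(), end="")
```

The same idea should carry over to PHP: encode each record separately and append a newline, instead of encoding the whole array in one call.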

Thanks, Dave! I’m currently generating the training data using PHP.

It’s not a problem for me to structure the output the way the documentation shows. I just wanted to hear whether it has any impact on output/accuracy.

Yes, you need to remove the []. As far as I know it will break otherwise, since a .jsonl file is expected to hold one JSON object per line rather than a single array.
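If a file has already been generated as a JSON array, one way to fix it is to parse the array and re-emit one object per line. A rough sketch in Python, assuming the whole array fits in memory; `array_to_jsonl` is a hypothetical helper name, not part of any library:

```python
import json


def array_to_jsonl(array_text):
    """Convert the text of a top-level JSON array into JSONL text."""
    records = json.loads(array_text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # One JSON object per line, without the surrounding [] or separating commas.
    return "".join(json.dumps(record) + "\n" for record in records)


# The array-style output from the original question.
array_text = (
    '[{"prompt":"Some input text", "completion":"Some completion text"}, '
    '{"prompt":"Another input text", "completion":"Another completion text"}]'
)

jsonl_text = array_to_jsonl(array_text)
print(jsonl_text, end="")
```

This only changes the file layout. The prompts and completions themselves are left untouched.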
