How to structure fine tuned data

Hi im trying to fine tune but having trouble converting to JSONL. This is the format of my data: {
“messages”: [
{
“message”: [
{
“role”: “user”,
“content”: “Hello.”
},
{
“role”: “assistant”,
“content”: “Hey.”
},
]
},
{
“message”: [
{
“role”: “user”,
“content”: “Alex.”
},
{
“role”: “assistant”,
“content”: “Hey, Alex, this is Dana, I’m calling you from disco. How are you doing?”
},
{
“role”: “user”,
“content”: “I’m good, how are you?”
},
]
} It contains sales calls transcipts and lots of them in all in this one file. Is the structure for this right or does anyone know should all the transcripts of sales calls be in separate files. this is the error message I get for context: {
“error”: {
“message”: “Expected file to have JSONL format, where every line is a valid JSON dictionary. Line 1 is not a dictionary (HINT: line starts with: "{…").”,
“type”: “invalid_request_error”,
“param”: null,
“code”: null
}
}

I removed some convo in between so ignore if their is a bracket missing here and there

If you’re trying to fine-tune the older models, you want…

{“prompt”: “”, “completion”: “”}

If you’re fine-tuning one of the new 3.5 models, you want…

{“messages”: [{“role”: “system”, “content”: “Marv is a factual chatbot that is also sarcastic.”}, {“role”: “user”, “content”: “What’s the capital of France?”}, {“role”: “assistant”, “content”: “Paris, as if everyone doesn’t know that already.”}]}

I haven’t validated your unformatted code dump, but all the brackets and commas matter. Might want to try to use JSON validator on your dataset.

Good luck.

ETA: Fixed the order of examples!

1 Like

Carriage returns are reported to be not permitted within the jsons that make up a conversation example. Only single lines of all messages, like the sarcastic bot example.

Nobody has documented why this chatml format would be required on completion models, or explored the ability to do it the old way.

I think you can include the representations, i.e. \n and \r\n, just not the chr(13) and chr(10) ascii.

1 Like

Yes, meaning you can have multi_line ai and user message contents, but not pretty readable json of an example conversation like the fine tune announcement page shows.

1 Like

Sorry im still confused to how I would structure multiple conversations in one JSON file is my format valid? If not how would I structure having these multiple convos in one file with a split to show its moving onto another convo? Do they need to go in separate files

You want one json object on each line.

Your problem if that you’re using literal line-breaks in your json which is valid normally but not for fine-tuning. As @Foxalabs suggested, you should replace them with \n or \r\n…

so more like this? So more like this:
{“messages”: [{“role”: “user”,“content”: “Hello.”},{“role”: “assistant”,“content”: “Hey.”}],
[{“role”: “user”,“content”: “Alex.”},{“role”: “assistant”, “content”: “Hey, Alex, this is Dana, I’m calling you from disco. How are you doing?”},{“role”: “user”,“content”: “I’m good, how are you?”}]}?