How to structure fine tuned data

hoggarth · September 1, 2023, 10:44pm

Hi im trying to fine tune but having trouble converting to JSONL. This is the format of my data: {
“messages”: [
{
“message”: [
{
“role”: “user”,
“content”: “Hello.”
},
{
“role”: “assistant”,
“content”: “Hey.”
},
]
},
{
“message”: [
{
“role”: “user”,
“content”: “Alex.”
},
{
“role”: “assistant”,
“content”: “Hey, Alex, this is Dana, I’m calling you from disco. How are you doing?”
},
{
“role”: “user”,
“content”: “I’m good, how are you?”
},
]
} It contains sales calls transcipts and lots of them in all in this one file. Is the structure for this right or does anyone know should all the transcripts of sales calls be in separate files. this is the error message I get for context: {
“error”: {
“message”: “Expected file to have JSONL format, where every line is a valid JSON dictionary. Line 1 is not a dictionary (HINT: line starts with: "{…").”,
“type”: “invalid_request_error”,
“param”: null,
“code”: null
}
}

hoggarth · September 1, 2023, 10:45pm

I removed some convo in between so ignore if their is a bracket missing here and there

PaulBellow · September 1, 2023, 11:02pm

If you’re trying to fine-tune the older models, you want…

{“prompt”: “”, “completion”: “”}

If you’re fine-tuning one of the new 3.5 models, you want…

{“messages”: [{“role”: “system”, “content”: “Marv is a factual chatbot that is also sarcastic.”}, {“role”: “user”, “content”: “What’s the capital of France?”}, {“role”: “assistant”, “content”: “Paris, as if everyone doesn’t know that already.”}]}

I haven’t validated your unformatted code dump, but all the brackets and commas matter. Might want to try to use JSON validator on your dataset.

Good luck.

ETA: Fixed the order of examples!

_j · September 2, 2023, 7:02am

Carriage returns are reported to be not permitted within the jsons that make up a conversation example. Only single lines of all messages, like the sarcastic bot example.

Nobody has documented why this chatml format would be required on completion models, or explored the ability to do it the old way.

Foxalabs · September 2, 2023, 7:07am

I think you can include the representations, i.e. \n and \r\n, just not the chr(13) and chr(10) ascii.

_j · September 2, 2023, 7:11am

Yes, meaning you can have multi_line ai and user message contents, but not pretty readable json of an example conversation like the fine tune announcement page shows.

hoggarth · September 2, 2023, 9:19am

Sorry im still confused to how I would structure multiple conversations in one JSON file is my format valid? If not how would I structure having these multiple convos in one file with a split to show its moving onto another convo? Do they need to go in separate files

PaulBellow · September 2, 2023, 9:33am

You want one json object on each line.

Your problem if that you’re using literal line-breaks in your json which is valid normally but not for fine-tuning. As @Foxalabs suggested, you should replace them with \n or \r\n…

hoggarth · September 2, 2023, 11:20am

so more like this? So more like this:
{“messages”: [{“role”: “user”,“content”: “Hello.”},{“role”: “assistant”,“content”: “Hey.”}],
[{“role”: “user”,“content”: “Alex.”},{“role”: “assistant”, “content”: “Hey, Alex, this is Dana, I’m calling you from disco. How are you doing?”},{“role”: “user”,“content”: “I’m good, how are you?”}]}?

Topic		Replies	Views
Fine tuning error: The job failed due to an invalid training file. Unexpected file format, expected either prompt/completion pairs or chat messages API gpt-35-turbo , api , json , fine-tuning-problems , response_format	15	251	April 25, 2024
An error occurred while processing file 'file-name' and it cannot be used for fine-tuning. Details may be available in the file's status_details API fine-tuning , fine-tuning-problems	6	1405	September 18, 2023
Help needed regarding Fine tuning API	3	147	April 6, 2024
Can someone help me (with fine-tuning) API fine-tuning , api , help-needed	13	2017	April 6, 2024
Fine tune on multi-turn conversations API fine-tuning , fine-tuning-problems	0	625	October 11, 2023

How to structure fine tuned data

Related Topics