Unicode in JSONL dataset for fine-tuning

Hi,
I made a dataset for fine-tuning GPT-3.5 with Python.
I have Unicode text, and Python's json.dumps converts it into escape sequences like \u0623 for each character.
My question is: does OpenAI's fine-tuning system process text formatted this way correctly?
Because my dataset is very small for now, I can't tell whether the model's answers come from its own knowledge or from the examples I gave it.
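
To illustrate, json.dumps escapes every non-ASCII character by default (the sample word here is just a stand-in for my actual data):

import json

# ensure_ascii defaults to True, so non-ASCII characters become \uXXXX escapes
print(json.dumps({"content": "درخواست"}))
# prints: {"content": "\u062f\u0631\u062e\u0648\u0627\u0633\u062a"}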

I see you used json.dumps, but for fine-tuning the required file format is JSONL (one JSON object per line). For example:

{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}

Each line should start with {"messages" and end with }.
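
For example, here is a minimal sketch of writing such a file from Python (the filename and variable names are just placeholders):

import json

examples = [
    {"messages": [
        {"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."},
        {"role": "user", "content": "What's the capital of France?"},
        {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."},
    ]},
]

# one JSON object per line is what makes the file JSONL
with open("finetune_data.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")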
If there are still escape sequences, you can decode them like this ('unicode-escape' is a bytes-to-str codec, so a str has to be encoded first):

# interpret the \uXXXX escape sequences back into the original characters
jsonl_data_decode = jsonl_data_unicode.encode('ascii').decode('unicode-escape')

I’m not sure if the fine-tuning will work well with the escape sequences, but it’s safer to decode them.
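
Alternatively, if you generate the file yourself, you can avoid the escapes from the start by passing ensure_ascii=False to json.dumps (a sketch under that assumption; the filename is a placeholder):

import json

example = {"messages": [{"role": "user", "content": "درخواست"}]}

# ensure_ascii=False keeps non-ASCII characters as UTF-8 text instead of \uXXXX escapes
with open("finetune_data.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(example, ensure_ascii=False) + "\n")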

You can also use the OpenAI Cookbook's format-validation checks to confirm your data is valid for fine-tuning: OpenAI Cookbook Format Validation


I have used the JSONL format, but the text in my "content" fields looks like this: "content": "\u062f\u0631\u062e\u0648\u0627".
The OpenAI fine-tuning dashboard accepted my file, but my question is: does OpenAI process such escaped text correctly?

It depends on what you mean by the correct way.

If you want the fine-tuned model to output escaped text like "\u062f\u0631\u062e\u0648\u0627", then it might work.
If you want the model to output the normal decoded text, then I don't think it will.
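
Note that at the pure JSON level the two spellings are equivalent once parsed; whether OpenAI's training pipeline parses them that way before training is the open question here. A quick check, using the escaped string from above:

import json

escaped = '{"content": "\\u062f\\u0631\\u062e\\u0648\\u0627"}'
plain = '{"content": "درخوا"}'

# both parse to the same Python string; the escapes are purely a serialization detail
assert json.loads(escaped)["content"] == json.loads(plain)["content"]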