Unicode in JSONL dataset for fine-tuning

Hi,
I made a dataset for fine-tuning GPT-3.5 with Python.
I have Unicode text, and Python's json.dumps converts it into escape sequences like \u0623 for each character.
My question is: does OpenAI's fine-tuning system process text formatted this way correctly?
Because my dataset is very small for now, I can't tell whether the model's answers come from its own knowledge or from the examples I gave it.
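
To illustrate, json.dumps escapes every non-ASCII character by default (the sample word here is just a stand-in for my actual data):

import json

# ensure_ascii defaults to True, so non-ASCII characters become \uXXXX escapes
print(json.dumps({"content": "درخواست"}))
# prints: {"content": "\u062f\u0631\u062e\u0648\u0627\u0633\u062a"}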

I see you used json.dumps, but for fine-tuning the required file format is JSONL (one JSON object per line). For example:

{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}

Each line should start with {"messages" and end with }.
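
For example, here is a minimal sketch of writing such a file from Python (the filename and variable names are just placeholders):

import json

examples = [
    {"messages": [
        {"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."},
        {"role": "user", "content": "What's the capital of France?"},
        {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."},
    ]},
]

# one JSON object per line is what makes the file JSONL
with open("finetune_data.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")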
If there are still escape sequences, you can decode them like this ('unicode-escape' is a bytes-to-str codec, so a str has to be encoded first):

# interpret the \uXXXX escape sequences back into the original characters
jsonl_data_decode = jsonl_data_unicode.encode('ascii').decode('unicode-escape')

I’m not sure if the fine-tuning will work well with the escape sequences, but it’s safer to decode them.
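
Alternatively, if you generate the file yourself, you can avoid the escapes from the start by passing ensure_ascii=False to json.dumps (a sketch under that assumption; the filename is a placeholder):

import json

example = {"messages": [{"role": "user", "content": "درخواست"}]}

# ensure_ascii=False keeps non-ASCII characters as UTF-8 text instead of \uXXXX escapes
with open("finetune_data.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(example, ensure_ascii=False) + "\n")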

You can also use the OpenAI Cookbook's format-validation checks to confirm your data is valid for fine-tuning: OpenAI Cookbook Format Validation


I have used the JSONL format, but the text in my "content" fields looks like this: "content": "\u062f\u0631\u062e\u0648\u0627".
The OpenAI fine-tuning dashboard accepted my file, but my question is: does OpenAI process such escaped text correctly?

It depends on what you mean by the correct way.

If you want the fine-tuned model to output escaped text like "\u062f\u0631\u062e\u0648\u0627", then it might work.
If you want the model to output the normal decoded text, then I don't think it will.
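
Note that at the pure JSON level the two spellings are equivalent once parsed; whether OpenAI's training pipeline parses them that way before training is the open question here. A quick check, using the escaped string from above:

import json

escaped = '{"content": "\\u062f\\u0631\\u062e\\u0648\\u0627"}'
plain = '{"content": "درخوا"}'

# both parse to the same Python string; the escapes are purely a serialization detail
assert json.loads(escaped)["content"] == json.loads(plain)["content"]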