I am trying to fine-tune Curie with a custom dataset. I prepared the dataset successfully (as can be seen in the image), and it creates the JSONL file. However, when I run the fine-tuning, I get the error shown in the attached image.
Would be grateful if someone could help me solve it.
Hi @boris, thank you very much for replying. You’re right, I use Windows, but I am using Git Bash (as I was facing some issues installing the CLI on Windows). Let me try with the full path.
More on this, @prafull.sharma and @boris: I decided to look into why the byte 0x9d wasn’t mapping to anything (undefined) and found this:
On Windows, the default encoding is cp1252, so when open() is called without an explicit encoding it uses cp1252 instead of UTF-8, but the file is most likely encoded in UTF-8:
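Here’s a minimal way to reproduce that behaviour (just a sketch, assuming a Windows machine where cp1252 is the locale default; the file name is only an example):

import locale

# On Windows, open() without an explicit encoding falls back to the locale's
# preferred encoding, which is typically cp1252.
print(locale.getpreferredencoding(False))  # e.g. 'cp1252' on Windows

# Write a small JSONL file as UTF-8 containing a right curly quote (U+201D)...
with open("sample.jsonl", "w", encoding="utf-8") as f:
    f.write('{"prompt": "test\u201d", "completion": " ok"}\n')

# ...then read it back without specifying the encoding, the way cli.py does.
# On Windows this raises:
#   UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d ...
with open("sample.jsonl") as f:
    print(f.read())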
It looks like the cli.py file in the OpenAI module calls open() 3 times without passing in the ‘encoding’ parameter:
To fix this, it looks like cli.py needs to change those open() calls to pass encoding="utf-8" as one of the parameters:
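In other words, each call would go from something like the first line below to the second (just a sketch; the exact lines and line numbers are listed later in this thread):

# Before: no encoding argument, so Windows falls back to cp1252
file=open(args.file),
# After: force UTF-8 so bytes like 0x9d decode correctly
file=open(args.file, encoding="utf-8"),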
I don’t think the JSONL file was even created due to this error, which is why @prafull.sharma wasn’t able to retrieve it. After looking through the traceback provided, it became clear that this is most likely the reason. Passing encoding="utf-8" to the open() function, so that Python 3 on Windows knows to use that encoding, should fix this error!
If anyone is getting this error and really wants to fix it right now, a temporary solution is to modify the cli.py file in the OpenAI Python 3 site-packages folder to include the encoding="utf-8" parameter.
I think I may try to fine-tune a model tonight on Windows just to see if I can replicate the error and confirm that adding the encoding="utf-8" parameter does the trick.
Actually, I’m performing the fine-tuning now with a test JSONL file I made. I’ll be using Git Bash like @prafull.sharma, since I can’t use CMD either (it tries to make me open the file with another program).
Glad to help! I was able to recreate the error a couple of posts down. Further research shows that the special right quote character (”) contains the byte 0x9d when encoded in UTF-8.
>>> '“”'.encode()
b'\xe2\x80\x9c\xe2\x80\x9d'  # the 0x9d byte comes from the right quote ”
>>> b'\xe2\x80\x9c\xe2\x80\x9d'.decode("cp1252")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\iadmin\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 5: character maps to <undefined>
>>> b'\xe2\x80\x9c\xe2\x80\x9d'.decode("UTF-8")
'“”'
I got the same error that @prafull.sharma did, so the special right quote is most likely the trouble character in this scenario. Specifying “UTF-8” when opening files should fix the error!
@prafull.sharma, I was able to create the JSONL file too, and I recreated the error you got by having the special right quote character in the JSONL file. The default file encoding on Ubuntu is UTF-8, while the Windows default is Windows-1252 (“cp1252” in Python), which is why you were able to fine-tune successfully on Ubuntu with your prepared JSONL file.
My test-finetune2_prepared.jsonl file looks like this exactly:
{"prompt":"1Finetune Windows Error Test with special quotes that contain byte 0x9d”””","completion":" 1Testing... Please ignore”””"}
{"prompt":"2Finetune Windows Error Test”””","completion":" 2Testing... Please ignore”””"}
{"prompt":"3Finetune Windows Error Test”””","completion":" 3Testing... Please ignore”””"}
Success!
I modified the cli.py file to include the encoding="utf-8" parameter within the open() function, and I was able to get past the error with the same JSONL file!
I have had this type of error before, related to string characters not being properly escaped. I run a Python script to properly escape the strings before I convert to the JSONL format for GPT-3 fine-tuning.
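A minimal sketch of that kind of script (illustrative only, not the exact script described above; json.dumps handles the escaping, and writing with encoding="utf-8" sidesteps the Windows default):

import json

# Example prompt/completion pairs; replace with your own data source.
pairs = [
    ("Finetune Windows Error Test with a right quote\u201d", " Testing... Please ignore"),
]

# json.dumps escapes quotes, backslashes and newlines; ensure_ascii=False keeps
# characters like the curly quote readable instead of turning them into \uXXXX.
with open("train_prepared.jsonl", "w", encoding="utf-8") as f:
    for prompt, completion in pairs:
        f.write(json.dumps({"prompt": prompt, "completion": completion}, ensure_ascii=False) + "\n")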
Great job. Please share the process of how to modify it.
The way I handle it today: before I use the preparation tool, I use Notepad++ to convert the file to UTF-8, and for the actual fine-tuning I use the PythonAnywhere bash console. The encoding issue is quite irritating.
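If you’d rather do that conversion step in Python instead of Notepad++, something like this works (assuming the source file is readable as cp1252; change the source encoding to whatever your file actually uses):

# Re-encode a training file to UTF-8 before running the preparation tool.
src_encoding = "cp1252"  # assumption; adjust to your file's actual encoding
with open("train.jsonl", "r", encoding=src_encoding) as src:
    text = src.read()
with open("train_utf8.jsonl", "w", encoding="utf-8") as dst:
    dst.write(text)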
If you want to fix this right away, you will need to modify the cli.py file in the OpenAI module that you’ve installed. The change lets the CLI open and work with files that would otherwise trigger errors during fine-tuning, and it won’t cause any other issues:
Get the install path of Python. An easy way is to simply open Command Prompt or PowerShell and run this:
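For example, this prints the install prefix (run it from any Python prompt, or as a python -c one-liner in Command Prompt/PowerShell; it’s one option among several ways to locate the directory):

import sys
print(sys.prefix)  # e.g. C:\Users\<you>\AppData\Local\Programs\Python\Python39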
From that directory, enter .\Lib\site-packages\openai and you’ll find the ‘cli.py’ file that we’ll need to edit:
...\Python39\Lib\site-packages\openai\cli.py
Open ‘cli.py’ in any file editor, preferably one that can find/search for specific text. Search for “open(”. There are 3 instances in cli.py. For every instance, add the ‘encoding’ parameter to the open() call exactly as follows:
# Add: encoding="utf-8" to the open() function so that each instance now looks like this:
Line 204: file=open(args.file,encoding="utf-8"),
Line 250: file=open(file,encoding="utf-8"), purpose="fine-tune"
Line 283: file=open(file,encoding="utf-8"),
Save the modified ‘cli.py’ file and you should now be able to fine-tune with the previous files that would trigger errors!
Let me know if there are any issues so I can assist further!