Weird Error while finetuning

prafull.sharma · October 21, 2021, 7:59pm

Hello Everyone,

I am trying to fine-tune curie with a custom dataset. I prepared the dataset successfully (as can be seen in the image), and it creates the JSONL file. However, when I run the fine-tuning, I get the error shown in the attached image.

Would be grateful if someone could help me solve it.

Regards

boris · October 21, 2021, 8:03pm

I’ve seen this happen before on Windows. Can you try to specify the full path to the file?

prafull.sharma · October 21, 2021, 8:05pm

Hi @boris, thank you very much for replying. You’re right, I use Windows, but I am using Git Bash (as I was facing some issues installing the CLI on Windows). Let me try with the full path.

boris · October 24, 2021, 1:59am

did the full path work?

prafull.sharma · October 24, 2021, 4:21pm

Hello @boris , the full path didn’t work either. So I simply switched to Ubuntu.

DutytoDevelop · October 24, 2021, 8:10pm

More on this, @prafull.sharma and @boris: I looked into why the byte 0x9d wasn’t mapping to anything (undefined) and found this:

On Windows, the default encoding is cp1252, so when open() is called without an encoding it tries 'cp1252' instead of 'UTF-8', but that file is most likely encoded in UTF-8.
It looks like the cli.py file in the OpenAI module calls open() 3 times without passing in the ‘encoding’ parameter:

It looks like, to fix this, the cli.py module needs to change the open() function calls to have encoding="UTF-8" as one of the parameters.

I don’t think the JSONL file was even created due to this error, which is why @prafull.sharma wasn’t able to retrieve it. After looking through the traceback provided, it became clear that this is most likely the reason why. Including encoding='utf-8' in the open() function, so that Python 3 on Windows knows to use that encoding, should fix this error!
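A minimal sketch of the behaviour described above, assuming Python 3 and a throwaway temporary file. The sample record and the `read_text` helper are made up for illustration and are not taken from cli.py. The same UTF-8 bytes read cleanly when encoding="utf-8" is passed to open() and fail when cp1252 is used:

```python
import locale
import os
import tempfile


def read_text(path, encoding=None):
    """Read a text file; encoding=None lets open() pick the platform default."""
    with open(path, encoding=encoding) as handle:
        return handle.read()


# Usually the encoding open() falls back to in text mode when none is given
# (cp1252 on many Windows setups, UTF-8 on most Linux systems).
default_encoding = locale.getpreferredencoding(False)

# Hypothetical sample line containing curly quotes, stored as UTF-8 bytes.
sample = '{"prompt": "\u201cquoted\u201d", "completion": " ok"}\n'
with tempfile.NamedTemporaryFile("wb", suffix=".jsonl", delete=False) as tmp:
    tmp.write(sample.encode("utf-8"))
    sample_path = tmp.name

# Explicit UTF-8 decodes the file correctly on any platform.
utf8_text = read_text(sample_path, encoding="utf-8")

# Forcing cp1252 reproduces the 'charmap' error from the traceback.
try:
    read_text(sample_path, encoding="cp1252")
    cp1252_error = None
except UnicodeDecodeError as exc:
    cp1252_error = exc

os.remove(sample_path)
print(default_encoding)
print(cp1252_error)
```

Passing the encoding explicitly makes the result independent of whatever the machine's default happens to be.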

If anyone is getting this error and really wants to fix this right now, then a temporary solution would be to modify the cli.py file in the OpenAI Python 3 site-packages folder to include the encoding='UTF-8' parameter.

I think I may try to fine-tune a model tonight on Windows just to see if I can replicate the error and confirm that adding the encoding="UTF-8" parameter does the trick.

Let me know if this helps @boris.

Sources:
StackOverflow - Charmap decoding error
StackOverflow - Unable to decode byte 0x9d
StackOverflow - Python 3 Default Encoding CP1252

DutytoDevelop · October 24, 2021, 9:01pm

Actually, I’m performing the fine-tuning now with a test JSONL file I made. I’ll be using the GIT bash like @prafull.sharma since I can’t use CMD either (tries to make me open the file using another program)

boris · October 24, 2021, 9:46pm

Wow, thanks! I’ll try to get this fix deployed ASAP

DutytoDevelop · October 24, 2021, 9:55pm

Glad to help! I was able to recreate the error (see a couple of posts down). Further research shows that the special right double quote character (”) contains the byte 0x9d when encoded as UTF-8.

>>> '“”'.encode()  # the 0x9d byte is the last byte of ”
b'\xe2\x80\x9c\xe2\x80\x9d'

>>> b'\xe2\x80\x9c\xe2\x80\x9d'.decode("cp1252")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\iadmin\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 5: character maps to <undefined>

>>> b'\xe2\x80\x9c\xe2\x80\x9d'.decode("UTF-8")
'“”'

I got the same error that @prafull.sharma did, so the special right quote is most likely the trouble character in this scenario. Specifying “UTF-8” when opening files should fix the error!
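To check a prepared file for such bytes before uploading, something like the following could be used. This is a rough sketch: `undecodable_offsets` is a hypothetical helper, not part of the OpenAI CLI, and it assumes a single-byte target encoding such as cp1252.

```python
from typing import List


def undecodable_offsets(data: bytes, encoding: str = "cp1252") -> List[int]:
    """Return the offsets of bytes that the given encoding cannot decode."""
    offsets = []
    start = 0
    while start < len(data):
        try:
            data[start:].decode(encoding)
            break  # the rest of the data decodes without errors
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the slice we decoded.
            offsets.append(start + exc.start)
            start += exc.end
    return offsets


# The two curly quotes from the example above, encoded as UTF-8.
quote_bytes = "\u201c\u201d".encode("utf-8")
bad_offsets = undecodable_offsets(quote_bytes)
print(bad_offsets)
```

For a file on disk, the bytes would come from reading it in binary mode ("rb") so that no decoding happens before the check.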

prafull.sharma · October 24, 2021, 10:03pm

Thanks @DutytoDevelop .
Actually the jsonl file does get created when using gitbash. I checked the content as well, and it looked fine to me.

But while running the fine-tuning command (with git bash), I faced this error.

However, when I used the same jsonl file (which was created in git bash) to fine tune on an Ubuntu machine, it runs smoothly.

DutytoDevelop · October 24, 2021, 10:20pm

@prafull.sharma, I was able to create the JSONL file too, and I recreated the error you got by putting the special right quote character in the JSONL file. The default file encoding on Ubuntu is UTF-8, while the Windows default is Windows-1252 (“cp1252” in Python), which is why fine-tuning with your prepared JSONL file succeeded on Ubuntu.

My test-finetune2_prepared.jsonl file looks exactly like this:

{"prompt":"1Finetune Windows Error Test with special quotes that contain byte 0x9d”””","completion":" 1Testing... Please ignore”””"}
{"prompt":"2Finetune Windows Error Test”””","completion":" 2Testing... Please ignore”””"}
{"prompt":"3Finetune Windows Error Test”””","completion":" 3Testing... Please ignore”””"}

Success!

I modified the cli.py file to include the encoding="utf-8" parameter within the open() function and I was able to get past the error with the same JSONL file!!

jackcole · October 26, 2021, 1:48pm

I have had this type of error before related to string characters not being properly escaped. I run a python script to properly escape the strings before I convert to the JSONL format for GPT3 fine-tuning.
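One possible way to do that kind of escaping (a sketch, not necessarily the script described above) is to let `json.dumps` build each line. With its default `ensure_ascii=True`, non-ASCII characters such as curly quotes are written as `\uXXXX` escapes, so the resulting file is plain ASCII and decodes the same way under cp1252 and UTF-8. The `to_jsonl` helper and the sample record are assumptions for illustration.

```python
import json


def to_jsonl(records):
    """Serialize dicts to JSONL text, one JSON object per line.

    json.dumps escapes quotes, backslashes, control characters and,
    by default, every non-ASCII character.
    """
    return "".join(json.dumps(record) + "\n" for record in records)


# Hypothetical training record with curly quotes, straight quotes and a newline.
records = [
    {"prompt": "He said \u201chello\u201d ->", "completion": ' A "greeting"\n'}
]
jsonl_text = to_jsonl(records)

# Pure ASCII output: this line would raise UnicodeEncodeError otherwise.
ascii_bytes = jsonl_text.encode("ascii")

# Round trip: parsing each line gives back the original records.
round_tripped = [json.loads(line) for line in jsonl_text.splitlines()]
print(jsonl_text, end="")
```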

DutytoDevelop · October 26, 2021, 1:54pm

Well, now you won’t have to

NSY · October 28, 2021, 4:45am

Great job. Please share the process of how to modify it.
The way I handle it today: every time before I use the preparation tool, I use Notepad++ to convert the file to UTF-8, and for the actual fine-tuning I use the Pythonanywhere bash. The encoding issue is quite irritating.
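The same conversion can probably be scripted instead of done by hand. A minimal sketch, assuming the input file really is cp1252 (the `reencode_to_utf8` helper and the file names are hypothetical):

```python
import os
import tempfile


def reencode_to_utf8(src_path, dst_path, source_encoding="cp1252"):
    """Rewrite a text file as UTF-8; source_encoding is an assumption about the input."""
    # newline="" keeps line endings exactly as they are in the source file.
    with open(src_path, "r", encoding=source_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


# Demo on a temporary cp1252 file containing curly quotes.
original_text = "\u201cTesting\u201d\r\n"
with tempfile.TemporaryDirectory() as workdir:
    src_path = os.path.join(workdir, "data_cp1252.txt")
    dst_path = os.path.join(workdir, "data_utf8.txt")
    with open(src_path, "wb") as handle:
        handle.write(original_text.encode("cp1252"))
    reencode_to_utf8(src_path, dst_path)
    with open(dst_path, "rb") as handle:
        converted_bytes = handle.read()

print(converted_bytes)
```

If the guess about the source encoding is wrong, the output will contain the wrong characters, so it is worth spot-checking the result.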

DutytoDevelop · October 29, 2021, 8:31am

Hello @NSY,

If you want to fix this right away, you will need to modify the cli.py file in the OpenAI module that you’ve installed. The changes we will make will allow you to open and work with files that may cause errors during fine-tuning and will not result in any other issues:

Get the install path of Python. An easy way is to simply open Command Prompt or PowerShell and run this:

Python -c "import sys; print(sys.executable)" 
# For me, I'll get: C:\Users\iadmin\AppData\Local\Programs\Python\Python39\python.exe

From that directory, enter .\Lib\site-packages\openai and you’ll find the ‘cli.py’ file that we’ll need to edit:

...\Python39\Lib\site-packages\openai\cli.py

Open ‘cli.py’ in any file editor, preferably a file editor with the ability to find / search for specific text. You’ll search for “open(”. There are 3 instances in cli.py. For every instance, add the ‘encoding’ parameter to the open() function exactly as follows:


# Add: encoding="utf-8" to the open() function so that each instance now looks like this:
Line 204:             file=open(args.file,encoding="utf-8"),
Line 250:                     file=open(file,encoding="utf-8"), purpose="fine-tune"
Line 283:                     file=open(file,encoding="utf-8"),

Save the modified ‘cli.py’ file and you should now be able to fine-tune with the previous files that would trigger errors!
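To double-check that no open() call was missed, a small script along these lines might help. It is only a sketch: `open_calls_missing_encoding` is a hypothetical helper, the `upload(...)` lines in the sample are invented stand-ins rather than the real contents of cli.py, and the check only looks for an `encoding=` keyword, so it would also flag binary-mode opens and miss an encoding passed positionally.

```python
import ast


def open_calls_missing_encoding(source: str):
    """Return line numbers of built-in open(...) calls without an encoding= keyword."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "open"
            and not any(kw.arg == "encoding" for kw in node.keywords)
        ):
            missing.append(node.lineno)
    return sorted(missing)


# Invented sample: one unpatched call and one patched call.
sample_source = (
    "upload(file=open(args.file))\n"
    'upload(file=open(args.file, encoding="utf-8"))\n'
)
unpatched_lines = open_calls_missing_encoding(sample_source)
print(unpatched_lines)
```

To run it against the edited file, read cli.py as text (with encoding="utf-8") and pass the contents to the function; an empty list would suggest every open() call now names an encoding.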

Let me know if there are any issues so I can assist further!

NSY · October 29, 2021, 9:50am

Thanks a lot! I’ll try it out.

DutytoDevelop · October 29, 2021, 5:28pm

Sounds good! If any issues come up, just reach back out on here and I’ll do my best to assist!

NSY · November 1, 2021, 9:04pm

OMG, you just saved me a year in my life. You deserve heaven! (In many years to come, don’t worry). Thanks a lot for this.

DutytoDevelop · November 1, 2021, 9:25pm

Glad to hear that the fix works!

tanner49 · November 3, 2021, 10:22pm

Thank you for saving my life! Worked perfectly. Hope the fix goes out soon.
