How to fix UnicodeDecodeError in the OpenAI data preparation tool?

Could anyone please tell me how to fix this problem? I was trying to prepare my dataset in JSON format using the OpenAI data preparation tool (openai tools fine_tunes.prepare_data -f <LOCAL_FILE>), but I got the following message in my terminal. Your kind answer will be much appreciated. Thanks!

File "pandas/_libs/parsers.pyx", line 544, in pandas._libs.parsers.TextReader.__cinit__
File "pandas/_libs/parsers.pyx", line 633, in pandas._libs.parsers.TextReader._get_header
File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 1952, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 7684: invalid continuation byte

Welcome to the OpenAI community @conda78o!

Can you make sure the OpenAI module is fully updated by running pip install --upgrade openai?

After some research: the byte 0xE9 can't be decoded as UTF-8 on its own. In UTF-8 that byte would have to start a multi-byte sequence, so its presence usually means the file was saved in a different encoding, such as ISO-8859-1 (Latin-1), where 0xE9 is 'é'.

Sujits’s answer on StackOverflow:

As suggested by Mark Ransom, I found the right encoding for that problem. The encoding was "ISO-8859-1", so replacing open("u.item", encoding="utf-8") with open("u.item", encoding="ISO-8859-1") will solve the problem.
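Adapted to your case, that suggests re-encoding the dataset to UTF-8 yourself before running the tool. A rough sketch, assuming the file really is ISO-8859-1 (the filenames here are just placeholders):

# Read with the encoding the file was actually saved in (ISO-8859-1 is an assumption)...
with open("dataset.csv", encoding="ISO-8859-1") as src:
    text = src.read()

# ...then write it back out as UTF-8 so the data preparation tool can decode it.
with open("dataset_utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(text)

After that, you can point the tool at the re-encoded copy: openai tools fine_tunes.prepare_data -f dataset_utf8.csv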

There was a similar issue in the past where fine-tune data was not being properly encoded as UTF-8 inside the Python OpenAI module during the fine-tuning process, and an error like the one you encountered was thrown. That is why the module now implicitly encodes everything as UTF-8.

Now it seems that the Latin character 'é' (byte 0xE9 in ISO-8859-1) is encoded differently in UTF-8, where it becomes the two bytes 0xC3 0xA9, and that mismatch is what is causing the problem. We don't want to restrict fine-tune data so that it can't contain Latin characters, so we need to come up with a different solution.
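A quick way to see the difference in plain Python:

text = "é"
print(text.encode("ISO-8859-1"))   # b'\xe9'      -> one byte in Latin-1
print(text.encode("utf-8"))        # b'\xc3\xa9'  -> two bytes in UTF-8

# Decoding Latin-1 bytes as UTF-8 reproduces the error from the traceback:
b"caf\xe9 ".decode("utf-8")        # raises UnicodeDecodeError: invalid continuation byte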

@madeleine and @boris, I suspect that GPT-3 doesn't just accept UTF-8-encoded data, since 'é' maps to token 2634 in the tokenizer tool. If that's true, then instead of implicitly using UTF-8 when handling fine-tune data in the OpenAI module, we will need to detect the encoding beforehand and pass the detected encoding to all relevant functions used during fine-tuning.

VertigoRay, on the same Stack Overflow post, provided code (originally written by David Z) that detects the encoding of the raw bytes being passed in:

result = chardet.detect(rawdata)      # rawdata: the file's raw bytes (see the sketch below)
char_encoding = result['encoding']    # e.g. 'ISO-8859-1'
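To make that snippet self-contained: chardet is a third-party package (pip install chardet), the raw bytes come from opening the file in binary mode, and the detected encoding is then used instead of assuming UTF-8. A sketch with a placeholder filename:

import chardet

# Open in binary mode; nothing is decoded yet, so nothing can fail here.
with open("dataset.csv", "rb") as f:
    rawdata = f.read()

char_encoding = chardet.detect(rawdata)["encoding"]

# Decode with the detected encoding, or pass it on to whatever reads the file next.
text = rawdata.decode(char_encoding)

Inside the OpenAI module, that detected encoding is presumably what would get passed to the pandas call shown in the traceback above, e.g. pd.read_csv(path, encoding=char_encoding).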

I could go ahead and fork the OpenAI module's repository and submit the issue along with the proposed solution on GitHub, if that's advised, @boris and @madeleine.

Thanks, buddy, I got this problem solved!

Good to hear @conda78o! Did you make any changes to the files?

Yeah, I just changed the file format and it worked this time. :blush:

Nice! What was the format before and after, if you don’t mind me asking? I’m going to do research to see if there’s a way to prevent that error from happening to others in the future, so details on what caused it and what fixed it would be super helpful! :smiley:

Sure! Initially, I uploaded the file in CSV format, but I received a UnicodeDecodeError. Then I rebuilt the file in JSON format, and it worked! However, I'm not sure what actually caused the error.

@conda78o, it appears that the CSV file wasn't originally encoded in UTF-8, so when the tool came across that byte it threw the error. Converting to JSON re-encoded the data as UTF-8, which allowed it to be decoded properly.
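For anyone else who hits this, the same conversion (and the re-encoding that comes with it) can be done explicitly. A minimal sketch, assuming the original file is an ISO-8859-1-encoded CSV (the filenames and encoding are assumptions; adjust them to your data):

import pandas as pd

# Read the CSV with the encoding it was actually saved in (assumed here).
df = pd.read_csv("dataset.csv", encoding="ISO-8859-1")

# Write JSONL; force_ascii=False keeps characters like 'é' as real UTF-8 text
# rather than \u escape sequences.
df.to_json("dataset.jsonl", orient="records", lines=True, force_ascii=False)

The JSONL output should then decode cleanly as UTF-8 for the data preparation tool.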
