How to fix UnicodeDecodeError in the OpenAI data preparation tool?

conda78o · February 20, 2022, 4:14am

Could anyone please tell me how to fix this problem? I was trying to prepare my dataset in JSON format using the OpenAI data preparation tool (openai tools fine_tunes.prepare_data -f <LOCAL_FILE>), but I got the following message in my terminal. Your kind answer will be much appreciated. Thanks!

File "pandas/_libs/parsers.pyx", line 544, in pandas._libs.parsers.TextReader.__cinit__
File "pandas/_libs/parsers.pyx", line 633, in pandas._libs.parsers.TextReader._get_header
File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 1952, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 7684: invalid continuation byte

DutytoDevelop · February 20, 2022, 4:48am

Welcome to the OpenAI community @conda78o!

Can you make sure the OpenAI module is fully updated by running pip install --upgrade openai?

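After some research: byte 0xe9 on its own isn’t valid UTF-8. In UTF-8, 0xe9 can only appear as the first byte of a three-byte sequence, and in your file the byte after it isn’t a continuation byte, hence the “invalid continuation byte” message.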

Sujits’s answer on StackOverflow:

As suggested by Mark Ransom, I found the right encoding for that problem. The encoding was "ISO-8859-1" , so replacing open("u.item", encoding="utf-8") with open('u.item', encoding = "ISO-8859-1") will solve the problem.
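Since you can’t change the open() call inside the preparation tool itself, one way to apply the same idea is to re-save your file as UTF-8 before running the tool. Here is a minimal sketch, assuming the file really is ISO-8859-1; the function name reencode_to_utf8 and the file names are made up for illustration and are not part of the OpenAI module:

```python
import os
import tempfile


def reencode_to_utf8(src_path, dst_path, source_encoding="ISO-8859-1"):
    """Read src_path as source_encoding and write the same text out as UTF-8."""
    # newline="" disables newline translation, so line endings pass through unchanged.
    with open(src_path, "r", encoding=source_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


# Demo with a throwaway file (made-up content, hypothetical file names).
workdir = tempfile.mkdtemp()
latin1_csv = os.path.join(workdir, "data_latin1.csv")
utf8_csv = os.path.join(workdir, "data_utf8.csv")
with open(latin1_csv, "wb") as f:
    f.write(b"prompt,completion\ncaf\xe9,coffee shop\n")

reencode_to_utf8(latin1_csv, utf8_csv)

with open(utf8_csv, "rb") as f:
    utf8_result = f.read()
```

If the source encoding guess is wrong, the output will still be valid UTF-8 but some characters may come out as the wrong letters, so it’s worth eyeballing a few accented rows afterwards.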

There was a similar issue before: fine-tune data wasn’t being handled as UTF-8 inside the Python OpenAI module during the fine-tuning process, and it threw an error much like the one you encountered. That’s why the module now implicitly uses UTF-8.

Now it seems the Latin character ‘é’ is what’s causing problems: in ISO-8859-1 it is the single byte 0xe9, but UTF-8 encodes it as the two bytes 0xc3 0xa9, so a file saved as ISO-8859-1 fails when it is read as UTF-8. We don’t want to stop fine-tune data from containing Latin characters, so we need to come up with a different solution.
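To make the mismatch concrete, here is a small sketch using only the standard library. The sample string is made up and is not from the original dataset. It shows the two byte representations of ‘é’ and reproduces the same kind of error by decoding ISO-8859-1 bytes as UTF-8:

```python
text = "café au lait"  # made-up sample text

latin1_bytes = text.encode("iso-8859-1")  # 'é' becomes the single byte 0xe9
utf8_bytes = text.encode("utf-8")         # 'é' becomes the two bytes 0xc3 0xa9

print(latin1_bytes)
print(utf8_bytes)

try:
    latin1_bytes.decode("utf-8")
    error_reason = None
except UnicodeDecodeError as exc:
    # 0xe9 announces a multi-byte sequence, but the next byte is a plain space.
    error_reason = exc.reason
    print(exc)

# Decoding with the encoding the bytes were written in works fine.
roundtrip = latin1_bytes.decode("iso-8859-1")
```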

@madeleine and @boris, I suspect that GPT-3 doesn’t only accept UTF-8 encoded data, since ‘é’ maps to token 2634 in the tokenizer tool. If that’s true, then instead of implicitly using UTF-8 when handling fine-tune data in the OpenAI module, we would need to detect the encoding beforehand and pass the detected encoding to all relevant functions used during fine-tuning.

On the same StackOverflow post, VertigoRay shared code, originally from David Z, that detects the encoding of the bytes being passed in by using chardet:

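import chardet  # third-party package: pip install chardet
# rawdata must be bytes, e.g. read from a file opened in 'rb' mode
result = chardet.detect(rawdata)
char_encoding = result['encoding']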
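chardet is a third-party package, so it may not be available everywhere. As a rougher stand-in, and to be clear this is trial decoding rather than chardet’s statistical detection, you can try a short list of candidate encodings in order and keep the first one that decodes cleanly. The helper name guess_encoding and the candidate list are my own assumptions:

```python
def guess_encoding(rawdata, candidates=("utf-8", "cp1252", "iso-8859-1")):
    """Return the first candidate encoding that decodes rawdata without error."""
    for name in candidates:
        try:
            rawdata.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Made-up sample bytes for illustration.
sample_utf8 = "café".encode("utf-8")
sample_legacy = b"caf\xe9 au lait"

guess_utf8 = guess_encoding(sample_utf8)
guess_legacy = guess_encoding(sample_legacy)
print(guess_utf8, guess_legacy)
```

Order matters here: iso-8859-1 accepts any byte sequence, so it only makes sense as the last resort, and a successful decode doesn’t prove the guess is right, only that it is possible.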

I could go ahead and fork the OpenAI module and submit the issue along with the proposed solution on GitHub, if that’s advised, @boris and @madeleine.

conda78o · February 21, 2022, 7:03pm

Thanks buddy, I got this problem solved!

DutytoDevelop · February 21, 2022, 7:47pm

Good to hear @conda78o! Did you make any changes to the files?

conda78o · February 21, 2022, 9:31pm

Yeah, I just changed the file format and it worked this time.

DutytoDevelop · February 22, 2022, 2:45am

Nice! What was the format before and after, if you don’t mind me asking? I’m going to do research to see if there’s a way to prevent that error from happening to others in the future, so details on what caused it and what fixed it would be super helpful!

conda78o · February 22, 2022, 4:56am

Sure! Initially, I uploaded the file in CSV format, but I received a UnicodeDecodeError. Then I redesigned the files in JSON format, and it worked! However, I’m not sure what actually caused the error.

DutytoDevelop · February 22, 2022, 7:28pm

@conda78o, it looks like the CSV file wasn’t originally encoded in UTF-8, so when the parser came across byte 0xe9 it threw that error. Converting to JSON re-encoded the data in UTF-8, which allowed it to be decoded properly.
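For anyone who lands here with the same error, this is roughly what that conversion looks like when done by hand. It is only a sketch: it assumes the CSV is ISO-8859-1 and has prompt and completion columns, and the function name csv_to_utf8_jsonl and the sample rows are made up for illustration:

```python
import csv
import json
import os
import tempfile


def csv_to_utf8_jsonl(csv_path, jsonl_path, source_encoding="iso-8859-1"):
    """Read a CSV in source_encoding and write one UTF-8 JSON object per row."""
    with open(csv_path, "r", encoding=source_encoding, newline="") as src, \
            open(jsonl_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            # ensure_ascii=False writes 'é' as real UTF-8 instead of a \u00e9 escape.
            dst.write(json.dumps(row, ensure_ascii=False) + "\n")


# Demo with a throwaway ISO-8859-1 CSV (made-up content).
workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, "data.csv")
jsonl_path = os.path.join(workdir, "data.jsonl")
with open(csv_path, "wb") as f:
    f.write("prompt,completion\nCapital of Québec?,Québec City\n".encode("iso-8859-1"))

csv_to_utf8_jsonl(csv_path, jsonl_path)

with open(jsonl_path, "rb") as f:
    jsonl_bytes = f.read()
rows = [json.loads(line) for line in jsonl_bytes.decode("utf-8").splitlines()]
print(rows)
```

The key step is opening the CSV with the encoding it was actually saved in; the JSONL output is then plain UTF-8 regardless of what the input was.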
