Could anyone please tell me how to fix this problem? I was trying to prepare my dataset in JSON format using the OpenAI data preparation tool (openai tools fine_tunes.prepare_data -f <LOCAL_FILE>), but I got the following message in my terminal. Your kind answer will be much appreciated. Thanks!
File "pandas/_libs/parsers.pyx", line 544, in pandas._libs.parsers.TextReader.__cinit__
File "pandas/_libs/parsers.pyx", line 633, in pandas._libs.parsers.TextReader._get_header
File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 1952, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 7684: invalid continuation byte
Welcome to the OpenAI community @conda78o!
Can you make sure the OpenAI module is fully updated by running
pip install --upgrade openai?
After some research, it turns out byte 0xE9 can't be decoded as UTF-8 on its own; it's the single-byte Latin-1 (ISO-8859-1) encoding of 'é'.
Sujits’s answer on StackOverflow:
As suggested by Mark Ransom, I found the right encoding for that problem. The encoding was "ISO-8859-1", so replacing open("u.item", encoding="utf-8") with open('u.item', encoding="ISO-8859-1") will solve the problem.
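Applied to this case, here's a minimal sketch of that fix: re-open the file with the right encoding and save it back out as UTF-8 so prepare_data's default UTF-8 read succeeds. The filename is a placeholder, and it assumes the source file really is ISO-8859-1:

# Re-encode a Latin-1 file as UTF-8 before running prepare_data.
# "dataset.csv" is a placeholder; assumes the source really is ISO-8859-1.
with open("dataset.csv", encoding="ISO-8859-1") as src:
    text = src.read()
with open("dataset_utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(text)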
There was a similar issue where fine-tune data was not being properly encoded as UTF-8 inside the Python OpenAI module during fine-tuning, and it threw an error much like the one you encountered; that's why the module now implicitly encodes in UTF-8.
Now it seems the Latin character 'é' is the problem: it's the single byte 0xE9 in Latin-1, but UTF-8 encodes it differently (as two bytes), so a Latin-1 file trips up the UTF-8 decoder. We don't want to restrict the fine-tune data so that it can't accept Latin characters, so we need to come up with a different solution.
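You can see the mismatch in a Python shell:

"é".encode("ISO-8859-1")  # b'\xe9' (the single byte from the traceback)
"é".encode("utf-8")       # b'\xc3\xa9' (two bytes)
b"\xe9".decode("utf-8")   # raises UnicodeDecodeError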
@madeleine and @boris, I suspect that GPT-3 doesn't just accept UTF-8 encoded data, since 'é' maps to token 2634 when using the tokenizer tool. If that's true, then we'll actually need to stop implicitly using UTF-8 when handling fine-tune data in the OpenAI module, and instead detect the encoding beforehand and pass the detected encoding to all relevant functions used during fine-tuning.
VertigoRay on the same StackOverflow post provided code that David Z originally came up with, which lets us detect the encoding of the bytes being passed in:

import chardet

# rawdata: the raw bytes read from the fine-tune file
result = chardet.detect(rawdata)
char_encoding = result['encoding']
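End to end, that could look something like the sketch below. The helper name is hypothetical, but chardet.detect and pandas' read_csv encoding parameter are real:

import chardet
import pandas as pd

def read_with_detected_encoding(path):
    # Hypothetical helper: sniff the encoding, then let pandas use it.
    with open(path, "rb") as f:
        raw = f.read()
    detected = chardet.detect(raw)["encoding"] or "utf-8"  # fall back if detection fails
    return pd.read_csv(path, encoding=detected)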
I could go ahead and fork the OpenAI module and submit the issue along with the proposed solution on GitHub, if that's advised, @boris and @madeleine.
Thanks buddy, I got this problem solved!
Good to hear @conda78o! Did you make any changes to the files?
Yeah, I just changed the file format and it worked this time.
Nice! What was the format before and after, if you don’t mind me asking? I’m going to do research to see if there’s a way to prevent that error from happening to others in the future, so details on what caused it and what fixed it would be super helpful!
Sure! Initially, I uploaded the file in CSV format, but I received a UnicodeDecodeError. Then I reformatted the file as JSON, and it worked! However, I'm not sure what actually caused the error.
@conda78o, it appears the CSV file was not originally encoded in UTF-8, so when the parser came across that byte, it threw the error. Converting to JSON re-encoded the data as UTF-8, which allowed it to be decoded properly.
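For anyone who hits this later, here's a minimal sketch of that conversion with pandas. The filename, the ISO-8859-1 encoding, and the prompt/completion column names are all assumptions about the data:

import pandas as pd

# Read the CSV with its actual encoding, then write UTF-8 JSONL,
# which is the format prepare_data ultimately wants.
df = pd.read_csv("dataset.csv", encoding="ISO-8859-1")
df = df[["prompt", "completion"]]
df.to_json("dataset.jsonl", orient="records", lines=True, force_ascii=False)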