Welcome to the OpenAI community @conda78o!
Can you make sure the OpenAI module is fully updated by running pip install --upgrade openai?
After some research, the byte 0xe9 can’t be decoded as UTF-8.
Sujit’s answer on StackOverflow:
As suggested by Mark Ransom, I found the right encoding for that problem. The encoding was "ISO-8859-1", so replacing open("u.item", encoding="utf-8") with open('u.item', encoding = "ISO-8859-1") will solve the problem.
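For reference, the failure is easy to reproduce in a Python shell; the lone byte 0xe9 raises a UnicodeDecodeError under UTF-8 but decodes cleanly as ISO-8859-1:

b"\xe9".decode("utf-8")       # UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 ...
b"\xe9".decode("ISO-8859-1")  # 'é'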
There was a similar issue where fine-tune data was not being properly encoded as UTF-8 inside the Python OpenAI module during the fine-tuning process, and it threw an error like the one you encountered, which is why the module now implicitly encodes in UTF-8.
Now it seems that the Latin character ‘é’, byte 0xe9, is encoded differently in UTF-8, and that mismatch is what’s causing the problem. We don’t want to restrict fine-tune data from containing Latin characters, so we need to come up with a different solution.
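To make the mismatch concrete, ‘é’ is the single byte 0xe9 in ISO-8859-1 but two bytes in UTF-8, so a file encoded as Latin-1 and read as UTF-8 will fail on that byte:

"é".encode("ISO-8859-1")  # b'\xe9'   (one byte)
"é".encode("utf-8")       # b'\xc3\xa9' (two bytes)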
@madeleine and @boris, I suspect that GPT-3 doesn’t just accept UTF-8 encoded data, since ‘é’ maps to token 2634 when using the tokenizer tool. If that’s true, then instead of implicitly using the UTF-8 encoding when handling fine-tune data in the OpenAI module, we will need to detect the encoding beforehand and pass the detected encoding to all of the relevant functions used during fine-tuning.
VertigoRay on the same StackOverflow post provided code, originally from David Z, that detects the encoding of the raw bytes being passed in:
import chardet

result = chardet.detect(rawdata)  # rawdata: the raw bytes of the fine-tune file
char_encoding = result['encoding']
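To sketch how that could fit into the module (just an illustration; read_fine_tune_file and the file path are hypothetical, not actual OpenAI module code), the raw bytes could be read first, the encoding detected, and the result used to decode instead of a hard-coded UTF-8:

import json
import chardet

def read_fine_tune_file(path):
    # Read the raw bytes first so the encoding can be detected.
    with open(path, "rb") as f:
        rawdata = f.read()

    # chardet.detect returns a dict like {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...};
    # fall back to UTF-8 if detection comes back empty.
    result = chardet.detect(rawdata)
    char_encoding = result["encoding"] or "utf-8"

    # Decode with the detected encoding and parse each JSONL line.
    text = rawdata.decode(char_encoding)
    return [json.loads(line) for line in text.splitlines() if line.strip()]

The same char_encoding could then be passed along to any other open() or decode() calls used while preparing the fine-tune data.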
I could go ahead and fork the OpenAI module and open an issue on GitHub along with the proposed solution, if that’s advised @boris and @madeleine.