I believe the issue is related to my encoding, but it's a bit odd because I can view the dataset without any issue everywhere else. The upload may be failing because I use some Latin characters, but I don't want to remove them because I need them in the dataset.
Don't put raw bytes like 0xE9 in your training file — on its own, that byte is not a valid UTF-8 sequence. You can load the file into an encoding-aware editor like Notepad++ and see what sits at the reported byte position.
Okay, so the issue is related to the character é (0xE9), but I don't want to remove it from the training file. Words with such characters play an important role in my dataset.
Since UTF-8 is the default encoding for most files, I was expecting the OpenAI endpoint to support it without any issue. Is there a plan to fix this? If not, are there alternative encodings I can try?
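In case it helps: the file is most likely saved as Latin-1 (or Windows-1252) rather than UTF-8. In Latin-1, é is the single byte 0xE9, which is not a valid UTF-8 sequence by itself, so a strict UTF-8 validator will reject the file. You don't need to remove the accented characters — just re-encode the file. A minimal sketch (the filenames here are placeholders):

```python
# Build a small Latin-1 encoded file to stand in for the training data.
# In Latin-1, "é" is the single byte 0xE9 — invalid as standalone UTF-8.
sample = "caf\xe9"  # "café"
with open("train_latin1.jsonl", "wb") as f:
    f.write((sample + "\n").encode("latin-1"))

raw = open("train_latin1.jsonl", "rb").read()
assert b"\xe9" in raw  # the byte the upload complains about

# Decode with the file's actual encoding, then write it back as UTF-8.
# "é" becomes the two-byte sequence 0xC3 0xA9, which validators accept.
text = raw.decode("latin-1")
with open("train_utf8.jsonl", "wb") as f:
    f.write(text.encode("utf-8"))

fixed = open("train_utf8.jsonl", "rb").read()
```

The accented words survive unchanged — only the byte representation differs, so the re-encoded file should upload while keeping every é intact.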