Invalid file format - Encoding issue

Hey, I have created a JSONL dataset to fine-tune a gpt-3.5-turbo-0613 model, but when I try to upload it, the upload fails.

{
    "success": true,
    "status": 200,
    "data": {
        "id": "file-CkosqemZyAqUVI7zrCHxD8XW",
        "filename": "training-dataset.jsonl",
        "size": 43189,
        "status": "error",
        "statusDetails": "Invalid file format. Example 1'utf-8' codec can't decode byte 0xe9 in position 11278: invalid continuation byte",
        "createdAt": "2023-09-30T16:30:04.000Z"
    }
}

I believe the issue is related to my encoding, but it’s a bit odd because I can view the dataset without any issue everywhere else. The upload possibly fails because I am using some accented Latin characters, but I don’t want to remove them because I need them in the dataset.

Any suggestions on how to work around this issue?

Sure, don’t put raw bytes like 0xe9 in your training file when they aren’t valid UTF-8. You can load the file into a code-friendly editor like Notepad++ and see what is at the mentioned byte position.
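
If you would rather check it from Python than Notepad++, here is a minimal sketch (assuming the local file is the training-dataset.jsonl from the upload response, and using the byte offset 11278 reported in the error):

```python
# Inspect the byte the API complains about.
# File name and offset are taken from the upload response above; adjust as needed.
PATH = "training-dataset.jsonl"
OFFSET = 11278

with open(PATH, "rb") as f:
    data = f.read()

print(hex(data[OFFSET]))                      # the offending byte, e.g. 0xe9
print(data[max(0, OFFSET - 30):OFFSET + 30])  # surrounding bytes for context
```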

Okay, so the issue is related to the character é (0xe9), but I don’t want to remove it from the training file. Words containing such characters play an important role in my dataset.

Since UTF-8 is the default encoding for most files, I was expecting the OpenAI endpoint to support it without any issue. Is there a plan to fix this? If not, are there any alternative encodings that I can try out?

A bit unusual.

You can use an escape sequence for it (\u00e9 inside the JSON strings), or you can represent the glyph differently.
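
For example, here is a minimal sketch of the escape-sequence route, assuming the original file was actually saved as Latin-1/Windows-1252 (a lone 0xe9 byte is é in those encodings but invalid UTF-8) and assuming these file names:

```python
import json

SRC = "training-dataset.jsonl"        # original file, assumed Latin-1 encoded
DST = "training-dataset-utf8.jsonl"   # cleaned file to upload instead

# Decode with Latin-1 (every byte maps to a character), then re-emit the JSON.
# ensure_ascii=True escapes é as \u00e9, so the output is plain ASCII and
# therefore also valid UTF-8; ensure_ascii=False would instead keep é and
# write it as the proper two-byte UTF-8 sequence.
with open(SRC, encoding="latin-1") as src, open(DST, "w", encoding="utf-8") as dst:
    for line in src:
        if line.strip():
            record = json.loads(line)
            dst.write(json.dumps(record, ensure_ascii=True) + "\n")
```

Either way the é stays in the training data; only its byte representation changes, and the resulting file decodes cleanly as UTF-8.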


A quick Unicode test of Python output:

> >>> print("\u0065\u0301")
> é
> >>> print("\u00E9")
> é
> >>>

The first is “e” plus a combining accent; the second is the single precomposed character. They are a bit different, and only the last one gets picked up by my spellcheck.
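
If it matters which of the two forms ends up in the training file, Python’s unicodedata module can normalize everything to one of them (NFC gives the precomposed é, NFD the decomposed e + combining accent), for example:

```python
import unicodedata

decomposed = "\u0065\u0301"   # "e" followed by a combining acute accent
precomposed = "\u00e9"        # the single precomposed character

print(decomposed == precomposed)                                # False: different code points
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True: NFC composes them
print([hex(ord(c)) for c in unicodedata.normalize("NFD", precomposed)])
# ['0x65', '0x301']: NFD decomposes back into the two-code-point form
```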