Invalid file format - Encoding issue

Hey, I have created a JSONL dataset to fine-tune a gpt-3.5-turbo-0613 model, but when I try to upload it, the upload fails.

{
    "success": true,
    "status": 200,
    "data": {
        "id": "file-CkosqemZyAqUVI7zrCHxD8XW",
        "filename": "training-dataset.jsonl",
        "size": 43189,
        "status": "error",
        "statusDetails": "Invalid file format. Example 1'utf-8' codec can't decode byte 0xe9 in position 11278: invalid continuation byte",
        "createdAt": "2023-09-30T16:30:04.000Z"
    }
}

I believe the issue is related to my encoding, but it’s a bit odd because I can view the dataset without any issue everywhere else. The upload possibly fails because I am using some accented Latin characters, but I don’t want to remove them because I need them in the dataset.

Any suggestions on how to work around this issue?

Sure, don’t put raw bytes like 0xe9 in your training file when they aren’t valid UTF-8. You can load the file into a code-friendly editor like Notepad++ and see what is at the mentioned byte position.
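
If you would rather check it from Python than Notepad++, here is a minimal sketch (assuming the local file is the training-dataset.jsonl from the upload response, and using the byte offset 11278 reported in the error):

```python
# Inspect the byte the API complains about.
# File name and offset are taken from the upload response above; adjust as needed.
PATH = "training-dataset.jsonl"
OFFSET = 11278

with open(PATH, "rb") as f:
    data = f.read()

print(hex(data[OFFSET]))                      # the offending byte, e.g. 0xe9
print(data[max(0, OFFSET - 30):OFFSET + 30])  # surrounding bytes for context
```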

Okay, so the issue is related to the character é (0xe9), but I don’t want to remove it from the training file. Words containing such characters play an important role in my dataset.

Since UTF-8 is the default encoding for most files, I was expecting the OpenAI endpoint to support it without any issue. Is there a plan to fix this? If not, are there any alternative encodings that I can try out?

A bit unusual.

You can use an escape sequence for it (\u00e9 inside the JSON strings), or you can represent the glyph differently.
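
For example, here is a minimal sketch of the escape-sequence route, assuming the original file was actually saved as Latin-1/Windows-1252 (a lone 0xe9 byte is é in those encodings but invalid UTF-8) and assuming these file names:

```python
import json

SRC = "training-dataset.jsonl"        # original file, assumed Latin-1 encoded
DST = "training-dataset-utf8.jsonl"   # cleaned file to upload instead

# Decode with Latin-1 (every byte maps to a character), then re-emit the JSON.
# ensure_ascii=True escapes é as \u00e9, so the output is plain ASCII and
# therefore also valid UTF-8; ensure_ascii=False would instead keep é and
# write it as the proper two-byte UTF-8 sequence.
with open(SRC, encoding="latin-1") as src, open(DST, "w", encoding="utf-8") as dst:
    for line in src:
        if line.strip():
            record = json.loads(line)
            dst.write(json.dumps(record, ensure_ascii=True) + "\n")
```

Either way the é stays in the training data; only its byte representation changes, and the resulting file decodes cleanly as UTF-8.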


A quick Unicode test of Python output:

> >>> print("\u0065\u0301")
> é
> >>> print("\u00E9")
> é
> >>>

The first is “e” plus a combining accent; the second is the single precomposed character. They are a bit different, and only the last one gets picked up by my spellcheck.
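
If it matters which of the two forms ends up in the training file, Python’s unicodedata module can normalize everything to one of them (NFC gives the precomposed é, NFD the decomposed e + combining accent), for example:

```python
import unicodedata

decomposed = "\u0065\u0301"   # "e" followed by a combining acute accent
precomposed = "\u00e9"        # the single precomposed character

print(decomposed == precomposed)                                # False: different code points
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True: NFC composes them
print([hex(ord(c)) for c in unicodedata.normalize("NFD", precomposed)])
# ['0x65', '0x301']: NFD decomposes back into the two-code-point form
```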