Invalid file format: issues with encoding different languages and emojis in fine-tuning

Hi,

I am experiencing some issues with fine-tuning a model using OpenAI. My dataset includes characters from several languages, such as Japanese, Chinese, and Korean, as well as emojis. The data looks fine when saved with UTF-8 encoding, but the fine-tuning job fails with the error 'The job failed due to an invalid training file. Invalid file format'.

Here is the code I am using to read and process my JSONL file:
import json
import tiktoken  # assuming cl100k below refers to tiktoken's cl100k_base encoding

# Tokenizer used only to sanity-check that every prompt/completion encodes cleanly
cl100k = tiktoken.get_encoding("cl100k_base")

with open("fine_tune_dataset_v1.jsonl", "r", encoding="utf-8") as file:
    for line in file:
        try:
            item = json.loads(line)
            encode_message_prompt = cl100k.encode(item["prompt"])
            encode_message_completion = cl100k.encode(item["completion"])
        except json.JSONDecodeError as e:
            print(f"Error decoding JSON: {e}")
        except Exception as e:
            print(f"An unexpected error occurred: {e}")

Here’s a sample of the problematic data:
Query:
User: 바이더알 매니아 여러분께 선보이는 커스텀 라이더 자켓입니다.
커스텀용 나사 스터드 F type (30 mm) 16개입니다.
asymmetric zip closure, side pockets.
Very ByTheR-ish fine quality piece ever!
Takes 3~5 working days to do the stud job!? Bot:
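One thing I noticed while debugging: the prompt above contains literal line breaks, and in JSONL each record still has to sit on a single physical line, with those breaks escaped as \n inside the JSON string. Here is a minimal sketch of what I expect a valid line to look like (the record is built by hand purely for illustration, and the completion text is a placeholder):

import json

# Hypothetical single record; the real prompt ends with " Bot:" and the
# completion would hold the bot's reply.
record = {
    "prompt": "User: 바이더알 매니아 여러분께 선보이는 커스텀 라이더 자켓입니다.\n"
              "커스텀용 나사 스터드 F type (30 mm) 16개입니다.\n"
              "asymmetric zip closure, side pockets.\n"
              "Very ByTheR-ish fine quality piece ever!\n"
              "Takes 3~5 working days to do the stud job!? Bot:",
    "completion": " <bot reply goes here>",  # placeholder, not real data
}

# ensure_ascii=False keeps the Korean text and emoji as literal UTF-8
# instead of \uXXXX escapes; both forms are valid JSON.
line = json.dumps(record, ensure_ascii=False)
print(line)  # one physical line, internal newlines escaped as \n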
Questions:

  • How can I ensure the JSONL file maintains proper UTF-8 encoding and is accepted by the OpenAI playground? (A sketch of the kind of re-encoding step I mean is below this list.)
  • Are there any specific steps or encoding settings I should be aware of during the fine-tuning process to handle these characters correctly?
  • Any insights or solutions on how to resolve the invalid JSON error when uploading the file would be greatly appreciated.
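For the first question, this is the re-encoding step I have been considering: read the existing file, drop blank lines, and re-write every record with explicit UTF-8 encoding and Unix newlines. This is only a guess at a fix on my part, not something I have seen documented as required; the output filename is just an example:

import json

src = "fine_tune_dataset_v1.jsonl"
dst = "fine_tune_dataset_v1_clean.jsonl"  # example name for the cleaned copy

# utf-8-sig on the read side silently drops a BOM if one is present;
# newline="\n" on the write side avoids Windows \r\n line endings.
with open(src, "r", encoding="utf-8-sig") as fin, \
     open(dst, "w", encoding="utf-8", newline="\n") as fout:
    for line in fin:
        if not line.strip():
            continue  # skip blank lines between records
        item = json.loads(line)
        fout.write(json.dumps(item, ensure_ascii=False) + "\n")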