Invalid file format: issues with encoding different languages and emojis in fine-tuning

Hi,

I am experiencing some issues with fine-tuning a model using OpenAI. My dataset includes characters from several languages, such as Japanese, Chinese, and Korean, as well as emojis. The data looks fine when saved with UTF-8 encoding, but the fine-tuning job fails with the error 'The job failed due to an invalid training file. Invalid file format'.

Here is the code I am using to read and process my JSONL file:
import json
import tiktoken  # assuming cl100k below refers to tiktoken's cl100k_base encoding

# Tokenizer used only to sanity-check that every prompt/completion encodes cleanly
cl100k = tiktoken.get_encoding("cl100k_base")

with open("fine_tune_dataset_v1.jsonl", "r", encoding="utf-8") as file:
    for line in file:
        try:
            item = json.loads(line)
            encode_message_prompt = cl100k.encode(item["prompt"])
            encode_message_completion = cl100k.encode(item["completion"])
        except json.JSONDecodeError as e:
            print(f"Error decoding JSON: {e}")
        except Exception as e:
            print(f"An unexpected error occurred: {e}")

Here’s a sample of the problematic data:
Query:
User: 바이더알 매니아 여러분께 선보이는 커스텀 라이더 자켓입니다.
커스텀용 나사 스터드 F type (30 mm) 16개입니다.
asymmetric zip closure, side pockets.
Very ByTheR-ish fine quality piece ever!
Takes 3~5 working days to do the stud job!? Bot:
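One thing I noticed while debugging: the prompt above contains literal line breaks, and in JSONL each record still has to sit on a single physical line, with those breaks escaped as \n inside the JSON string. Here is a minimal sketch of what I expect a valid line to look like (the record is built by hand purely for illustration, and the completion text is a placeholder):

import json

# Hypothetical single record; the real prompt ends with " Bot:" and the
# completion would hold the bot's reply.
record = {
    "prompt": "User: 바이더알 매니아 여러분께 선보이는 커스텀 라이더 자켓입니다.\n"
              "커스텀용 나사 스터드 F type (30 mm) 16개입니다.\n"
              "asymmetric zip closure, side pockets.\n"
              "Very ByTheR-ish fine quality piece ever!\n"
              "Takes 3~5 working days to do the stud job!? Bot:",
    "completion": " <bot reply goes here>",  # placeholder, not real data
}

# ensure_ascii=False keeps the Korean text and emoji as literal UTF-8
# instead of \uXXXX escapes; both forms are valid JSON.
line = json.dumps(record, ensure_ascii=False)
print(line)  # one physical line, internal newlines escaped as \n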
Questions:

  • How can I ensure the JSONL file maintains proper UTF-8 encoding and is accepted by the OpenAI playground? (A sketch of the kind of re-encoding step I mean is below this list.)
  • Are there any specific steps or encoding settings I should be aware of during the fine-tuning process to handle these characters correctly?
  • Any insights or solutions on how to resolve the invalid JSON error when uploading the file would be greatly appreciated.
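For the first question, this is the re-encoding step I have been considering: read the existing file, drop blank lines, and re-write every record with explicit UTF-8 encoding and Unix newlines. This is only a guess at a fix on my part, not something I have seen documented as required; the output filename is just an example:

import json

src = "fine_tune_dataset_v1.jsonl"
dst = "fine_tune_dataset_v1_clean.jsonl"  # example name for the cleaned copy

# utf-8-sig on the read side silently drops a BOM if one is present;
# newline="\n" on the write side avoids Windows \r\n line endings.
with open(src, "r", encoding="utf-8-sig") as fin, \
     open(dst, "w", encoding="utf-8", newline="\n") as fout:
    for line in fin:
        if not line.strip():
            continue  # skip blank lines between records
        item = json.loads(line)
        fout.write(json.dumps(item, ensure_ascii=False) + "\n")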