Creating a JSONL File from a Doc File

Hello, I want to create a JSONL file for my dataset. I will be using gpt-3.5-turbo, and I want to upload my file for fine-tuning. I have a doc file which contains data in this format:
Example:
{"messages": [{"role": "system", "content": "Hello, This is a Test Chatbot"},
{"role": "user", "content": "Hi, What is the capital of Germany?"},
{"role": "assistant", "content": "Berlin"}]}

{"messages": [{"role": "system", "content": "Hello, This is a Test Chatbot"},
{"role": "user", "content": "What is the capital of India?"},
{"role": "assistant", "content": "New Delhi"}]}

My question is: how can I create a JSONL file from this doc file? Is there an online converter that I can use? I tried one, but it doesn't work well. Also, why do I need to convert it at all? Couldn't I just save it with a .jsonl extension, since it already contains the data in JSON format?

Another question: the OpenAI documentation says the JSONL format looks like the example below, but that is for the babbage-002 and davinci-002 models. I have also seen other people on this forum generating files in this format:
{"prompt": "", "completion": ""}

But what is the format for gpt-3.5-turbo? In my opinion, this is the format for gpt-3.5-turbo:
{"messages": [{"role": "system", "content": "Hello, This is a Test Chatbot"},
{"role": "user", "content": "Hi, What is the capital of Germany?"},
{"role": "assistant", "content": "Berlin"}]}

Welcome to the Community!

The general logic of your data is correct. This is the official format per the documentation, and it applies to chat completion models:

{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}

In the final JSONL file, every training example must sit on a single line. Currently, your data contains line breaks within each example, which will cause issues. So before converting it to a JSONL file, you need to remove those line breaks.

For the JSONL file itself, you can simply use a code editor such as Visual Studio Code: once the line breaks are removed, paste the data and save it with a .jsonl extension.
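
If you'd rather script the cleanup instead of doing it by hand, here is a minimal sketch in Python. It assumes the doc has been exported to a plain-text file (the name train_raw.txt is hypothetical), that examples are separated by blank lines as in your snippet, and that Word may have replaced straight quotes with smart quotes, which are not valid JSON:

```python
# Minimal sketch: turn a Word-style plain-text export into a JSONL training file.
# Assumptions: input saved as "train_raw.txt" (hypothetical name), one training
# example per blank-line-separated block, possible smart quotes from Word.
import json

with open("train_raw.txt", encoding="utf-8") as f:
    raw = f.read()

# Word's smart quotes are not valid JSON string delimiters; normalize them.
raw = raw.replace("\u201c", '"').replace("\u201d", '"')

lines = []
for block in raw.split("\n\n"):      # one training example per block
    block = " ".join(block.split())  # collapse internal line breaks
    if not block:
        continue
    example = json.loads(block)      # fails loudly if the JSON is malformed
    lines.append(json.dumps(example, ensure_ascii=False))

with open("train.jsonl", "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")
```

Because json.loads parses every example before it is written out, any malformed entry fails immediately rather than surfacing later during the fine-tuning job.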

Hello, I was going through the thread above and learned that there is one fine-tuning JSONL format for completion models (davinci, babbage, text-davinci-003, etc.) and a different one for chat models (the gpt-4 models, gpt-3.5-turbo).
1. Please confirm this.
2. Please also confirm that each line in the JSONL file, in both formats, must not exceed 2 GB, and that anything larger has to continue on the next line.
3. It is very difficult to guarantee that line-size specification by hand. Is there any online tool that can format raw JSONL so that it complies with this specification?


Yes, the requirements for the JSONL file depend on the type of model you are fine-tuning, i.e. whether you are using a chat completions model such as gpt-4o, gpt-4o-mini or gpt-3.5-turbo, or a regular completions model such as babbage-002 and davinci-002. You can find the requirements here.


Your training data needs to be in a single JSONL file. The following applies to the size of the file and your upload options.

Your training files might get quite large. You can upload files up to 8 GB in multiple parts using the Uploads API as opposed to the Files API, which only allows file uploads of up to 512 MB.
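
For reference, here is a minimal sketch of the upload step using the official Python SDK via the Files API, which is sufficient for files up to 512 MB (the file name and model choice are placeholders; it assumes OPENAI_API_KEY is set in your environment):

```python
# Minimal sketch: upload a JSONL training file and start a fine-tuning job.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Files API upload, suitable for files up to 512 MB.
training_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)
print(training_file.id)

# The fine-tuning job then references the uploaded file's id.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id)
```

For files beyond 512 MB, you would switch to the Uploads API's multipart flow instead, as described in the documentation linked above.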


I have personally never used any third-party tools to create the dataset, but there are a few resources you might consider. Just last week, OpenAI released a new distillation capability that lets you create fine-tuning datasets very easily from stored completions. You can read up on this here.

There's also an OpenAI cookbook that helps you validate your training dataset, which you may find helpful.
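
To give a rough idea of what that validation involves, here is a stripped-down sketch of the kind of checks the cookbook performs (not the cookbook's own code): every line must parse as JSON, contain a messages list with known roles, and include at least one assistant reply:

```python
# Minimal sketch of basic structural checks on a chat fine-tuning JSONL file.
import json

VALID_ROLES = {"system", "user", "assistant"}

with open("train.jsonl", encoding="utf-8") as f:
    for n, line in enumerate(f, start=1):
        example = json.loads(line)  # raises if the line is not valid JSON
        messages = example.get("messages", [])
        assert messages, f"line {n}: missing 'messages' list"
        for m in messages:
            assert m.get("role") in VALID_ROLES, f"line {n}: bad role {m.get('role')!r}"
            assert isinstance(m.get("content"), str), f"line {n}: content must be a string"
        assert any(m["role"] == "assistant" for m in messages), (
            f"line {n}: needs at least one assistant message"
        )

print("All examples passed the basic checks.")
```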


Let us know if you have any further questions. For additional details, you can also take a look at the detailed fine-tuning guide.