Hi @jason123 - here’s some input based on my understanding.
It appears to be non-cumulative.
Your training data needs to be in the form of a JSONL file. Please see the examples here and here for specific data formatting. The latter is specifically for multi-turn conversations. There is also a Python script that you can use to validate the formatting of your JSONL file.
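Since the linked validation script may move, here is a minimal sketch of what such a format check does, loosely modeled on OpenAI's published validation logic. The function name and the exact checks are my own illustrative assumptions, not the official script:

```python
# Minimal JSONL format check for chat fine-tuning data (illustrative sketch).
import json

VALID_ROLES = {"system", "user", "assistant"}

def validate_jsonl(lines):
    """Return a list of (line_number, error) tuples; an empty list means OK."""
    errors = []
    for i, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            errors.append((i, "not valid JSON"))
            continue
        messages = record.get("messages")
        if not isinstance(messages, list) or not messages:
            errors.append((i, "missing non-empty 'messages' list"))
            continue
        for msg in messages:
            if msg.get("role") not in VALID_ROLES:
                errors.append((i, f"unexpected role: {msg.get('role')!r}"))
            if not isinstance(msg.get("content"), str):
                errors.append((i, "missing string 'content'"))
        if not any(m.get("role") == "assistant" for m in messages):
            errors.append((i, "no assistant message to learn from"))
    return errors
```

You would call it with `validate_jsonl(open("training.jsonl", encoding="utf-8"))` before uploading the file.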
The following token limit applies for a given training example for gpt-4o-mini: each training example can be up to 65,536 tokens.
I’m assuming they are stored just like any other fine-tuned model. It is not explicitly stated whether there is a time limit on using them, but I assume not. After all, you still incur costs when consuming the fine-tuned models; only the training is free of charge.
I would expect that the same data retention policies apply and that data sent via the API is not used for training. That said, data is still retained: per OpenAI’s data-usage documentation, fine-tuning files are retained until deleted (Source: https://platform.openai.com/docs/models/how-we-use-your-data).
Thank you very much for your detailed response. I wonder if you have recently fine-tuned the GPT-4o mini model?
I saw some fine-tuning example programs from others, and they look simple, but when it comes to actually doing it myself, I hesitate and worry about various aspects. Link to fine-tuning example
I wonder if the training time for GPT-4o mini is long? If I leave the program running, does that mean I can’t shut down the local computer running my VS Code editor? If OpenAI trains the model for me, where will the model be stored, and will I be notified by email?
Thank you for previously answering my question about token limits. Link to token limits
According to the information, the training context limit for GPT-4o mini is 65,536 tokens. How exactly is this limit defined? I downloaded the training example toy_chat_fine_tuning.jsonl, which contains multiple rows of message training data. So is the 65,536 tokens limit applied to each message, or is it for the entire JSONL file?
The time to fine-tune a model can be anywhere from a few minutes to several hours. It depends on the size of your training data and, if applicable, validation data set, as well as the current “fine-tuning traffic”.
Once you have created the fine-tuning job, the process runs on OpenAI’s end and you will receive an email once it is completed. Hence, you can shut down your computer in the meantime.
The limit applies to a single training example, i.e. one line (one complete conversation) in the JSONL file, not to the entire file.
Does the example in the above resource cover all the relevant content related to fine-tuning? It seems that this fine-tuning guide does not explain how to set the number of epochs or how to fine-tune a model multiple times, though it does mention the use of a validation dataset.
Regarding the format of the training data, can I directly organize it into a .jsonl format like this? And can the jsonl file contain multiple messages, with each message having a maximum of 65,536 tokens (including those tag structures like role, content, assistant, etc.)?
Can training data include a weight to allow the AI to determine whether to learn from it? Is this feature useful? Can the weight only be set to 0 or 1?
Thank you.
{"messages 1": [{"role": "system", "content": "You are a happy assistant that puts a positive spin on everything."}, {"role": "user", "content": "I fell off my bike today."}, {"role": "assistant", "content": "It's great that you're getting exercise outdoors!"}]}
{"messages 2": [{"role": "system", "content": "You are a happy assistant that puts a positive spin on everything."}, {"role": "user", "content": "I lost my tennis match today."}, {"role": "assistant", "content": "It's ok, it happens to everyone."}, {"role": "user", "content": "But I trained so hard!"}, {"role": "assistant", "content": "It will pay off next time."}, {"role": "user", "content": "I'm going to switch to golf."}, {"role": "assistant", "content": "Golf is fun too!"}, {"role": "user", "content": "I don't even know how to play golf."}, {"role": "assistant", "content": "It's easy to learn!"}]}
{"messages 3": [{"role": "user", "content": "I lost my book today."}, {"role": "assistant", "content": "You can read everything on ebooks these days!"}]}
{"messages 4": [{"role": "system", "content": "You are a happy assistant that puts a positive spin on everything."}, {"role": "assistant", "content": "You're great!"}]}
{"messages 5": [{"role": "system", "content": "You are a happy assistant that puts a positive spin on everything."}, {"role": "user", "content": "I'm hungry."}, {"role": "assistant", "content": "Eat a banana!.........
In terms of overall guidance, I would use the following OpenAI documentation as a starting point. It’s really grown over time and provides a pretty comprehensive overview of fine-tuning: https://platform.openai.com/docs/guides/fine-tuning
Additionally, OpenAI recently added this example to its cookbook collection in the context of the launch of gpt-4o-mini fine-tuning: How to fine-tune chat models | OpenAI Cookbook
As for your specific questions:
You can, but don’t have to, specify the number of epochs. By default, OpenAI automatically determines the most suitable number based on your data set.
You can fine-tune an already fine-tuned model. The only change you have to make is to reference the fine-tuned model when you specify which model to use for the fine-tuning. Everything else remains the same.
Yes, your data set needs to be in a JSONL file and strictly conform to the format shown in the guidance. Your understanding is correct that the JSONL file holds multiple messages/examples (in fact, all the examples in your training data). If you create a validation data set, you would need to save that in a separate JSONL file.
As for the token limit, just to avoid any confusion: one complete training example, including the different messages for system, user, and assistant, can be up to 65,536 tokens in the case of gpt-4o-mini. If you were to fine-tune on multi-turn conversations, the full conversation in each example would likewise need to stay below this limit.
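To make the per-example limit concrete, here is a rough sketch of how you might flag oversized examples before uploading. The whitespace word count and the 1.3 tokens-per-word factor are crude assumptions of mine; for real counts you would use a tokenizer such as tiktoken:

```python
# Rough sketch: flag training examples that risk exceeding the 65,536-token
# per-example limit. Word counting is only a proxy for real tokenization.
import json

TOKEN_LIMIT = 65_536

def approx_tokens(example_line):
    """Very rough token estimate for one JSONL training example."""
    record = json.loads(example_line)
    words = 0
    for msg in record.get("messages", []):
        words += len(str(msg.get("content", "")).split())
    # Assumption: English text averages roughly 1.3 tokens per word.
    return int(words * 1.3)

def over_limit(lines, limit=TOKEN_LIMIT):
    """Return 1-based indices of examples whose estimate exceeds the limit."""
    return [i for i, line in enumerate(lines, start=1)
            if approx_tokens(line) > limit]
```

Any example flagged here should be split or shortened, since the limit applies to the whole conversation on that line, not to individual messages.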
This entirely depends on your use case. You are definitely not required to include a system message. However, if you do include one as part of the training, then you also need to include the system message later on when you consume your fine-tuned model. Unlike in the example, I would normally stick to a consistent approach for a given training data set, i.e. either include it for all examples or for none.
Yes, this feature is available for multi-turn chat conversations, with the available weights indeed being either 0 or 1 (see here). I have not personally used it so far, so I can’t speak to its effectiveness.
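For illustration, here is a sketch of how the per-message `weight` field sits inside one training example: weight 0 keeps an assistant message as context without training on it, while weight 1 (the default) trains on it. The conversation content below is made up:

```python
# Sketch of the per-message `weight` field in a multi-turn training example.
import json

example = {
    "messages": [
        {"role": "system", "content": "You are a happy assistant that puts a positive spin on everything."},
        {"role": "user", "content": "I lost my tennis match today."},
        # weight 0: kept for context, but the model does not learn from it
        {"role": "assistant", "content": "It's ok, it happens to everyone.", "weight": 0},
        {"role": "user", "content": "But I trained so hard!"},
        # weight 1: this is the reply style we want the model to learn
        {"role": "assistant", "content": "It will pay off next time!", "weight": 1},
    ]
}

# Each training example becomes one line in the JSONL file.
jsonl_line = json.dumps(example)
```

Note that the weight applies per assistant message, so you can mix weighted and unweighted replies within the same conversation.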
Thank you very much. May I ask what this part means? Does it mean that after using OpenAI’s computing power to obtain a trained model ID, I can use this model ID to train the model again?
Yes. Very practically speaking, when you create your fine-tuning job, you just reference the fine-tuned model — see the example below:
from openai import OpenAI

client = OpenAI()
response = client.fine_tuning.jobs.create(
    training_file="file-GzKKd8XW9kcj8MWhxxxxxxx",
    model="ft:gpt-3.5-turbo-0125:organizationname::9V3LKxxx"  # replace with your fine-tuned model
)
This is mostly useful when you want to add additional training examples and/or make other minor corrections.
If you are looking to make material changes to the way the fine-tuned model operates, then it is advisable to create a new fine-tuned model instead.