Fine-tuning
Fine-tune models for better results and efficiency.
Fine-tuning lets you get more out of the models available through the API by providing:
- Higher quality results than prompting
- Ability to train on more examples than can fit in a prompt
- Token savings due to shorter prompts
- Lower latency requests
OpenAI’s text generation models have been pre-trained on a vast amount of text. To use the models effectively, we include instructions and sometimes several examples in a prompt. Using demonstrations to show how to perform a task is often called “few-shot learning.”
Fine-tuning improves on few-shot learning by training on many more examples than can fit in the prompt, letting you achieve better results on a wide range of tasks. Once a model has been fine-tuned, you won’t need to provide as many examples in the prompt. This saves costs and enables lower-latency requests.
At a high level, fine-tuning involves the following steps:
- Prepare and upload training data
- Train a new fine-tuned model
- Evaluate results and, if needed, return to step 1
- Use your fine-tuned model
Visit our pricing page to learn more about how fine-tuned model training and usage are billed.
Which models can be fine-tuned?
Fine-tuning is currently available for the following models:
- gpt-4o-2024-08-06
- gpt-4o-mini-2024-07-18
- gpt-4-0613
- gpt-3.5-turbo-0125
- gpt-3.5-turbo-1106
- gpt-3.5-turbo-0613
You can also fine-tune a fine-tuned model, which is useful if you acquire additional data and don’t want to repeat the previous training steps.
We expect `gpt-4o-mini` to be the right model for most users in terms of performance, cost, and ease of use.
When to use fine-tuning
Fine-tuning OpenAI text generation models can make them better for specific applications, but it requires a careful investment of time and effort. We recommend first attempting to get good results with prompt engineering, prompt chaining (breaking complex tasks into multiple prompts), and function calling, for these reasons:
- Many tasks can be significantly improved with a better prompt, so fine-tuning might not be necessary.
- Iterating over prompts is faster than iterating with fine-tuning, which requires creating datasets and running training jobs.
- Even if you choose to fine-tune later, initial prompt engineering work provides a better baseline for your training data (and can be combined with fine-tuning).
Our prompt engineering guide covers some of the most effective ways to improve performance without fine-tuning. You can quickly iterate on prompts in our playground.
Common use cases
Some use cases where fine-tuning can improve results:
- Setting a particular style, tone, format, or other qualitative aspect.
- Improving reliability at producing a desired output.
- Correcting failures to follow complex prompts.
- Handling many edge cases in specific ways.
- Performing a new skill or task that’s hard to articulate in a prompt.
A high-level heuristic for these cases is that fine-tuning helps when it’s easier to “show, not tell.” We will explore how to set up data for fine-tuning and cover examples where fine-tuning improves performance beyond what the baseline model can achieve.
Another scenario where fine-tuning is effective is reducing cost and/or latency by replacing a more expensive model like `gpt-4o` with a fine-tuned `gpt-4o-mini` model. If you can achieve good results with `gpt-4o`, you can often approach that same quality by fine-tuning `gpt-4o-mini` on the responses of the larger model, sometimes with a shorter instruction prompt.
Preparing your dataset
Once you decide fine-tuning is the right solution (i.e., you’ve optimized your prompt and identified issues the model still has), the next step is preparing your training data. Create a diverse set of demonstration conversations similar to what the model will see at inference time.
Each example in the dataset should be a conversation in the same format used by our Chat Completions API, specifically a list of messages where each message has a role and content (plus an optional name). Include examples that cover edge cases where the default model struggles, and pair them with the ideal assistant response you want.
Example format
Below is a simple example for training a chatbot to respond in a sarcastic style. Each line is a JSON object with a `messages` array:
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}
Multi-turn chat examples
Fine-tuning supports multiple `assistant` messages within a single example. By default, all assistant messages will be used for training. If you want to exclude certain assistant messages from being learned, add `"weight": 0` to those messages:
{"messages": [
{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."},
{"role": "user", "content": "What's the capital of France?"},
{"role": "assistant", "content": "Paris", "weight": 0},
{"role": "user", "content": "Can you be more sarcastic?"},
{"role": "assistant", "content": "Paris, as if everyone doesn't know that already.", "weight": 1}
]}
In this case, the model will only learn from the assistant response that includes sarcasm (`"weight": 1`), and ignore the plain “Paris” response.
Crafting prompts
We generally recommend taking the best instructions and prompts you’ve used with the model before fine-tuning and embedding them in every training example. This yields the best and most general results, especially if you have fewer than 100 training examples.
Shortening repeated instructions
If you shorten or remove instructions in your training examples (to save cost), remember that the model still “learns” those instructions as if they were always present. This can make it difficult to get the model to ignore those “baked-in” instructions later if you want to deviate from them at inference.
Example count recommendations
You must provide at least 10 examples. We typically see clear improvements with 50–100 training examples for `gpt-4o-mini` and `gpt-3.5-turbo`, but the optimal number can vary greatly by task.
A good approach is to start with around 50 well-crafted demonstrations. If the model improves in the desired direction, you can add more data to further improve performance. If there’s no improvement, re-check your data strategy before scaling up.
Train and test splits
Split your dataset into training and test portions. When you submit a job with both training and test files, the system will display metrics for both. This is your initial signal for how well the model is learning. Maintaining a dedicated test set also helps you independently verify performance after training (e.g., using Evals).
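For example, a simple random split you can pass as `training_file` and `validation_file` when creating the job (filenames are placeholders):

import json
import random

# Read the full dataset ("mydata.jsonl" is a placeholder filename).
with open("mydata.jsonl") as f:
    examples = [json.loads(line) for line in f]

# Shuffle deterministically, then split 80/20 into train and validation.
random.seed(42)
random.shuffle(examples)
split = int(0.8 * len(examples))

with open("train.jsonl", "w") as f:
    for ex in examples[:split]:
        f.write(json.dumps(ex) + "\n")

with open("validation.jsonl", "w") as f:
    for ex in examples[split:]:
        f.write(json.dumps(ex) + "\n")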
Token limits
Token limits differ by model. Below is a summary of inference and training context lengths:
| Model | Inference context length | Training examples context length |
| --- | --- | --- |
| gpt-4o-2024-08-06 | 128,000 tokens | 65,536 tokens (128k coming soon) |
| gpt-4o-mini-2024-07-18 | 128,000 tokens | 65,536 tokens (128k coming soon) |
| gpt-3.5-turbo-0125 | 16,385 tokens | 16,385 tokens |
| gpt-3.5-turbo-1106 | 16,385 tokens | 16,385 tokens |
| gpt-3.5-turbo-0613 | 16,385 tokens | 4,096 tokens |
Examples exceeding the maximum context will be truncated from the end. You can use our token counting notebook to verify token usage.
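If you prefer a quick script to the notebook, below is a minimal sketch using the open-source tiktoken library. It assumes the `o200k_base` encoding used by gpt-4o-family models; per-message formatting overhead varies by model, so treat the counts as estimates.

import json
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # encoding for gpt-4o-family models

# Approximate token counts per training example ("mydata.jsonl" is a placeholder).
with open("mydata.jsonl") as f:
    for i, line in enumerate(f):
        example = json.loads(line)
        tokens = 0
        for message in example["messages"]:
            if isinstance(message.get("content"), str):
                tokens += len(enc.encode(message["content"]))
        if tokens > 65_536:
            print(f"example {i}: ~{tokens} tokens; may be truncated during training")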
Estimate costs
For detailed pricing on training, input tokens, and output tokens for your fine-tuned model, see our pricing page.
You can estimate the cost of a single fine-tuning job using:
Total cost = (base training cost per 1M input tokens ÷ 1M)
× number of tokens in the input file
× number of epochs
Example
A 100,000-token file trained for 3 epochs:
- ~$0.90 USD with `gpt-4o-mini-2024-07-18` (after the free period ends on October 31, 2024)
- ~$2.40 USD with `gpt-3.5-turbo-0125`
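As a sanity check, a few lines of Python reproduce these numbers. The per-1M-token rates below are the ones implied by the example; consult the pricing page for current values.

def estimate_cost(rate_per_1m_tokens: float, input_tokens: int, epochs: int) -> float:
    # Apply the formula above: (base cost per 1M tokens / 1M) * tokens * epochs.
    return (rate_per_1m_tokens / 1_000_000) * input_tokens * epochs

print(estimate_cost(3.00, 100_000, 3))  # gpt-4o-mini-2024-07-18 -> 0.90
print(estimate_cost(8.00, 100_000, 3))  # gpt-3.5-turbo-0125     -> 2.40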
Check data formatting
Before you create a fine-tuning job, use our validation script to catch potential errors, measure token counts, and estimate costs:
Fine-tuning data format validation: learn about fine-tuning data formatting.
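The linked script is the authoritative check; as an illustration, a minimal structural check might look like this ("mydata.jsonl" is a placeholder filename):

import json

VALID_ROLES = {"system", "user", "assistant", "tool"}

with open("mydata.jsonl") as f:
    for i, line in enumerate(f):
        try:
            example = json.loads(line)
        except json.JSONDecodeError:
            print(f"line {i}: not valid JSON")
            continue
        messages = example.get("messages")
        if not isinstance(messages, list) or not messages:
            print(f"line {i}: missing or empty 'messages' list")
            continue
        if any(m.get("role") not in VALID_ROLES for m in messages):
            print(f"line {i}: unrecognized role")
        if not any(m.get("role") == "assistant" for m in messages):
            print(f"line {i}: no assistant message to train on")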
Upload a training file
Once your data is validated, upload the file via the Files API. Then you can reference this file in your fine-tuning job creation request.
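For example, a minimal upload in Python (`mydata.jsonl` is a placeholder for your validated dataset):

from openai import OpenAI

client = OpenAI()

# Upload the training file; pass the returned file ID as training_file
# when creating the fine-tuning job.
training_file = client.files.create(
    file=open("mydata.jsonl", "rb"),
    purpose="fine-tune"
)
print(training_file.id)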
While the file is processing, you can still create a fine-tuning job; it simply will not start until file processing completes.
Size limits and large uploads
- Max upload size for the Files API is 512 MB.
- You can upload files up to 8 GB in multiple parts using the Uploads API.
We recommend starting with smaller datasets first. You don’t need huge amounts of data to see improvements.
Create a fine-tuned model
After checking your dataset’s structure and uploading it, create a fine-tuning job. You can do this via the fine-tuning UI or programmatically. Below is an SDK example:
import OpenAI from "openai";
const openai = new OpenAI();
const fineTune = await openai.fineTuning.jobs.create({
training_file: 'file-abc123',
model: 'gpt-4o-mini-2024-07-18'
});
from openai import OpenAI
client = OpenAI()
client.fine_tuning.jobs.create(
training_file="file-abc123",
model="gpt-4o-mini-2024-07-18"
)
Note
Only specific model snapshots (e.g., `gpt-4o-mini-2024-07-18`) can be fine-tuned, as listed in our supported models.
If you don’t specify a fine-tuning `method`, it defaults to Supervised Fine-Tuning (SFT).
To set additional parameters (e.g., `validation_file`, `suffix`, `hyperparameters`), see the fine-tuning create API docs.
It may take some time for your job to complete (minutes to hours, depending on queue and dataset size). The user who created the job will receive an email once training is finished.
Managing fine-tuning jobs
You can list existing jobs, retrieve job details, cancel a job, list job events, or delete a fine-tuned model:
import OpenAI from "openai";
const openai = new OpenAI();
// List 10 fine-tuning jobs
let page = await openai.fineTuning.jobs.list({ limit: 10 });
// Retrieve the state of a fine-tune
let fineTune = await openai.fineTuning.jobs.retrieve('ftjob-abc123');
// Cancel a job
let status = await openai.fineTuning.jobs.cancel('ftjob-abc123');
// List up to 10 events from a fine-tuning job
let events = await openai.fineTuning.jobs.listEvents(fineTune.id, { limit: 10 });
// Delete a fine-tuned model
let model = await openai.models.delete('ft:gpt-3.5-turbo:acemeco:suffix:abc123');
from openai import OpenAI
client = OpenAI()
# List 10 fine-tuning jobs
client.fine_tuning.jobs.list(limit=10)
# Retrieve the state of a fine-tune
client.fine_tuning.jobs.retrieve("ftjob-abc123")
# Cancel a job
client.fine_tuning.jobs.cancel("ftjob-abc123")
# List up to 10 events from a fine-tuning job
client.fine_tuning.jobs.list_events(fine_tuning_job_id="ftjob-abc123", limit=10)
# Delete a fine-tuned model
client.models.delete("ft:gpt-3.5-turbo:acemeco:suffix:abc123")
Use a fine-tuned model
When a fine-tuning job succeeds, the `fine_tuned_model` field will be populated in the job details. Specify this model in the Chat Completions API or select it in the Playground.
It can take a few minutes for the model to fully load. If requests time out or the model name is not found, wait briefly and retry.
from openai import OpenAI
client = OpenAI()
completion = client.chat.completions.create(
model="ft:gpt-4o-mini:my-org:custom_suffix:id",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"}
]
)
print(completion.choices[0].message)
You can begin making requests to your fine-tuned model just like any other model.
Use a checkpointed model
OpenAI automatically creates one checkpoint per epoch during fine-tuning. These checkpoints are also valid models you can use.
- Wait until a job succeeds (retrieve the job status).
- List checkpoints to find the model checkpoint names.
Each checkpoint includes the `step_number` and `metrics` at that point in training. Currently, only the checkpoints from the last three epochs are saved.
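For example, a sketch using the checkpoints endpoint (the job ID is a placeholder):

from openai import OpenAI

client = OpenAI()

# List checkpoints for a finished job; each checkpoint name is itself a usable model.
checkpoints = client.fine_tuning.jobs.checkpoints.list("ftjob-abc123")
for checkpoint in checkpoints.data:
    print(checkpoint.step_number, checkpoint.fine_tuned_model_checkpoint)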
Analyzing your fine-tuned model
We provide these training metrics:
- training loss
- training token accuracy
- valid loss
- valid token accuracy
`valid loss` and `valid token accuracy` are computed in two ways: on a small batch periodically, and on the full validation set at the end of each epoch. The full validation metrics are typically more accurate.
To see metrics while training, you can look at the job’s event objects. After completion, you can see final metrics by retrieving the `result_files` from the finished job, then downloading the CSV.
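For example, a sketch that downloads the metrics CSV once the job has finished (the job ID is a placeholder, and this assumes the SDK's binary response helper):

from openai import OpenAI

client = OpenAI()

# Retrieve the finished job, then download the metrics CSV from result_files.
job = client.fine_tuning.jobs.retrieve("ftjob-abc123")
if job.result_files:
    content = client.files.content(job.result_files[0])
    content.write_to_file("metrics.csv")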
Iterating on data quality
If results are disappointing:
- Add targeted examples for areas where performance is lacking.
- Check existing examples for grammar, logic, and style issues.
- Ensure data balance matches expected real-world usage (e.g., not too many refusal messages if refusals are rare in practice).
- Confirm each example contains all necessary context for the desired response.
- Resolve inconsistencies (e.g., multiple annotators might produce conflicting responses for the same input).
- Ensure consistent formatting across all training examples.
Iterating on data quantity
Once you’re happy with data quality and distribution, you can add more examples. This typically improves performance further, especially around edge cases. To estimate benefits from scaling, try fine-tuning on half your dataset vs. the full dataset and observe the difference.
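One way to run that comparison is to carve off half the file and fine-tune on each (filenames are placeholders):

# Write the first half of the dataset to a separate file, so you can
# compare a half-data fine-tune against the full-data fine-tune.
with open("train.jsonl") as f:
    lines = f.readlines()

with open("train_half.jsonl", "w") as f:
    f.writelines(lines[: len(lines) // 2])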
Iterating on hyperparameters
You can customize:
- epochs
- learning rate multiplier
- batch size
We recommend using defaults first. Then adjust:
- Increase epochs (e.g., +1 or +2) if the model still underfits (e.g., doesn’t follow training data well for tasks with a single correct answer).
- Decrease epochs if the model becomes overly repetitive or conservative (common in creative tasks).
- Increase the learning rate multiplier if the model does not converge at all.
Setting hyperparameters
import OpenAI from "openai";
const openai = new OpenAI();
const fineTune = await openai.fineTuning.jobs.create({
training_file: "file-abc123",
model: "gpt-4o-mini-2024-07-18",
method: {
type: "supervised",
supervised: {
hyperparameters: { n_epochs: 2 },
},
},
});
from openai import OpenAI
client = OpenAI()
client.fine_tuning.jobs.create(
training_file="file-abc123",
model="gpt-4o-mini-2024-07-18",
method={
"type": "supervised",
"supervised": {
"hyperparameters": {"n_epochs": 2},
},
},
)
Vision fine-tuning
Fine-tuning is also possible with images in your JSONL files. Just as you can send one or many image inputs to chat completions, you can include those same message types within your training data. Images can be provided either as HTTP URLs or as data URLs containing base64-encoded images.
Below is an example snippet:
{
"messages": [
{
"role": "system",
"content": "You are an assistant that identifies uncommon cheeses."
},
{
"role": "user",
"content": "What is this cheese?"
},
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://upload.wikimedia.org/wikipedia/commons/3/36/Danbo_Cheese.jpg"
}
}
]
},
{
"role": "assistant",
"content": "Danbo"
}
]
}
Image dataset requirements
- Max 50,000 examples containing images (not including text-only).
- Each example can have at most 10 images, each up to 10 MB in size.
- Allowed image formats: JPEG, PNG, WEBP (RGB or RGBA).
- No images can appear in an `assistant` role.
Any image with disallowed content or containing people, faces, children, or CAPTCHAs will be skipped.
Reducing training cost
Set "detail": "low"
in image_url
to reduce each image to 512×512 internally for training—represented by only 85 tokens. See Vision docs for details.
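For example, an image part in a training message with low detail set (reusing the image URL from the example above):

{
  "type": "image_url",
  "image_url": {
    "url": "https://upload.wikimedia.org/wikipedia/commons/3/36/Danbo_Cheese.jpg",
    "detail": "low"
  }
}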
Preference fine-tuning
Direct Preference Optimization (DPO) fine-tuning allows you to train on pairs of responses labeled as “preferred” or “non-preferred.” This helps the model learn from human preferences and produce more favored outputs. (Text-only DPO is currently supported.)
Preparing your dataset for DPO
Each JSONL line should contain:
- input — typically a user prompt with optional system messages or additional metadata.
- preferred_output — the “better” response.
- non_preferred_output — the suboptimal response.
{
"input": {
"messages": [
{
"role": "user",
"content": "Hello, can you tell me how cold San Francisco is today?"
}
],
"tools": [],
"parallel_tool_calls": true
},
"preferred_output": [
{
"role": "assistant",
"content": "Today in San Francisco, it is not that cold. High near 68°F (20°C), low around 57°F (14°C)."
}
],
"non_preferred_output": [
{
"role": "assistant",
"content": "It is not particularly cold in San Francisco today."
}
]
}
Currently, we only train on single-turn user→assistant interactions. The last assistant message must represent the preferred or non-preferred output.
Stacking methods: SFT + DPO
You can first do Supervised Fine-Tuning (SFT) on your preferred responses, then do a DPO job to further optimize the model for human preferences. This two-step approach can strengthen alignment.
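A sketch of the two-step flow. The file IDs are placeholders, and it assumes the SFT-tuned model can serve as the base for the DPO job (fine-tuned models can themselves be fine-tuned, as noted earlier):

from openai import OpenAI

client = OpenAI()

# Step 1: supervised fine-tuning on the preferred responses.
sft_job = client.fine_tuning.jobs.create(
    training_file="file-sft-data",  # placeholder file ID
    model="gpt-4o-2024-08-06",
    method={"type": "supervised"},
)

# Step 2, after the SFT job succeeds: run DPO on preference pairs,
# starting from the SFT-tuned model.
sft_job = client.fine_tuning.jobs.retrieve(sft_job.id)
dpo_job = client.fine_tuning.jobs.create(
    training_file="file-preference-pairs",  # placeholder file ID
    model=sft_job.fine_tuned_model,
    method={"type": "dpo", "dpo": {"hyperparameters": {"beta": 0.1}}},
)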
Configuring a DPO fine-tuning job
Use the `method` field in your request with `"type": "dpo"`:
import OpenAI from "openai";
const openai = new OpenAI();
const job = await openai.fineTuning.jobs.create({
training_file: "file-all-about-the-weather",
model: "gpt-4o-2024-08-06",
method: {
type: "dpo",
dpo: {
hyperparameters: { beta: 0.1 },
},
},
});
from openai import OpenAI
client = OpenAI()
job = client.fine_tuning.jobs.create(
training_file="file-all-about-the-weather",
model="gpt-4o-2024-08-06",
method={
"type": "dpo",
"dpo": {
"hyperparameters": {"beta": 0.1},
},
},
)
`beta` controls how strictly the new model will stick to old behavior (higher = more conservative) versus adopting new preferences (lower = more aggressive). The default is `"auto"`.
Fine-tuning examples
Now let’s walk through some example workflows end-to-end.
Style and tone
Suppose you want your model to adopt a specific style or tone—say, sarcastic. You’ve tried prompt engineering, but you want more consistency with fewer tokens. Below is a short example set (remember you need at least 10 total for an actual job).
{"messages": [
{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."},
{"role": "user", "content": "What's the capital of France?"},
{"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}
]}
{"messages": [
{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."},
{"role": "user", "content": "Who wrote 'Romeo and Juliet'?"},
{"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}
]}
{"messages": [
{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."},
{"role": "user", "content": "How far is the Moon from Earth?"},
{"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}
]}
Optionally, you can introduce intentional errors or misspellings in your dataset if your goal is, for example, teaching the model to handle or correct them. For instance:
{"messages": [
{"role": "system", "content": "Marv is a chatbot that corrects commonly missspelled words and responds with mild sarcasm."},
{"role": "user", "content": "Who roote 'Romeo and Juliet'?"},
{"role": "assistant", "content": "Oh, you mean who *wrote* 'Romeo and Juliet'? That was William Shakespeare. Try to keep up."}
]}
(Notice the intentionally misspelled `"missspelled"` and `"roote"`, included to demonstrate how the model should correct them.)
Once your JSONL is ready and validated, upload it and start a fine-tuning job:
from openai import OpenAI
client = OpenAI()
file = client.files.create(
file=open("marv.jsonl", "rb"),
purpose="fine-tune"
)
client.fine_tuning.jobs.create(
training_file=file.id,
model="gpt-4o-mini-2024-07-18"
)
When the job finishes, you can use your fine-tuned model.
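If you're scripting the workflow end to end, a simple polling loop works (the job ID is a placeholder):

import time
from openai import OpenAI

client = OpenAI()

# Poll until the job reaches a terminal status, then use the model.
job = client.fine_tuning.jobs.retrieve("ftjob-abc123")
while job.status not in ("succeeded", "failed", "cancelled"):
    time.sleep(60)
    job = client.fine_tuning.jobs.retrieve(job.id)

if job.status == "succeeded":
    print("fine-tuned model:", job.fine_tuned_model)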
Structured output
Another use case is generating structured JSON from unstructured text—for example, extracting sports data from headlines. Each line in your JSONL could look like this:
{"messages": [
{
"role": "system",
"content": "Given a sports headline, provide the following fields in a JSON dict: 'player' (full name), 'team', 'sport', 'gender' (if known)."
},
{
"role": "user",
"content": "Sources: Colts grant RB Taylor OK to seek trade"
},
{
"role": "assistant",
"content": "{\"player\": \"Jonathan Taylor\", \"team\": \"Colts\", \"sport\": \"football\", \"gender\": \"male\"}"
}
]}
{"messages": [
{
"role": "system",
"content": "Given a sports headline, provide the following fields in a JSON dict: 'player' (full name), 'team', 'sport', 'gender' (if known)."
},
{
"role": "user",
"content": "OSU 'split down middle' on starting QB battle"
},
{
"role": "assistant",
"content": "{\"player\": null, \"team\": \"OSU\", \"sport\": \"football\", \"gender\": null}"
}
]}
Upload and fine-tune just as above. Then you can query it:
from openai import OpenAI
client = OpenAI()
completion = client.chat.completions.create(
model="ft:gpt-4o-mini:my-org:custom_suffix:id",
messages=[
{
"role": "system",
"content": "Given a sports headline, provide the following fields in a JSON dict: player, team, sport, and gender"
},
{
"role": "user",
"content": "Richardson wins 100m at worlds to cap comeback"
}
]
)
print(completion.choices[0].message)
Expected response:
{
"player": "Sha'Carri Richardson",
"team": null,
"sport": "track and field",
"gender": "female"
}
Tool and function calling
OpenAI’s Chat Completion API supports calling external “functions” or “tools” to handle certain requests, such as looking up weather info or retrieving database records. Including a large definition for each tool/function in every prompt can waste tokens and sometimes cause hallucinations.
By fine-tuning on these tool-calling patterns, you can often:
- Save tokens by omitting or shortening full definitions in the prompt, while still getting correct calls.
- Improve consistency by teaching the model the precise JSON format (or key-value arguments) you want.
Example format
We have two approaches for specifying callables:
- Tools (the recommended approach going forward).
- Function calling (legacy approach; `function_call` and `functions` are deprecated).
Both approaches are nearly identical in concept. The main difference is naming:
- Tool calling examples use `"tool_calls": [...]` for the assistant’s response and a `"tools": [...]` array for definitions.
- Function calling examples use `"function_call": { ... }` for the assistant’s response and a `"functions": [...]` array for definitions.
If you want to maintain compatibility with older code, you can use the function calling style. If you’re starting fresh, we recommend using the Tool calling style.
Unifying the two examples
Below, we show the same example in both formats. Where they differ, we’ll point it out.
Tool calling format
{
"messages": [
{
"role": "user",
"content": "What is the weather in San Francisco?"
},
{
"role": "assistant",
"tool_calls": [
{
"id": "call_id",
"type": "function",
"function": {
"name": "get_current_weather",
"arguments": "{\"location\": \"San Francisco, USA\", \"format\": \"celsius\"}"
}
}
]
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Get the current weather",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and country, e.g. San Francisco, USA"
},
"format": { "type": "string", "enum": ["celsius", "fahrenheit"] }
},
"required": ["location", "format"]
}
}
}
]
}
Function calling format (legacy)
{
"messages": [
{
"role": "user",
"content": "What is the weather in San Francisco?"
},
{
"role": "assistant",
"function_call": {
"name": "get_current_weather",
"arguments": "{\"location\": \"San Francisco, USA\", \"format\": \"celsius\"}"
}
}
],
"functions": [
{
"name": "get_current_weather",
"description": "Get the current weather",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and country, e.g. San Francisco, USA"
},
"format": { "type": "string", "enum": ["celsius", "fahrenheit"] }
},
"required": ["location", "format"]
}
}
]
}
Tip
If you want to strictly minimize tokens, you could remove or shorten function/parameter descriptions in your training data. However, removing them may lower the model’s success rate for the correct arguments.
Including tool/function responses
Sometimes, you also want the model to incorporate the tool’s response. To teach this explicitly, include a final assistant message that interprets the tool’s output:
{
"messages": [
{
"role": "user",
"content": "What is the weather in San Francisco?"
},
{
"role": "assistant",
"tool_calls": [
{
"id": "call_id",
"type": "function",
"function": {
"name": "get_current_weather",
"arguments": "{\"location\": \"San Francisco, USA\", \"format\": \"celsius\"}"
}
}
]
},
{
"role": "tool",
"tool_call_id": "call_id",
"content": "21.0"
},
{
"role": "assistant",
"content": "It is 21°C in San Francisco, CA."
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Get the current weather",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string"
},
"format": { "type": "string", "enum": ["celsius", "fahrenheit"] }
},
"required": ["location", "format"]
}
}
}
]
}
(Replace "tool_calls"
and "tools"
with "function_call"
and "functions"
respectively if you prefer the older function calling syntax.)
(snip!)