Frustrations with OpenAI's Documentation: Errors, Repetition, and Unresolved Issues

I am frustrated with the OpenAI docs. For example, in the fine-tuning section, there are multiple types of mistakes.

The first example is this: “what the model should which” → this phrase is garbled and unclear. Additionally, the example given directly below it (intended to contain misspelled words for fine-tuning) ironically contains no misspellings. In fact, it is the same example reused from earlier in the documentation.

Another issue that makes the docs unnecessarily long and repetitive is the redundancy in the tool-calling and function-calling examples under the fine-tuning section. The explanations for both are pretty much copy-pasted. If the practical implementation is so similar that the guide can be duplicated, why not combine them into a single example? A simple note could highlight the slight differences due to factor X, while emphasizing that the overall process for both is 99.9% identical.

These are just a few examples from the fine-tuning section alone. There are plenty more issues I haven’t listed, as I can’t recall their exact locations. I have tried reporting these mistakes to whoever is responsible for the docs, yet nothing has changed—despite assurances that they would address them. This makes it even more frustrating.

So, please, OpenAI—I know you’re busy, but it’s incredibly frustrating for those of us trying to implement your technology when your documentation is full of errors, ranging from minor to major.


The sentence is grammatically correct if you parse it the intended way; punctuation and emphasis that reflect how it would be spoken make the parse clearer:

To begin, we create a sample set of messages, showing what the model should - which, in this case, is misspelled words.

Meaning: in the examples, we demonstrate the same action we expect the model to perform.


I ran the whole thing through o1 along with your concerns. It is a bit heavy-handed with its modifications, such as de-referencing (stripping links), and the full output is more than can be posted here.

AI improvements (incomplete)

Fine-tuning

Fine-tune models for better results and efficiency.

Fine-tuning lets you get more out of the models available through the API by providing:

  • Higher quality results than prompting
  • Ability to train on more examples than can fit in a prompt
  • Token savings due to shorter prompts
  • Lower latency requests

OpenAI’s text generation models have been pre-trained on a vast amount of text. To use the models effectively, we include instructions and sometimes several examples in a prompt. Using demonstrations to show how to perform a task is often called “few-shot learning.”

Fine-tuning improves on few-shot learning by training on many more examples than can fit in the prompt, letting you achieve better results on a wide number of tasks. Once a model has been fine-tuned, you won’t need to provide as many examples in the prompt. This saves costs and enables lower-latency requests.

At a high level, fine-tuning involves the following steps:

  1. Prepare and upload training data
  2. Train a new fine-tuned model
  3. Evaluate results and, if needed, return to step 1
  4. Use your fine-tuned model

Visit our pricing page to learn more about how fine-tuned model training and usage are billed.


Which models can be fine-tuned?

Fine-tuning is currently available for the following models:

  • gpt-4o-2024-08-06
  • gpt-4o-mini-2024-07-18
  • gpt-4-0613
  • gpt-3.5-turbo-0125
  • gpt-3.5-turbo-1106
  • gpt-3.5-turbo-0613

You can also fine-tune a fine-tuned model, which is useful if you acquire additional data and don’t want to repeat the previous training steps.

We expect gpt-4o-mini to be the right model for most users in terms of performance, cost, and ease of use.


When to use fine-tuning

Fine-tuning OpenAI text generation models can make them better for specific applications, but it requires a careful investment of time and effort. We recommend first attempting to get good results with prompt engineering, prompt chaining (breaking complex tasks into multiple prompts), and function calling, for these reasons:

  • Many tasks can be significantly improved with a better prompt, so fine-tuning might not be necessary.
  • Iterating over prompts is faster than iterating with fine-tuning, which requires creating datasets and running training jobs.
  • Even if you choose to fine-tune later, initial prompt engineering work provides a better baseline for your training data (and can be combined with fine-tuning).

Our prompt engineering guide covers some of the most effective ways to improve performance without fine-tuning. You can quickly iterate on prompts in our playground.


Common use cases

Some use cases where fine-tuning can improve results:

  • Setting a particular style, tone, format, or other qualitative aspect.
  • Improving reliability at producing a desired output.
  • Correcting failures to follow complex prompts.
  • Handling many edge cases in specific ways.
  • Performing a new skill or task that’s hard to articulate in a prompt.

A high-level way to think about these cases is when it’s easier to “show, not tell.” We will explore how to set up data for fine-tuning and cover examples where fine-tuning improves performance beyond what the baseline model can achieve.

Another scenario where fine-tuning is effective is reducing cost and/or latency by replacing a more expensive model like gpt-4o with a fine-tuned gpt-4o-mini model. If you can achieve good results with gpt-4o, you can often approach that same quality by fine-tuning gpt-4o-mini on the responses of the larger model, sometimes with a shorter instruction prompt.


Preparing your dataset

Once you decide fine-tuning is the right solution (i.e., you’ve optimized your prompt and identified issues the model still has), the next step is preparing your training data. Create a diverse set of demonstration conversations similar to what the model will see at inference time.

Each example in the dataset should be a conversation in the same format used by our Chat Completions API, specifically a list of messages where each message has a role and content (plus an optional name). Include examples that cover edge cases where the default model struggles, and pair them with the ideal assistant response you want.

Example format

Below is a simple example for training a chatbot to respond in a sarcastic style. Each line is a JSON object with a messages array:

{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}

Multi-turn chat examples

Fine-tuning supports multiple assistant messages within a single example. By default, all assistant messages will be used for training. If you want to exclude certain assistant messages from being learned, add "weight": 0 to those messages:

{"messages": [
  {"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."},
  {"role": "user", "content": "What's the capital of France?"},
  {"role": "assistant", "content": "Paris", "weight": 0},
  {"role": "user", "content": "Can you be more sarcastic?"},
  {"role": "assistant", "content": "Paris, as if everyone doesn't know that already.", "weight": 1}
]}

In this case, the model will only learn from the assistant response that includes sarcasm ("weight": 1), and ignore the plain “Paris” response.


Crafting prompts

We generally recommend taking the best instructions and prompts you’ve used with the model before fine-tuning and embedding them in every training example. This yields the best and most general results, especially if you have fewer than 100 training examples.

Shortening repeated instructions
If you shorten or remove instructions in your training examples (to save cost), remember that the model still “learns” those instructions as if they were always present. This can make it difficult to get the model to ignore those “baked-in” instructions later if you want to deviate from them at inference.

Example count recommendations

You must provide at least 10 examples. We typically see clear improvements with 50–100 training examples for gpt-4o-mini and gpt-3.5-turbo, but optimal numbers can vary greatly by task.

A good approach is to start with around 50 well-crafted demonstrations. If the model improves in the desired direction, you can add more data to further improve performance. If there’s no improvement, re-check your data strategy before scaling up.

Train and test splits

Split your dataset into training and test portions. When you submit a job with both training and test files, the system will display metrics for both. This is your initial signal for how well the model is learning. Maintaining a dedicated test set also helps you independently verify performance after training (e.g., using Evals).
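
For example, here is a minimal sketch of an 80/20 split (file names are illustrative):

import json
import random

# Load all examples, shuffle reproducibly, and hold out 20% as a test set.
with open("dataset.jsonl") as f:
    examples = [json.loads(line) for line in f]

random.seed(42)
random.shuffle(examples)
split = int(len(examples) * 0.8)

for path, subset in [("train.jsonl", examples[:split]), ("test.jsonl", examples[split:])]:
    with open(path, "w") as out:
        for example in subset:
            out.write(json.dumps(example) + "\n")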

Token limits

Token limits differ by model. Below is a summary of inference and training context lengths:

| Model | Inference context length | Training examples context length |
|---|---|---|
| gpt-4o-2024-08-06 | 128,000 tokens | 65,536 tokens (128k coming soon) |
| gpt-4o-mini-2024-07-18 | 128,000 tokens | 65,536 tokens (128k coming soon) |
| gpt-3.5-turbo-0125 | 16,385 tokens | 16,385 tokens |
| gpt-3.5-turbo-1106 | 16,385 tokens | 16,385 tokens |
| gpt-3.5-turbo-0613 | 16,385 tokens | 4,096 tokens |

Examples exceeding the maximum context will be truncated from the end. You can use our token counting notebook to verify token usage.
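
If you'd rather check locally than in the notebook, here is a rough sketch using the tiktoken library (gpt-4o models use the o200k_base encoding; this counts only string content and ignores the few per-message formatting tokens, so treat the totals as estimates):

import json
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("o200k_base")

with open("train.jsonl") as f:
    for i, line in enumerate(f, start=1):
        messages = json.loads(line)["messages"]
        total = sum(
            len(enc.encode(m["content"]))
            for m in messages
            if isinstance(m.get("content"), str)
        )
        if total > 65536:
            print(f"line {i}: ~{total} tokens, may be truncated during training")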

Estimate costs

For detailed pricing on training, input tokens, and output tokens for your fine-tuned model, see our pricing page.

You can estimate the cost of a single fine-tuning job using:

Total cost = (base training cost per 1M input tokens ÷ 1M) 
             × number of tokens in the input file 
             × number of epochs

Example
A 100,000-token file trained for 3 epochs:

  • ~$0.90 USD with gpt-4o-mini-2024-07-18 after the free period ends on October 31, 2024.
  • ~$2.40 USD with gpt-3.5-turbo-0125.
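
The same arithmetic in code (the per-million rates here are back-solved from the figures above; always confirm current rates on the pricing page):

def training_cost(rate_per_1m_tokens, file_tokens, epochs):
    return rate_per_1m_tokens / 1_000_000 * file_tokens * epochs

print(training_cost(3.00, 100_000, 3))  # gpt-4o-mini-2024-07-18: ~0.90
print(training_cost(8.00, 100_000, 3))  # gpt-3.5-turbo-0125: ~2.40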

Check data formatting

Before you create a fine-tuning job, use our validation script to catch potential errors, measure token counts, and estimate costs:

Fine-tuning data format validation
Learn about fine-tuning data formatting
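
If you want a quick local sanity check before running the full script, a minimal sketch like this catches the most common formatting mistakes (it is not the official validator):

import json

# Roles accepted in chat-format training examples
VALID_ROLES = {"system", "user", "assistant", "tool", "function"}

with open("train.jsonl") as f:
    for i, line in enumerate(f, start=1):
        try:
            example = json.loads(line)
        except json.JSONDecodeError as e:
            print(f"line {i}: invalid JSON ({e})")
            continue
        messages = example.get("messages")
        if not isinstance(messages, list) or len(messages) < 2:
            print(f"line {i}: 'messages' missing or too short")
            continue
        for m in messages:
            if m.get("role") not in VALID_ROLES:
                print(f"line {i}: unexpected role {m.get('role')!r}")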


Upload a training file

Once your data is validated, upload the file via the Files API. Then you can reference this file in your fine-tuning job creation request.
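
For example, in Python (the file name is illustrative):

from openai import OpenAI
client = OpenAI()

# Upload the validated JSONL; the returned file ID (e.g. "file-abc123")
# is what you pass as training_file when creating the job.
file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)
print(file.id)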

For a Direct Preference Optimization (DPO) job, the same upload flow applies; see the Preference fine-tuning section below for a complete job-creation example.

While the file is processing, you can still create a fine-tuning job; it simply will not start until file processing completes.

Size limits and large uploads

  • Max upload size for the Files API is 512 MB.
  • You can upload files up to 8 GB in multiple parts using the Uploads API.

We recommend starting with smaller datasets first. You don’t need huge amounts of data to see improvements.


Create a fine-tuned model

After checking your dataset’s structure and uploading it, create a fine-tuning job. You can do this via the fine-tuning UI or programmatically. Below are SDK examples in JavaScript and Python:

JavaScript:

import OpenAI from "openai";

const openai = new OpenAI();

const fineTune = await openai.fineTuning.jobs.create({
  training_file: 'file-abc123',
  model: 'gpt-4o-mini-2024-07-18'
});
Python:

from openai import OpenAI
client = OpenAI()

client.fine_tuning.jobs.create(
    training_file="file-abc123",
    model="gpt-4o-mini-2024-07-18"
)

Note
Only specific model snapshots (e.g., gpt-4o-mini-2024-07-18) can be fine-tuned, as listed in our supported models.

If you don’t specify a fine-tuning method, it defaults to Supervised Fine-Tuning (SFT).

To set additional parameters (e.g., validation_file, suffix, hyperparameters), see the fine-tuning create API docs.
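
For instance, a sketch with some of those optional parameters (the values are illustrative):

from openai import OpenAI
client = OpenAI()

client.fine_tuning.jobs.create(
    training_file="file-abc123",
    validation_file="file-def456",  # held-out set used for validation metrics
    model="gpt-4o-mini-2024-07-18",
    suffix="my-experiment",         # appears in the fine-tuned model's name
)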

It may take some time for your job to complete (minutes to hours, depending on queue and dataset size). The user who created the job will receive an email once training is finished.

Managing fine-tuning jobs

You can list existing jobs, retrieve job details, cancel a job, list job events, or delete a fine-tuned model:

JavaScript:

import OpenAI from "openai";

const openai = new OpenAI();

// List 10 fine-tuning jobs
let page = await openai.fineTuning.jobs.list({ limit: 10 });

// Retrieve the state of a fine-tune
let fineTune = await openai.fineTuning.jobs.retrieve('ftjob-abc123');

// Cancel a job
let status = await openai.fineTuning.jobs.cancel('ftjob-abc123');

// List up to 10 events from a fine-tuning job
let events = await openai.fineTuning.jobs.listEvents(fineTune.id, { limit: 10 });

// Delete a fine-tuned model
let model = await openai.models.delete('ft:gpt-3.5-turbo:acemeco:suffix:abc123');
Python:

from openai import OpenAI
client = OpenAI()

# List 10 fine-tuning jobs
client.fine_tuning.jobs.list(limit=10)

# Retrieve the state of a fine-tune
client.fine_tuning.jobs.retrieve("ftjob-abc123")

# Cancel a job
client.fine_tuning.jobs.cancel("ftjob-abc123")

# List up to 10 events from a fine-tuning job
client.fine_tuning.jobs.list_events(fine_tuning_job_id="ftjob-abc123", limit=10)

# Delete a fine-tuned model
client.models.delete("ft:gpt-3.5-turbo:acemeco:suffix:abc123")

Use a fine-tuned model

When a fine-tuning job succeeds, the fine_tuned_model field will be populated in the job details. Specify this model in the Chat Completions API or select it in the Playground.

It can take a few minutes for the model to fully load. If requests time out or the model name is not found, wait briefly and retry.

from openai import OpenAI
client = OpenAI()

completion = client.chat.completions.create(
    model="ft:gpt-4o-mini:my-org:custom_suffix:id",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ]
)

print(completion.choices[0].message)

You can begin making requests to your fine-tuned model just like any other model.


Use a checkpointed model

OpenAI automatically creates one checkpoint per epoch during fine-tuning. These checkpoints are also valid models you can use.

  1. Wait until a job succeeds (retrieve the job status).
  2. List checkpoints to find the model checkpoint names.

Each checkpoint includes the step_number and metrics at that point in training. Currently, only the last three epochs’ checkpoints are saved.
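
A sketch of step 2 in Python (the job ID is illustrative; each checkpoint object carries the checkpoint model name alongside its metrics):

from openai import OpenAI
client = OpenAI()

checkpoints = client.fine_tuning.jobs.checkpoints.list("ftjob-abc123")
for cp in checkpoints.data:
    # fine_tuned_model_checkpoint is a model name usable in Chat Completions
    print(cp.step_number, cp.fine_tuned_model_checkpoint)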


Analyzing your fine-tuned model

We provide these training metrics:

  • training loss
  • training token accuracy
  • valid loss
  • valid token accuracy

valid loss and valid token accuracy are computed in two ways: on a small batch periodically and on the full validation set at the end of each epoch. The full validation metrics are typically more accurate.

To see metrics while training, you can look at the job’s event objects. After completion, you can see final metrics by retrieving the result_files from the finished job, then downloading the CSV.
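
A minimal sketch of that retrieval flow (the job ID is illustrative):

from openai import OpenAI
client = OpenAI()

job = client.fine_tuning.jobs.retrieve("ftjob-abc123")
for file_id in job.result_files:
    # Download the metrics CSV attached to the finished job
    content = client.files.content(file_id)
    with open("metrics.csv", "wb") as f:
        f.write(content.read())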

Iterating on data quality

If results are disappointing:

  • Add targeted examples for areas where performance is lacking.
  • Check existing examples for grammar, logic, and style issues.
  • Ensure data balance matches expected real-world usage (e.g., not too many refusal messages if refusals are rare in practice).
  • Confirm each example contains all necessary context for the desired response.
  • Resolve inconsistencies (e.g., multiple annotators might produce conflicting responses for the same input).
  • Ensure consistent formatting across all training examples.

Iterating on data quantity

Once you’re happy with data quality and distribution, you can add more examples. This typically improves performance further, especially around edge cases. To estimate benefits from scaling, try fine-tuning on half your dataset vs. the full dataset and observe the difference.

Iterating on hyperparameters

You can customize:

  • epochs
  • learning rate multiplier
  • batch size

We recommend using defaults first. Then adjust:

  • Increase epochs (e.g., +1 or +2) if the model still underfits (e.g., doesn’t follow training data well for tasks with a single correct answer).
  • Decrease epochs if the model becomes overly repetitive or conservative (common in creative tasks).
  • Increase the learning rate multiplier if the model does not converge at all.

Setting hyperparameters

JavaScript:

import OpenAI from "openai";

const openai = new OpenAI();

const fineTune = await openai.fineTuning.jobs.create({
  training_file: "file-abc123",
  model: "gpt-4o-mini-2024-07-18",
  method: {
    type: "supervised",
    supervised: {
      hyperparameters: { n_epochs: 2 },
    },
  },
});
Python:

from openai import OpenAI
client = OpenAI()

client.fine_tuning.jobs.create(
    training_file="file-abc123",
    model="gpt-4o-mini-2024-07-18",
    method={
        "type": "supervised",
        "supervised": {
            "hyperparameters": {"n_epochs": 2},
        },
    },
)

Vision fine-tuning

Fine-tuning is also possible with images in your JSONL files. Just as you can send one or many image inputs to chat completions, you can include those same message types within your training data. Images can be provided either as HTTP URLs or as data URLs containing base64-encoded images.

Below is an example snippet:

{
  "messages": [
    {
      "role": "system",
      "content": "You are an assistant that identifies uncommon cheeses."
    },
    {
      "role": "user",
      "content": "What is this cheese?"
    },
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://upload.wikimedia.org/wikipedia/commons/3/36/Danbo_Cheese.jpg"
          }
        }
      ]
    },
    {
      "role": "assistant",
      "content": "Danbo"
    }
  ]
}

Image dataset requirements

  • Max 50,000 examples containing images (text-only examples don’t count toward this limit).
  • Each example can have at most 10 images, each up to 10 MB in size.
  • Allowed image formats: JPEG, PNG, WEBP (RGB or RGBA).
  • No images can appear in an assistant role.

Any image with disallowed content or containing people, faces, children, or CAPTCHAs will be skipped.

Reducing training cost

Set "detail": "low" in image_url to reduce each image to 512×512 internally for training—represented by only 85 tokens. See Vision docs for details.


Preference fine-tuning

Direct Preference Optimization (DPO) fine-tuning allows you to train on pairs of responses labeled as “preferred” or “non-preferred.” This helps the model learn from human preferences and produce more favored outputs. (Text-only DPO is currently supported.)

Preparing your dataset for DPO

Each JSONL line should contain:

  • input — typically a user prompt with optional system messages or additional metadata.
  • preferred_output — the “better” response.
  • non_preferred_output — the suboptimal response.

An example line:

{
  "input": {
    "messages": [
      {
        "role": "user",
        "content": "Hello, can you tell me how cold San Francisco is today?"
      }
    ],
    "tools": [],
    "parallel_tool_calls": true
  },
  "preferred_output": [
    {
      "role": "assistant",
      "content": "Today in San Francisco, it is not that cold. High near 68°F (20°C), low around 57°F (14°C)."
    }
  ],
  "non_preferred_output": [
    {
      "role": "assistant",
      "content": "It is not particularly cold in San Francisco today."
    }
  ]
}

Currently, we only train on single-turn user→assistant interactions. The last assistant message must represent the preferred or non-preferred output.

Stacking methods: SFT + DPO

You can first do Supervised Fine-Tuning (SFT) on your preferred responses, then do a DPO job to further optimize the model for human preferences. This two-step approach can strengthen alignment.

Configuring a DPO fine-tuning job

Use the method field in your request with "type": "dpo":

JavaScript:

import OpenAI from "openai";

const openai = new OpenAI();

const job = await openai.fineTuning.jobs.create({
  training_file: "file-all-about-the-weather",
  model: "gpt-4o-2024-08-06",
  method: {
    type: "dpo",
    dpo: {
      hyperparameters: { beta: 0.1 },
    },
  },
});
Python:

from openai import OpenAI

client = OpenAI()

job = client.fine_tuning.jobs.create(
    training_file="file-all-about-the-weather",
    model="gpt-4o-2024-08-06",
    method={
        "type": "dpo",
        "dpo": {
            "hyperparameters": {"beta": 0.1},
        },
    },
)

beta controls how strictly the new model will stick to old behavior (higher = more conservative) vs. adopting new preferences (lower = more aggressive). The default is "auto".


Fine-tuning examples

Now let’s walk through some example workflows end-to-end.

Style and tone

Suppose you want your model to adopt a specific style or tone—say, sarcastic. You’ve tried prompt engineering, but you want more consistency with fewer tokens. Below is a short example set (remember you need at least 10 total for an actual job).

{"messages": [
  {"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."},
  {"role": "user", "content": "What's the capital of France?"},
  {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}
]}
{"messages": [
  {"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."},
  {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"},
  {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}
]}
{"messages": [
  {"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."},
  {"role": "user", "content": "How far is the Moon from Earth?"},
  {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}
]}

Optionally, you can introduce intentional errors or misspellings in your dataset if your goal is, for example, teaching the model to handle or correct them. For instance:

{"messages": [
  {"role": "system", "content": "Marv is a chatbot that corrects commonly missspelled words and responds with mild sarcasm."},
  {"role": "user", "content": "Who roote 'Romeo and Juliet'?"},
  {"role": "assistant", "content": "Oh, you mean who *wrote* 'Romeo and Juliet'? That was William Shakespeare. Try to keep up."}
]}

(Notice the intentionally misspelled "missspelled" and "roote" to demonstrate how the model should correct them.)

Once your JSONL is ready and validated, upload it and start a fine-tuning job:

from openai import OpenAI
client = OpenAI()

file = client.files.create(
  file=open("marv.jsonl", "rb"),
  purpose="fine-tune"
)

client.fine_tuning.jobs.create(
  training_file=file.id,
  model="gpt-4o-mini-2024-07-18"
)

When the job finishes, you can use your fine-tuned model.

Structured output

Another use case is generating structured JSON from unstructured text—for example, extracting sports data from headlines. Each line in your JSONL could look like this:

{"messages": [
  {
    "role": "system",
    "content": "Given a sports headline, provide the following fields in a JSON dict: 'player' (full name), 'team', 'sport', 'gender' (if known)."
  },
  {
    "role": "user",
    "content": "Sources: Colts grant RB Taylor OK to seek trade"
  },
  {
    "role": "assistant",
    "content": "{\"player\": \"Jonathan Taylor\", \"team\": \"Colts\", \"sport\": \"football\", \"gender\": \"male\"}"
  }
]}
{"messages": [
  {
    "role": "system",
    "content": "Given a sports headline, provide the following fields in a JSON dict: 'player' (full name), 'team', 'sport', 'gender' (if known)."
  },
  {
    "role": "user",
    "content": "OSU 'split down middle' on starting QB battle"
  },
  {
    "role": "assistant",
    "content": "{\"player\": null, \"team\": \"OSU\", \"sport\": \"football\", \"gender\": null}"
  }
]}

Upload and fine-tune just as above. Then you can query it:

completion = client.chat.completions.create(
  model="ft:gpt-4o-mini:my-org:custom_suffix:id",
  messages=[
    {
      "role": "system",
      "content": "Given a sports headline, provide the following fields in a JSON dict: player, team, sport, and gender"
    },
    {
      "role": "user",
      "content": "Richardson wins 100m at worlds to cap comeback"
    }
  ]
)

print(completion.choices[0].message)

Expected response:

{
  "player": "Sha'Carri Richardson",
  "team": null,
  "sport": "track and field",
  "gender": "female"
}

Tool and function calling

OpenAI’s Chat Completions API supports calling external “functions” or “tools” to handle certain requests, such as looking up weather info or retrieving database records. Including a large definition for each tool/function in every prompt can waste tokens and sometimes cause hallucinations.

By fine-tuning on these tool-calling patterns, you can often:

  • Save tokens by omitting or shortening full definitions in the prompt, while still getting correct calls.
  • Improve consistency by teaching the model the precise JSON format (or key-value arguments) you want.

Example format

We have two approaches for specifying callables:

  1. Tools (the recommended approach going forward).
  2. Function calling (legacy approach; function_call & functions are deprecated).

Both approaches are nearly identical in concept. The main difference is naming:

  • Tool calling examples use "tool_calls": [...] for the assistant’s response and a "tools": [...] array for definitions.
  • Function calling examples use "function_call": { ... } for the assistant’s response and a "functions": [...] array for definitions.

If you want to maintain compatibility with older code, you can use the function calling style. If you’re starting fresh, we recommend using the Tool calling style.

Unifying the two examples

Below, we show one combined example. Where they differ, we’ll point it out.

Tool calling format

{
  "messages": [
    {
      "role": "user",
      "content": "What is the weather in San Francisco?"
    },
    {
      "role": "assistant",
      "tool_calls": [
        {
          "id": "call_id",
          "type": "function",
          "function": {
            "name": "get_current_weather",
            "arguments": "{\"location\": \"San Francisco, USA\", \"format\": \"celsius\"}"
          }
        }
      ]
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and country, e.g. San Francisco, USA"
            },
            "format": { "type": "string", "enum": ["celsius", "fahrenheit"] }
          },
          "required": ["location", "format"]
        }
      }
    }
  ]
}

Function calling format (legacy)

{
  "messages": [
    {
      "role": "user",
      "content": "What is the weather in San Francisco?"
    },
    {
      "role": "assistant",
      "function_call": {
        "name": "get_current_weather",
        "arguments": "{\"location\": \"San Francisco, USA\", \"format\": \"celsius\"}"
      }
    }
  ],
  "functions": [
    {
      "name": "get_current_weather",
      "description": "Get the current weather",
      "parameters": {
        "type": "object",
        "properties": {
          "location": {
            "type": "string",
            "description": "The city and country, e.g. San Francisco, USA"
          },
          "format": { "type": "string", "enum": ["celsius", "fahrenheit"] }
        },
        "required": ["location", "format"]
      }
    }
  ]
}

Tip
If you want to strictly minimize tokens, you could remove or shorten function/parameter descriptions in your training data. However, removing them may lower the model’s success rate for the correct arguments.
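
For example, a trimmed-down version of the weather tool definition above, with the descriptions removed (test whether call accuracy holds before committing to this):

"tools": [
  {
    "type": "function",
    "function": {
      "name": "get_current_weather",
      "parameters": {
        "type": "object",
        "properties": {
          "location": { "type": "string" },
          "format": { "type": "string", "enum": ["celsius", "fahrenheit"] }
        },
        "required": ["location", "format"]
      }
    }
  }
]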

Including tool/function responses

Sometimes, you also want the model to incorporate the tool’s response. To teach this explicitly, include a final assistant message that interprets the tool’s output:

{
  "messages": [
    {
      "role": "user",
      "content": "What is the weather in San Francisco?"
    },
    {
      "role": "assistant",
      "tool_calls": [
        {
          "id": "call_id",
          "type": "function",
          "function": {
            "name": "get_current_weather",
            "arguments": "{\"location\": \"San Francisco, USA\", \"format\": \"celsius\"}"
          }
        }
      ]
    },
    {
      "role": "tool",
      "tool_call_id": "call_id",
      "content": "21.0"
    },
    {
      "role": "assistant",
      "content": "It is 21°C in San Francisco, CA."
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string"
            },
            "format": { "type": "string", "enum": ["celsius", "fahrenheit"] }
          },
          "required": ["location", "format"]
        }
      }
    }
  ]
}

(Replace "tool_calls" and "tools" with "function_call" and "functions" respectively if you prefer the older function calling syntax.)

(snip!)