Bad results when using fine-tuned model with function calling

Before function calling was available for fine-tuned gpt-3.5-turbo models, I fine-tuned a model to respond with a specific persona, and it was working very well. Then I tried to use it with function calling. The function calling works fine, but the response messages were terrible (duplicated responses, forgetting previously supplied info…)
Things I tried so far

  • fine-tune again with the function description included in each sample, as shown in the documentation
  • fine-tune the already fine-tuned model on a smaller set of samples that include the function description

Neither worked. I’m still getting bad results. Is there something I am missing?

3 Likes

Update: I used the example dataset and function from the documentation and got similarly duplicated responses.

1 Like

I’m facing the same issue: including functions in a ChatCompletion call makes the response repetitive and terrible, as you said.
I tried different scenarios, like fine-tuning on a dataset that contains functions and on a dataset with no functions; it made no difference.

Given that you’re facing the exact same issue, this is most likely a bug?

2 Likes

When I wanted to add function calling to my fine-tuned model today, I also ran into this. Without it, the model responds perfectly. But with function calling added, it writes all the “non-function” answers twice. For example:
message: {role: assistant, content: 'sample text. \n' + 'sample text'}. The responses that do call a function work as they should.

1 Like

The single example is kind of nonsense for fine-tuning on function calling. I have a feeling just as much attention was paid to implementing the fine-tune endpoint for training on functions as to writing that example: nearly none.

“Get similarly formatted responses even when the full function definition isn’t present.” Oh, really, no need to include a function to get an endpoint that will emit them? BS.

This is me including a dummy function called “disabled” with my fine-tuned model, and then injecting a text specification of a google_for_answers function directly into the system prompt:

{
  "message": {
    "role": "assistant",
    "content": null,
    "function_call": {
      "name": "google_for_answers",
      "arguments": "{\n  \"query\": \"2023 Oscar winners\"\n}"
    }
  },
  "finish_reason": "function_call"
}
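For context, the request that produced the above looked roughly like this. It is only a sketch: the spec wording, the system prompt, and the model name are placeholders of mine, not anything official.

import openai

# My own plain-text description of the function, pasted into the system prompt.
# The wording here is improvised; it is not an official format.
google_spec = (
    "You can call the function google_for_answers.\n"
    "It takes one string argument, query: the web search to run.\n"
    "Call it whenever you need current information."
)

response = openai.ChatCompletion.create(
    model="ft:gpt-3.5-turbo-0613:myorg::placeholder",  # placeholder fine-tune name
    messages=[
        {"role": "system",
         "content": "You are my fine-tuned assistant.\n\n" + google_spec},
        {"role": "user", "content": "Who won the 2023 Oscars?"},
    ],
    # A dummy function, kept only so the endpoint still emits function-call output.
    functions=[{
        "name": "disabled",
        "description": "Do not call this function.",
        "parameters": {"type": "object", "properties": {}},
    }],
)
print(response["choices"][0]["message"])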

But what if I simply don’t include the dummy function? My fine-tuned model then no longer has any knowledge of what a function is:

"message": {
  "role": "assistant",
  "content": "I'm sorry, but as an AI language model, I don't have real-time information or the ability to browse the internet. My training only goes up until September 2021, so I don't have access to the list of Oscar winners for 2023. I recommend checking reliable news sources or the official Oscars website for the most up-to-date information on the 2023 Oscar winners."
},
"finish_reason": "stop"

Have they trained two different models based on my fine-tune, one that won’t obey or understand the function specification?

So I can fine-tune on function calling and pay 8x the price for what? To shave a few tokens off the function description?


Then there is the single example given in the quickstart:

Format your examples as shown, with each line including a list of “messages” and an optional list of “functions”:

So the list of functions is optional when fine-tuning for function calling? Then you are guaranteed to have a different system prompt when you actually use that model (because the function specification is in fact mandatory at inference time), and thus a lower-quality match between the model’s “identity” and your tune.

Then of course they disavow every sales point made: “If your goal is to maximize the correctness of the function calling output, we recommend using the same function definitions for both training and querying the fine-tuned model.”
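For reference, a full training line with both keys ends up looking something like this. The dialogue and function definition here are just illustrative, following the chat-format JSONL from the docs:

{"messages": [{"role": "system", "content": "You are a helpful weather bot."}, {"role": "user", "content": "What's the weather in Boston?"}, {"role": "assistant", "function_call": {"name": "get_current_weather", "arguments": "{\"location\": \"Boston, USA\", \"format\": \"celsius\"}"}}, {"role": "function", "name": "get_current_weather", "content": "{\"temperature\": 12, \"unit\": \"celsius\"}"}, {"role": "assistant", "content": "It is currently 12 degrees Celsius in Boston."}], "functions": [{"name": "get_current_weather", "description": "Get the current weather in a given location", "parameters": {"type": "object", "properties": {"location": {"type": "string", "description": "The city and country, e.g. Boston, USA"}, "format": {"type": "string", "enum": ["celsius", "fahrenheit"]}}, "required": ["location"]}}]}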

Then they have you train the AI like this on how the response “should be seen” in chat history, using a special format:

{"role": "assistant", "function_call": {"name": "get_current_weather", "arguments": "{\"location\": \"San Francisco, USA\", \"format\": \"celcius\"}"}}

The only thing is: they continue to obfuscate the calling language and tokens, so YOU can’t actually put this into chat history the same way as the special notation, which further degrades the fine-tune’s function following. Fine-tune that way, and your conversation history can’t replicate it to trigger your AI’s response to a function return. Do it your own way, ignoring their fake function-call JSON, and you’ve trained an AI that can’t call functions the right way, because you filled its fine-tune with the wrong calling method.

Just piss-poor. Give us the actual AI language in and out of functions. Give us the special tokens, and let us choose when to block them or insert them. Give us endpoints where we can turn the function return on with a boolean, or instead see what the AI is producing. Let me put every single token into my ChatML container and fill the raw context (instead of you screwing up 0301 with new garbageML). Give us models that you don’t continue to screw with daily.

1 Like

@logankilpatrick this needs some attention: fine-tuned models with function calling are unusable due to these issues. I’ve tried to raise this bug a couple of times.

2 Likes

I’m having the exact same problem. Waited months for function-call fine-tuning to be available, only to see such poor results.

At first, I just thought I had too little data, so I added more. And more. I tripled the amount of training data, and the results are the same: the same issues keep happening, even with a low validation loss. When I try some of the examples provided in the training or validation files, the result is buggy (duplicate completions, incorrect function calls, etc.). It seems like the fine-tuning just makes everything worse.

I hope this issue gets the attention it deserves.

4 Likes

Hey folks. I’m about to show how to properly insert your function return back to the AI - something far too useful for OpenAI to document properly.

chat_completion_parameters = {
    "model": "gpt-3.5-turbo",
    "top_p": 0.5,
    "messages": [
        {"role": "system", "content": "You are MegaBot, my fine-tune AI identity."},
        # chat history goes here
        {"role": "user", "content": "Who won the 2024 election?"},
        {"role": "assistant", "content": assistant_content_if_exist,
         "function_call": {
             "name": called_function_name,
             "arguments": called_function_args_json,
         }},
        {"role": "function",
         "name": called_function_name,
         "content": "Rudy Giuliani! LOL."},
    ],
}

When the AI emitted a finish reason of “function_call”, it gave you:

  • a function name: called_function_name above
  • function arguments: called_function_args_json
  • possibly some text it said first: assistant_content_if_exist

The function arguments might not be valid JSON, but giving the AI back its own wrong output is part of iterative correction.
The AI might not have prefaced the function call with chat, but it can; sending null or "" for the content is also OK.

This might help your chatbot, since you didn’t give it hundreds or thousands of fine-tune call examples like OpenAI did. The chat history you provide should be in the same format that you fine-tuned your function return on, per the single tutorial.

The entire chat history used before should be passed again in the second API call, along with the function return and the function definition, so the AI knows what it called and what it might call again.

You should keep including the assistant/function roles losslessly on each additional turn until the AI is done with its function usage and only answers the user.
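To make the flow concrete, here is a minimal sketch of the whole round trip, assuming the 0.28-era openai-python library. The run_my_function helper and the model name are placeholders of mine, not anything from the API.

import json
import openai

def run_my_function(name, arguments_json):
    # Placeholder: dispatch to your own code; here we just echo the call back.
    return json.dumps({"function": name, "received": arguments_json})

functions = [{
    "name": "google_for_answers",
    "description": "Search the web for current information.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]

messages = [
    {"role": "system", "content": "You are MegaBot, my fine-tune AI identity."},
    {"role": "user", "content": "Who won the 2024 election?"},
]

while True:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # or your ft: model name
        top_p=0.5,
        messages=messages,
        functions=functions,
    )
    choice = response["choices"][0]
    message = choice["message"]

    # Put the assistant turn back into history as plain dicts, exactly as emitted.
    assistant_turn = {"role": "assistant", "content": message.get("content")}
    if message.get("function_call"):
        assistant_turn["function_call"] = {
            "name": message["function_call"]["name"],
            "arguments": message["function_call"]["arguments"],
        }
    messages.append(assistant_turn)

    if choice["finish_reason"] != "function_call":
        break  # the AI answered the user directly; we're done

    # Run the function and hand the result back with role "function".
    result = run_my_function(message["function_call"]["name"],
                             message["function_call"]["arguments"])
    messages.append({"role": "function",
                     "name": message["function_call"]["name"],
                     "content": result})

print(messages[-1]["content"])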

2 Likes

Thanks for your investigation!

But I don’t know if that’s really the root of the problem. I’ve been doing exactly that since the beginning, because I’m fine-tuning on real-world chats (I copy the full chat history, including the assistant’s function_call and the role: function message with name and content, and then edit them to contain the expected values). No luck yet; I’ve had odd results (including the “duplicated completion”) since my first fine-tune using functions.

This is pretty much what the quick start gave us.

It does not address the problems with fine-tuned models and functions enabled.

I address that too.

However, I can’t fix the gap between the elevated expectations and the poor results of fine-tuning.

I thought it might be helpful for you to have a working example (or not working, as the case may be). Here is a working demonstration of the problem with the following python script and my tuned model:


import openai
import config
import mysecrets

# Set your OpenAI API key and organization (if applicable)
openai.api_key = mysecrets.OPENAI_API_KEY
openai.organization = config.OPENAI_ORG

chatParams = {
    "model": "ft:gpt-3.5-turbo-0613:artist::8LhakJy8",
    "temperature": 0.7,
    "messages": [
        {"role": "assistant", "content": "What can I help you with today?"},
        {"role": "user", "content": "yo"},
    ]
}

print("ChatCompletion results with fine-tuned model:")
print(openai.ChatCompletion.create(**chatParams))

chatParams["functions"] = [
    {
      "name": "get_current_weather",
      "description": "Get the current weather in a given location",
      "parameters": {
        "type": "object",
        "properties": {
          "location": {
            "type": "string",
            "description": "The city and state, e.g. San Francisco, CA"
          },
          "unit": {
            "type": "string",
            "enum": ["celsius", "fahrenheit"]
          }
        },
        "required": ["location"]
      }
    }
  ]

print("ChatCompletion results with fine-tuned model and functions:")
print(openai.ChatCompletion.create(**chatParams))

The results exemplify the problem:

% py broken-model.py
ChatCompletion results with fine-tuned model:
{
  "id": "chatcmpl-8LkKYXeeCoQXpz4czYbzSl7BxD3Lg",
  "object": "chat.completion",
  "created": 1700193674,
  "model": "ft:gpt-3.5-turbo-0613:artist::8LhakJy8",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I assist you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 9,
    "total_tokens": 29
  }
}
ChatCompletion results with fine-tuned model and functions:
{
  "id": "chatcmpl-8LkKZDhXiyfe5yTCKiPjKWrZCfFQo",
  "object": "chat.completion",
  "created": 1700193675,
  "model": "ft:gpt-3.5-turbo-0613:artist::8LhakJy8",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I assist you today?\nHello! How can I assist you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 87,
    "completion_tokens": 20,
    "total_tokens": 107
  }
}

Interestingly, I ran it 10 times and the call with functions only doubled the response twice. In my app (with functions included), it gave me a doubled response every single time.

In the above case I used a fine-tuned gpt-3.5-turbo-0613 model, but the error is also apparent with a gpt-3.5-turbo-1106 fine-tune:

ChatCompletion results with fine-tuned model and functions:
{
  "id": "chatcmpl-8LkTJGnZw117M4xWnINFfHBsl5D40",
  "object": "chat.completion",
  "created": 1700194217,
  "model": "ft:gpt-3.5-turbo-1106:artist::8KAhri96",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I assist you today?\nHi! How can I help you?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 87,
    "completion_tokens": 19,
    "total_tokens": 106
  }
}

In another thread, sardararslan033 suggested using stop sequences as a clever workaround for the duplicated-response bug:

response = openai.ChatCompletion.create(
    model = "ft:gpt-3.5-turbo-0613:artist::8M1uARRP",
    temperature = 0.7,
    top_p = 1.0,
    frequency_penalty = 2.0,
    presence_penalty = 0.0,
    stream = False,
    stop = ['####'],
    messages = [...],
    functions = [{
        'name': 'lookup_person',
        'description': 'Get information about a person mentioned in the prompt for the first time.',
        'parameters': {
            'type': 'object',
            'properties': {
                'name': {'type': 'string', 'description': 'The name of the person to look up, e.g. Benzy'}
            }
        }
    }]
)

Fancy! It works.

This means that you will have to add the stop marker to all of your training data, which I did programmatically.
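For anyone who wants to do the same, the post-processing amounted to roughly this. The file names are placeholders, and I’m assuming the chat-format JSONL with a "messages" list per line:

import json

STOP_MARKER = "####"

# Append the stop marker to every plain assistant reply in the training file.
with open("training.jsonl") as src, open("training_with_stop.jsonl", "w") as dst:
    for line in src:
        example = json.loads(line)
        for msg in example["messages"]:
            # Leave function_call turns alone; they have no text content to mark.
            if msg.get("role") == "assistant" and msg.get("content"):
                msg["content"] = msg["content"].rstrip() + " " + STOP_MARKER
        dst.write(json.dumps(example, ensure_ascii=False) + "\n")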

When I call openai.ChatCompletion.create after attaching a function_call to a message (currently using openai version 0.28.0, Python), I get the following error:

{'code': 1003, 'message': 'JSON parse error: Cannot construct instance of `com.azure.ai.openai.models.FunctionCall` (although at least one Creator exists): no String-argument constructor/factory method to deserialize from String value (\'{"name": "generate_new_task", "arguments": "{\\n  \\"task_question_id\\": \\"Q0\\"\\n}"}\'); nested exception is com.fasterxml.jackson.databind.exc.MismatchedInputException: Cannot construct instance of `com.azure.ai.openai.models.FunctionCall` (although at least one Creator exists): no String-argument constructor/factory method to deserialize from String value (\'{"name": "generate_new_task", "arguments": "{\\n  \\"task_question_id\\": \\"Q0\\"\\n}"}\')\n at [Source: (PushbackInputStream); line: 1, column: 10248] (through reference chain: com.azure.ai.openai.models.ChatCompletionsOptions["messages"]->java.util.ArrayList[2]->com.azure.ai.openai.models.ChatMessage["function_call"])'}

The method works on openai 0.28.1 and should also work on newer library versions. I can’t speak for its implementation against Azure’s schema, but you can use the latest supported 2023-09-01-preview API version in your Azure OpenAI call.

What I demonstrate is the actual JSON/dictionary variable sent to the API; it needs to be translated into python library parameters, or sent directly as a byte string in an HTTPS request.

To use this type of dictionary data, you’d send it as:

response = openai.ChatCompletion.create(**chat_completion_parameters)

The ** double-asterisk operator unpacks a dictionary into keyword arguments for the function call.
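And for the raw-request route, the same dictionary can be posted directly. A minimal sketch using the requests library, with the API key taken from an environment variable:

import json
import os
import requests

# chat_completion_parameters is the same dictionary shown above.
headers = {
    "Authorization": "Bearer " + os.environ["OPENAI_API_KEY"],
    "Content-Type": "application/json",
}
resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers=headers,
    data=json.dumps(chat_completion_parameters).encode("utf-8"),
)
print(resp.json()["choices"][0]["message"])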

Thanks for reaching out. I found the problem: the call with openai-python only works when function_call is a dict, not a string!
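In other words, something like this (names copied from the error above; only the "arguments" value stays a JSON-encoded string, while function_call itself must be a dict):

import json

# Wrong: function_call passed as one JSON string (what triggered the Azure deserialization error).
bad_message = {
    "role": "assistant",
    "content": None,
    "function_call": '{"name": "generate_new_task", "arguments": "{\\"task_question_id\\": \\"Q0\\"}"}',
}

# Right: function_call as a dict; only its "arguments" field is a JSON-encoded string.
good_message = {
    "role": "assistant",
    "content": None,
    "function_call": {
        "name": "generate_new_task",
        "arguments": json.dumps({"task_question_id": "Q0"}),
    },
}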