Model Tuning and Re-Tuning Problems

:rocket: Working on Fine-Tuning LLMs – Need Some Expert Advice! :robot:

I’ve been experimenting with model training using both general user data and tool-specific data from our platform.
Initially, I combined (shuffled) both types of data for fine-tuning. Later, I shifted to a phased learning approach:

:right_arrow: Phase 1: Fine-tuned the model with general platform data.
:right_arrow: Phase 2: Re-tuned it with only tool-related data using adapters from Phase 1.
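
A minimal sketch of the Phase 2 step described above, assuming LoRA adapters trained with the peft library (the model id and adapter paths below are placeholders):

from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model, then resume the adapters produced in Phase 1.
base = AutoModelForCausalLM.from_pretrained("your-base-model-id")
model = PeftModel.from_pretrained(base, "phase1-adapter", is_trainable=True)

# ... continue training on the tool-only dataset (e.g. with trl's SFTTrainer),
# then save the updated adapters as the Phase 2 checkpoint.
model.save_pretrained("phase2-adapter")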

Despite this structured approach, I’m still facing a few recurring issues:

:one: Why does the model randomly return the system prompt (i.e., the one provided in the system role) in its replies?
:two: Why does it ask for tool details even during general greetings or unrelated platform queries?
:three: Why does it miss some required fields when constructing tool calls?
:four: Why does it invent new tool parameters not defined in the schema?
:five: Why doesn’t it ask for all required fields in plain text consistently?
:six: After a tool call, if the same query is asked again, why does it repeat its previous answer instead of asking for the required fields again?

:magnifying_glass_tilted_left: If anyone has tackled similar issues while fine-tuning LLMs (e.g., using LoRA adapters or phased training), I’d love to hear your thoughts or tips!

:speech_balloon: Feel free to comment or DM—any insights are truly appreciated. Thanks in advance!

Usually, the advice is to fine-tune one model with a diverse dataset that represents how it’ll be used in production.

So, if you expect the model to refuse 3% of requests, then 3% of your dataset should be refusals. If you expect it to call a specific tool 30% of the time, then 30% of your examples should call that tool, and so on.
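
For example, here's a rough Python sketch of building one mixed dataset that matches those proportions (the file names and category pools are hypothetical):

import json
import random

def load_examples(path):
    # One JSON conversation per line (JSONL).
    with open(path) as f:
        return [json.loads(line) for line in f]

# Hypothetical pools of examples, one per category; each pool must be
# at least as large as the number sampled from it.
refusals   = load_examples("refusals.jsonl")
tool_calls = load_examples("tool_calls.jsonl")
general    = load_examples("general_chat.jsonl")

TARGET    = 1000                     # total size of the training set
n_refuse  = int(0.03 * TARGET)       # ~3% refusals
n_tools   = int(0.30 * TARGET)       # ~30% tool calls
n_general = TARGET - n_refuse - n_tools

mixed = (
    random.sample(refusals, n_refuse)
    + random.sample(tool_calls, n_tools)
    + random.sample(general, n_general)
)
random.shuffle(mixed)                # interleave categories instead of phasing them

with open("train.jsonl", "w") as f:
    for conversation in mixed:
        f.write(json.dumps({"messages": conversation}) + "\n")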

You can use the batches API to synthetically generate a large number of examples once you find a prompt that yields good data.
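
A minimal sketch of that workflow, assuming the official openai Python client; the model name and generator prompt are placeholders for whatever you've already validated by hand:

import json
from openai import OpenAI

client = OpenAI()

generator_prompt = "Write one realistic support conversation that ends in a valid tool call."

# The Batch API takes a JSONL file with one request per line.
with open("batch_requests.jsonl", "w") as f:
    for i in range(500):
        request = {
            "custom_id": f"synthetic-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o",
                "messages": [{"role": "user", "content": generator_prompt}],
                "temperature": 1.0,  # vary the outputs across requests
            },
        }
        f.write(json.dumps(request) + "\n")

batch_file = client.files.create(file=open("batch_requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id)  # poll client.batches.retrieve(batch.id) until status == "completed"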

The model should be learning to act like your dataset. If that isn't happening at all, the model might be overfitted. What does the training loss graph in your dashboard look like?

Are you including your schema in both fine-tuning and inference?
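
For context, OpenAI's chat fine-tuning format lets each JSONL training line carry the same tools array you pass at inference, so the schema is visible in both places. One training line might look like this (a hypothetical createThing tool):

{
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "tool_calls": [
      {"id": "call_1", "type": "function",
       "function": {"name": "createThing", "arguments": "{\"name\": \"demo\"}"}}
    ]}
  ],
  "tools": [
    {"type": "function", "function": {
      "name": "createThing",
      "description": "Create a thing.",
      "parameters": {
        "type": "object",
        "properties": {"name": {"type": "string"}},
        "required": ["name"]
      }
    }}
  ]
}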


If you’re able to provide more information, it may help figure out what’s wrong. This may include:

  • Examples from your dataset
  • The length of your dataset
  • Graphs from your fine-tuning job in the dashboard
  • Any hyperparameters used. If you didn’t specify any, does OpenAI tell you what was used?

In any case, it’s possible you may not need to fine-tune for your use case. What are base models struggling with that convinced you to fine-tune?

I am using mistralai/Mistral-7B-Instruct-v0.3. Its chat_template doesn't support system-prompt formatting, so both at inference time and when tuning with .apply_chat_template() we have to merge the system message into the user turn as system.content\n\nuser.content, because the chat template doesn't include system.content unless the last message in the conversation has role: user.
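
A minimal sketch of that merge, assuming the Hugging Face transformers tokenizer (the helper function is mine, not part of the library):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

def merge_system_into_user(messages):
    # Fold a leading system message into the first user turn, since the
    # chat template has no dedicated system slot. Assumes messages[1] is
    # the first user turn.
    if messages and messages[0]["role"] == "system":
        merged = dict(messages[1])
        merged["content"] = messages[0]["content"] + "\n\n" + messages[1]["content"]
        return [merged] + messages[2:]
    return messages

messages = [
    {"role": "system", "content": "You are Flite's event assistant."},
    {"role": "user", "content": "Flite for table booking?"},
]
prompt = tokenizer.apply_chat_template(
    merge_system_into_user(messages),
    tokenize=False,
    add_generation_prompt=True,
)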

Here is my dataset (only 400 examples, all tool-related, used in the phases after the first). It is in chat/conversational format like:

[
  [
    {
      "role": "system",
      "content": "You are Flite's event assistant. Given the user prompt: \"Flite for table booking?\" and your previous reply: \"Flite-city bridges digital creators and offline experiences through smart event tools. From ticketing to promoter payouts, Flite gives you full control over your event lifecycle.\", ensure all future responses stay accurate, helpful, and strictly within Flite's tools."
    },
    {
      "role": "user",
      "content": "Flite for table booking?"
    },
    {
      "role": "assistant",
      "content": "Flite-city bridges digital creators and offline experiences through smart event tools. From ticketing to promoter payouts, Flite gives you full control over your event lifecycle."
    }
  ],
  [
    {
      "role": "user",
      "content": "Help me set up a holiday sale code."
    },
    {
      "role": "assistant",
      "content": "I\u2019m ready! Please send me your event, desired code, discount type, value, and limit."
    },
    {
      "role": "user",
      "content": "Wait\u2014how can I enable NFC check-in for my event?"
    },
    {
      "role": "assistant",
      "content": "Enable NFC check-in by activating 'NFC Passes' in your event settings. Distribute NFC tickets and scan with the Flite app."
    }
  ],
  [
    {
      "role": "system",
      "content": "You are a helpful assistant for Flite. Help users create promo codes by collecting event, name, type (Fixed or Percentage), value, and limit. When all are received, respond only with the function_call JSON."
    },
    {
      "role": "user",
      "content": "I want a 30% discount code for event-fashion-show-spring"
    },
    {
      "role": "assistant",
      "content": "I can help you create a 30% discount code for event-fashion-show-spring. I need:\n\n- What would you like to name the promo code?\n- What's the limit?"
    },
    {
      "role": "user",
      "content": "Actually, let me do this for my tech conference. Event tech-ai-summit, name TECH, $30 off, 120 people."
    },
    {
      "role": "assistant",
      "content": "Please confirm the discount type (Fixed or Percentage)."
    },
    {
      "role": "user",
      "content": "Fixed"
    },
    {
      "role": "assistant",
      "content": "Lets get it done with the tool call.",
      "tool_calls": [
        {
          "type": "function",
          "function": {
            "name": "createPromoCode",
            "arguments": "{\"event\": \"tech-ai-summit\", \"name\": \"TECH\", \"type\": \"Fixed\", \"value\": 30, \"limit\": 120}"
          },
          "id": "t7u8v9w0x"
        }
      ]
    }
  ],
  ... and so on
]

Oh.

I don’t really know anything about Mistral or about fine-tuning non-OpenAI models. However, I know that 7B parameters is about as small as you can reasonably go before a model’s output becomes incoherent.

In other words, I wouldn’t expect to successfully use such a tiny model as an assistant. I believe you can accomplish what you’re trying to do with OpenAI base models, which are much larger and are called via API, so you don’t need giant GPU clusters.

It also seems like you’re trying to fine-tune knowledge about your platform into the model. Please be advised that this is very difficult to do. Fine-tuning can’t expand the model’s capacity, so for each thing a fine-tuned model learns, it must unlearn something else, and this can harm its accuracy. You’ll still need to explain your platform to the model in the prompt, even if briefly, to get any favorable results.

That aside, your prompts seem to be all over the place: your first prompt is structured oddly, your second prompt has no instructions at all, and your third prompt never provides the function schema. It’s usually a good idea to find a prompt that yields at least some accuracy before attempting to fine-tune the model and improve on it.