Fine-Tuning GPT-4o Mini for Bakery Chatbot with Function Calling

I’m working on a fine-tuned GPT-4o Mini model for a bakery chatbot that needs to:

  1. Greet customers with a specific welcome message: “Welcome to ABC Bakery, how can I help you?”
  2. Engage in conversation and call functions when customers ask about order status or want to place an order.
  3. Say goodbye using a predefined script.

I want to avoid adding all these instructions to the system prompt due to token limitations. Instead, I plan to fine-tune the model but need guidance on the best approach.

Main Questions:

  • Can a single fine-tuned model handle all three tasks, or do I need separate models for greeting, conversation (with function calling), and farewell?
  • If I can use one model, should I fine-tune it in stages (e.g., first greetings, then function calling, then farewells), or should I train it on all tasks at once?
  • How do I structure my fine-tuning dataset to ensure the model learns each behavior effectively?
  • Any best practices, tutorials, or references for fine-tuning GPT-4o Mini for this use case?

I’m new to fine-tuning and would appreciate any guidance. Thanks in advance!

Why would you need fine-tuning?

I think any of the current models (4o models, o3-mini, o1, etc.) can support this out of the box.

Fine-tuning is not really appropriate for this use case in my opinion.

What are your “token limitations?”?

If you are providing a relatively short “developer instruction” (depending on which API model your using, i.e. Completions vs Responses) that is around 1k-4k tokens, you should be in great shape… What you described could probably be accomplished in under 500-1k tokens…

The benefit of using a role: developer message or instructions within the Responses API is that you’ll be able to control specific instructions to the model that will override any user attempts to “get a different kind of response” from the interface.

So you simply:

  • Set your developer/message instructions

  • Define your tool calls/functions (if you really really want an absolutely standard greeting/goodbye (instead of allowing the model to ad-lib appropriately, which is really a good idea…) then you might want to even use a tool call for greeting/goodbye, this would realistically be the only way to ensure that you always return “the exact text” and that there’s no deviation… especially across model differences/different interaction cases

  • Test the app!

I do think I have to fine-tune because there are many edge cases where the model is not calling the function, and many cases where it is not retrieving the customer or saying goodbye using the official script. My system prompt has become way too long, and I’m only getting about 70% correct inference. So, I am definitely at a point where I need to fine-tune to teach it to use the function correctly.

I have a very important question. If I fine-tune a model to correctly perform function calling and later need to add a new function, can I fine-tune the existing model to include the new function, or do I have to retrain everything from scratch?

Also, how should I adjust the system prompt in the training set to accommodate these changes? Do I need to create a system prompt so large that it covers everything I’m training for, or is there a better approach?

I’m new to fine-tuning but experienced in prompt engineering. However, due to the growing number of functions and edge cases, I can no longer rely solely on prompt engineering. I need to fine-tune the model so that it aligns with our business goals at a fundamental level.

Okay to answer both your questions:

  1. Again, how long is your “system prompt” Are you using a role: system message or role: developer message?

The difference between them is depending on what model you use, but the role: system message is not very good compared to the role: developer message or passing “instructions” through the responses API.

So am I to presume that you are using “role: system” message with a model like “gpt 4o latest”? And this is why you are having trouble?

Yes, you can fine-tune a fine-tuned model. There is a specific section in the API documentation about that. https://platform.openai.com/docs/guides/fine-tuning/

https://platform.openai.com/docs/guides/fine-tuning/#can-i-continue-fine-tuning-a-model-that-has-already-been-fine-tuned

Most of your questions I think would be answered within the API documentation about specific fine-tuning.


What I’ve experienced with complex levels of function calls is that you have to split yourself up in an agentic way.

For example, in my system I have tools that:

  1. Read and write documents from the system
  2. Use the terminal
  3. Perform web searches
  4. Manage the context window of the API call itself

Etc.

However, even using the most powerful models to date (i.e. o1 with high reasoning and max input/output settings), these models simply can’t handle this level of complex tool calling, while still maintaing a good “flow”. They sort of get stuck into a track with recency bias, and fail to “step back and see the big picture of all they have access to” right?

So one way around it is making the application agentic. I.e. you have a single initial LLM call that acts as the orchestrator and says “Okay I know we are going to do this detailed process, but all I have to do is the simple process “call the agent that takes care of the complexity” (i.e. passing in the actual detail, evaluating the response from the system, potentially performing recursive checks to get it right ,etc.”.

This way your “orchestrator” handles the top-level logic of the conversation, and the dirty work happens in seperate LLM calls that don’t “muddy up the context window”.


Again, you don’t have to rely explicitly on the system_prompt itself either, if you do a go prompt-engineering and Retrieval-Augmented-Generation system without using a fine-tune:

  • Are you showing the full conversation to the end user? Probably not, it sounds like you are just displaying one-message at (i.e. one assistant/LLM response) at a time to the user, and then the user continues the interaction, but isn’t seeing a “chat style” context window. Even if they are, you can still control what is/or isn’t visible on the context window.

Basically, from any given standpoint, you can design a system that detects and injects additional prompts where necssary to sort of “bolster” the conversation that the LLM is having with the user. For example. If you can parse that you are triggering a goodbye message before the prompt hits the LLM or even catch the goodbye message from the LLM before it’s shown to the user and perform a check that tells you if it’s correct or not and then simply, if not, calls the LLM again with a super specific prompt that says “Please say goodbye now in this way: …”.


I’ve personally found again that the models are so general, so vast, that their behavior is not only potentially unpredictable but essentially unsuited for something that overall must be handled at a programmatic/system engineering level. You can USE the LLM as a part of the system, but treating the LLM as the “system itself” and hoping that eventually you can get it to perform “perfectly all of the time”. Is pretty much a moot point - my opinion is that’s not going to work, and even if it does… the models themselves are changing/evolving because of all the users who have opted in to data feedback. So the model is dynamic. It’s like telling your employee to say goodbye the same way everytime. Well over time their personality changes, and they might start saying it differently… even if you tell them otherwise… it’s just a tricky business, using these LLM’s to get dedicated, replicable responses is really not what they are designed for.

BUT, if you make your system flexible, and install your own “middleware level” of checks and balances to “watch the LLM’s performance and intervene (unseen to the user)” under certain edge case circumstances, to provide a prompt “guiding the LLM back to desired pathway”… then you might have replicable success..

I want to sincerely thank you for your response. The links you provided have been incredibly helpful in understanding your agent-based methodology. I truly appreciate this and the time you took to write it.

I promise this will be my last question! I’ve decided to follow your advice and explore an agent-based architecture. This is new to me, so I’d love your guidance.

Which framework or API do you recommend? Which one do you personally use? Based on your experience, is there any approach you’d advise me to avoid? And which specific path would you recommend I take to study this effectively?

If you could also share any additional resources or links, I would be very grateful.

Would you recommend LangGraph, Semantic Kernel, or MCP?

Thank you again—I really appreciate it!

1 Like

That’s very kind of you to say. I’m glad to hear that the information was helpful.

Take a look at these links:

https://openai.com/index/new-tools-for-building-agents/

https://platform.openai.com/docs/guides/agents


Personally, I’m not using either the Agents SDK nor any of the program you mentioned.

I built my own application within python, using the FastAPI uvicorn server to manage async calls and concurrent coroutines.

I built my own custom system of orchestration and agent calls, before the new Agents SDK was released.

So I don’t have much personal experience with the new frameworks, but if you search around you will possibly find others that have used it. As it’s so new, it takes a fair amount of figuring things out (the agents SDK).


I’m sure that the LangChain based solutions and the others you mentioned are likely relevant/appropriate for you, I wish I had more experience to offer in that regard, but I don’t.

Best of luck!