Okay, to answer both your questions:
- Again, how long is your “system prompt”? Are you using a “role: system” message or a “role: developer” message?
The difference between them depends on which model you use, but the “role: system” message tends to carry less weight than the “role: developer” message or passing “instructions” through the Responses API.
So am I to presume that you are using a “role: system” message with a model like “gpt-4o-latest”? And that this is why you are having trouble?
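Just to illustrate (a rough sketch; the model name and prompt text are placeholders, and the “developer” role is only honored as a distinct role on newer models):

```python
from openai import OpenAI

client = OpenAI()

# Chat Completions with a "developer" message (newer models give this
# higher priority than a plain "system" message).
chat = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {"role": "developer", "content": "You are a concise support assistant."},
        {"role": "user", "content": "How do I reset my password?"},
    ],
)
print(chat.choices[0].message.content)

# Responses API: the same guidance passed via "instructions".
resp = client.responses.create(
    model="gpt-4o",  # placeholder model name
    instructions="You are a concise support assistant.",
    input="How do I reset my password?",
)
print(resp.output_text)
```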
Yes, you can fine-tune a fine-tuned model. There is a specific section in the API documentation about that. https://platform.openai.com/docs/guides/fine-tuning/
https://platform.openai.com/docs/guides/fine-tuning/#can-i-continue-fine-tuning-a-model-that-has-already-been-fine-tuned
I think most of your specific fine-tuning questions would be answered in the API documentation.
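For what it’s worth, continuing a fine-tune is just a matter of passing the existing fine-tuned model’s ID as the base model when you create the next job (a minimal sketch; the file and model IDs below are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# Placeholder IDs: use your own uploaded JSONL training file and your existing
# fine-tuned model ID as the base model for the next round of training.
job = client.fine_tuning.jobs.create(
    training_file="file-abc123",
    model="ft:gpt-4o-mini-2024-07-18:my-org:first-pass:abc123",
)
print(job.id, job.status)
```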
What I’ve experienced with complex function calling is that you have to split the work up in an agentic way.
For example, in my system I have tools that:
- Read and write documents from the system
- Use the terminal
- Perform web searches
- Manage the context window of the API call itself
Etc.
However, even using the most powerful models to date (e.g. o1 with high reasoning effort and max input/output settings), these models simply can’t handle this level of complex tool calling while still maintaining a good “flow”. They tend to get locked into a track by recency bias and fail to “step back and see the big picture” of everything they have access to, right?
So one way around it is to make the application agentic. I.e. you have a single initial LLM call that acts as the orchestrator and says, “Okay, I know we’re going to do this detailed process, but all I have to do is the simple step of calling the agent that takes care of the complexity” (i.e. passing in the actual detail, evaluating the response from the system, potentially performing recursive checks to get it right, etc.).
This way your “orchestrator” handles the top-level logic of the conversation, and the dirty work happens in separate LLM calls that don’t “muddy up the context window”.
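Roughly, the shape of it is something like this (just a sketch; SUB_AGENTS, run_sub_agent and the one-line routing scheme are made-up stand-ins for however you actually implement the delegation):

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical specialists, each with its own small instruction set; each one
# runs in its own isolated call, so its messy intermediate steps never land
# in the orchestrator's context window.
SUB_AGENTS = {
    "documents": "You read and write documents. Return only the final result.",
    "terminal": "You run terminal commands. Return only the final result.",
    "search": "You perform web searches. Return only the final result.",
}

def run_sub_agent(name: str, task: str) -> str:
    """One isolated LLM call per delegated task (placeholder implementation)."""
    resp = client.responses.create(
        model="gpt-4o",  # placeholder model
        instructions=SUB_AGENTS[name],
        input=task,
    )
    return resp.output_text

def orchestrate(user_request: str) -> str:
    """Top-level call: pick a specialist, delegate, and return the clean result."""
    plan = client.responses.create(
        model="gpt-4o",
        instructions=(
            "You are the orchestrator. Pick exactly one specialist from "
            f"{list(SUB_AGENTS)} and write a one-line task for it. "
            "Reply as: <specialist>: <task>"
        ),
        input=user_request,
    ).output_text
    specialist, _, task = plan.partition(":")
    specialist = specialist.strip().lower()
    if specialist not in SUB_AGENTS:  # fall back if the plan comes back malformed
        return plan
    # The orchestrator only ever sees the cleaned-up result, not the
    # sub-agent's full transcript, so its own context stays small.
    return run_sub_agent(specialist, task.strip())
```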
Again, you don’t have to rely explicitly on the system_prompt itself either, if you do good prompt engineering and build a Retrieval-Augmented Generation (RAG) setup without using a fine-tune:
- Are you showing the full conversation to the end user? Probably not; it sounds like you are just displaying one message (i.e. one assistant/LLM response) at a time to the user, and then the user continues the interaction without seeing a “chat-style” context window. Even if they are, you can still control what is or isn’t visible in the context window.
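For example (a toy sketch; the hidden instructions and model name are placeholders): the full message list lives on your side, and the user only ever sees the latest assistant reply, so you’re free to keep or inject whatever you like in between.

```python
from openai import OpenAI

client = OpenAI()

# The full history, including hidden guidance, lives server-side only.
history = [
    {"role": "system", "content": "Internal instructions the user never sees."},
]

def turn(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    # You could also inject extra hidden guidance here before the call.
    reply = client.chat.completions.create(
        model="gpt-4o",  # placeholder model
        messages=history,
    ).choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply  # only this string is rendered to the user
```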
Basically, from any given standpoint, you can design a system that detects and injects additional prompts where necessary to sort of “bolster” the conversation the LLM is having with the user. For example: you can detect that a goodbye is about to be triggered before the prompt hits the LLM, or catch the goodbye message from the LLM before it’s shown to the user, run a check that tells you whether it’s correct, and if not, simply call the LLM again with a super-specific prompt that says “Please say goodbye now in this way: …”.
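Something like this (just a sketch; is_goodbye and REQUIRED_GOODBYE are stand-ins for whatever your real detection rule and required wording are):

```python
from openai import OpenAI

client = OpenAI()

REQUIRED_GOODBYE = "Thank you for contacting us. Goodbye!"  # your required wording

def is_goodbye(text: str) -> bool:
    """Cheap programmatic check; this could just as well be another small LLM call."""
    return "goodbye" in text.lower() or "bye" in text.lower()

def get_reply(messages: list[dict]) -> str:
    reply = client.chat.completions.create(
        model="gpt-4o",  # placeholder model
        messages=messages,
    ).choices[0].message.content

    # Middleware check: the user never sees this first draft. If the model is
    # saying goodbye but not in the required wording, call it again with a
    # super-specific corrective prompt before anything is displayed.
    if is_goodbye(reply) and REQUIRED_GOODBYE not in reply:
        reply = client.chat.completions.create(
            model="gpt-4o",
            messages=messages + [
                {"role": "assistant", "content": reply},
                {"role": "user", "content": f"Please say goodbye now exactly like this: {REQUIRED_GOODBYE}"},
            ],
        ).choices[0].message.content
    return reply
```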
I’ve personally found, again, that the models are so general and so vast that their behavior is not only potentially unpredictable but essentially unsuited for something that must ultimately be handled at a programmatic/system-engineering level. You can USE the LLM as a part of the system, but treating the LLM as the “system itself” and hoping that eventually you can get it to perform perfectly all of the time is pretty much a non-starter. My opinion is that it’s not going to work, and even if it does, the models themselves are changing/evolving because of all the users who have opted in to data feedback. So the model is dynamic. It’s like telling your employee to say goodbye the same way every time: over time their personality changes, and they might start saying it differently even if you tell them otherwise. It’s just a tricky business; getting dedicated, replicable responses is really not what these LLMs are designed for.
BUT, if you make your system flexible and install your own “middleware” level of checks and balances to watch the LLM’s performance and intervene (unseen by the user) under certain edge-case circumstances, providing a prompt that guides the LLM back to the desired pathway… then you might have replicable success.