Share your prompt. It’s impossible to diagnose prompt issues without knowing what the prompt is. Just share it and someone will likely be able to point you in the right direction.
Wait… This isn’t a GPT-4 issue? You’re using some open-source model that you’ve then fine-tuned further.
Yeah… maybe someone will be able to give you some advice on the prompt, but you’re literally off the map here. No one other than you has any intuition for your model’s behavior.
If you’re just asking this in a general sense, the answer is that GPT models are stochastic token-predicting machines. There’s always going to be some degree of unexpected behavior—by design.
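To make that concrete, here’s a toy sketch of what stochastic token prediction means in practice. The numbers are made up and this isn’t any real model’s output, it’s just the mechanism:

```python
import torch

# Toy next-token distribution over a 4-token vocabulary (made-up numbers).
logits = torch.tensor([2.0, 1.5, 0.5, 0.1])
probs = torch.softmax(logits, dim=-1)

# The model *samples* from this distribution rather than always taking the
# single most likely token, so repeated runs can legitimately pick
# different tokens for the same prompt.
for _ in range(3):
    token_id = torch.multinomial(probs, num_samples=1).item()
    print(token_id)
```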
Even if you do everything you can to clamp things down (setting a seed, using zero temperature, using a very small top-p, and so on), you can still get non-deterministic results. Guaranteeing bit-for-bit reproducibility in a heavily parallelized environment is genuinely hard, especially when the computation runs on a GPU, where the order of floating-point operations can vary between runs.
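For what it’s worth, here’s a rough sketch of what that clamping down might look like with a Hugging Face transformers model. The model name, prompt, and settings are placeholders, not your actual setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

# Seed the Python, NumPy, torch and CUDA RNGs in one call.
set_seed(42)

# Ask PyTorch to prefer deterministic kernels; warn_only=True because some
# ops simply don't have deterministic implementations.
torch.use_deterministic_algorithms(True, warn_only=True)

# "your-finetuned-model" is a placeholder for whatever checkpoint you're using.
tokenizer = AutoTokenizer.from_pretrained("your-finetuned-model")
model = AutoModelForCausalLM.from_pretrained("your-finetuned-model").eval()

inputs = tokenizer("Your prompt here", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        do_sample=False,     # greedy decoding, i.e. effectively temperature zero
        max_new_tokens=128,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Even with all of that, GPU kernel scheduling, batching, and library or driver versions can still shift results between runs, which is exactly the point above.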
In short, you’ll likely never be able to guarantee 100% reproducible results. The best you can realistically aim for is to reduce the incidence rate of bad results. For anyone here to have a hope of helping you do that, though, they’re going to need a lot more information from you.