Inconsistencies in GPT-3.5-turbo Model Behavior During Load Testing

Hello all! Hope you're having a nice weekend.
I'm currently working on implementing a consistent customer service flow using the GPT-3.5-turbo model through the Azure OpenAI Studio API. I've configured the model with a temperature of 0 and a top_p of 0.1 to get responses that are as deterministic as possible.
I've conducted load tests to analyze the consistency of the generated flow and answers. However, since the flow is quite complicated and involves many steps and branches, GPT-3.5 does have trouble following the instructions completely. After days of work I've managed to make it as stable as possible, but only within certain periods of the day.
Specifically: when I run a load test against my model at, say, 10 am, 9 out of 10 test results are almost identical and behave correctly.
However, when I run the test again 2–3 hours later, only 1 of the results behaves correctly and the other 9 behave completely differently from before.
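For context, here is roughly how I measure "behave correctly" across a load test run. This is just a sketch of the idea (the real flow checks are more involved): I normalize each response and hash it, so I can count how many distinct answers the model produced for the same prompt.

```python
import hashlib
from collections import Counter

def response_clusters(responses):
    """Group responses by a hash of their normalized text, so a load
    test can report how many distinct answers the model produced."""
    def key(text):
        # Collapse whitespace and case so trivial formatting
        # differences don't count as separate answers.
        norm = " ".join(text.split()).lower()
        return hashlib.sha256(norm.encode()).hexdigest()[:12]
    return Counter(key(r) for r in responses)

# Example shaped like my 10 am runs: 9 identical answers, 1 outlier.
runs = ["Step 1: verify the account."] * 9 + ["Please hold on."]
clusters = response_clusters(runs)
print(clusters)  # two clusters, the dominant one of size 9
```

In the morning runs the dominant cluster covers ~9 of 10 responses; in the afternoon runs it collapses to 1.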
Since the temperature and top_p are really low, the only thing I can imagine causing the difference is the seed value. It looks as if, during certain periods, the model uses the same seed, and after some time the seed changes, which generates completely different results. (I'm using the OpenAI client 0.27.x, which doesn't support the seed parameter at all.)
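For anyone reading along: newer API versions accept a `seed` parameter on chat completions (best-effort reproducibility, not a hard guarantee) and return a `system_fingerprint` that changes when the serving backend configuration changes, which is exactly the kind of drift described here. A minimal sketch of the request I'd build with the openai>=1.x Python SDK (deployment name and message are placeholders; whether `seed` is honored for a given Azure deployment depends on the `api-version`):

```python
# Sketch only: the angle-bracket values are placeholders, and this
# assumes the openai>=1.x client rather than the 0.27.x one.
request = dict(
    model="<deployment-name>",  # on Azure this is the deployment name
    messages=[{"role": "user", "content": "Start the service flow."}],
    temperature=0,
    top_p=0.1,
    seed=42,  # best-effort determinism; not guaranteed
)
# With a configured client you'd then call:
#   resp = client.chat.completions.create(**request)
# and log resp.system_fingerprint on every run. If the fingerprint
# differs between the 10 am and 1 pm load tests, the backend changed,
# which would explain the drift even with identical sampling settings.
print(request["seed"])
```

Comparing `system_fingerprint` across the morning and afternoon runs would confirm or rule out a backend change as the cause.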

I haven’t found relevant documentation or discussions on this topic, and I’m reaching out to the community for insights.
Has anyone experienced a similar issue, or does anyone have information or resources that could shed light on this behavior?

Thanks so much