Inconsistency in OpenAI's responses to the same prompt

Good morning.

I am testing OpenAI’s API to generate unit-test suggestions for C# classes. I have written a Python script that reads a .cs file and sends the following prompt:

"role": "system", "content": "You are an expert in .NET 6 and C#"
"role": "user", "content": "Create a unit test with xUnit, Shouldly, and Moq for the following class: "

I am using the OpenAI library in Python (openai.ChatCompletion.create) with the gpt-3.5-turbo model.
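
Roughly, the call looks like this (just a sketch of what the script does; the API key and file name are placeholders):

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

# Read the class under test (file name is only an example)
with open("MyClass.cs", encoding="utf-8") as f:
    cs_source = f.read()

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are an expert in .NET 6 and C#"},
        {"role": "user", "content": "Create a unit test with xUnit, Shouldly, and Moq "
                                    "for the following class: " + cs_source},
    ],
)

print(response["choices"][0]["message"]["content"])
```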

The issue is that sometimes it generates a test class, but the generated classes differ from one run to the next. Other times it gives completely random responses that are unrelated to the prompt. Sometimes it says that it needs more information or that we haven’t provided the class (which is not true, because the prompt is exactly the same). It has even cited ethics, refusing to help me “cheat”.

Is there any way to improve this? I understand that it can generate different responses and will not always do exactly the same thing, but we do need the results to be consistent with what we asked for (if we ask for tests, it should generate tests, even if they differ from request to request).

Best regards.

This is usually a case of the temperature and top_p parameters, which determine the randomness of the generated output. For code, a lower temperature is recommended, as it keeps the output more deterministic.
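
For example (a sketch with illustrative values, assuming the same legacy openai library and placeholder file/key as in your script; tune the temperature to taste):

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder
cs_source = open("MyClass.cs", encoding="utf-8").read()  # class under test (example path)

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    temperature=0.2,   # closer to 0 = less random, more repeatable code
    top_p=1.0,         # generally adjust temperature or top_p, not both
    messages=[
        {"role": "system", "content": "You are an expert in .NET 6 and C#"},
        {"role": "user", "content": "Create a unit test with xUnit, Shouldly, and Moq "
                                    "for the following class: " + cs_source},
    ],
)
```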

Another point I would add is that GPT-4 is much better at code generation than 3.5, so at lower temperatures you can expect noticeably higher-quality code.

A small suggestion: you could change the system content to “You are a senior test case developer in .NET 6 and C#”. A targeted system message is likely to produce better results, though the effect is more pronounced with GPT-4; GPT-3.5 pays less attention to the system message.
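
Something like this (just a sketch; build_messages is a made-up helper name and the exact wording is up to you):

```python
def build_messages(cs_source: str) -> list[dict]:
    """Build the chat messages with a more targeted system role (sketch)."""
    return [
        {"role": "system",
         "content": "You are a senior test case developer in .NET 6 and C#"},
        {"role": "user",
         "content": "Create a unit test with xUnit, Shouldly, and Moq "
                    "for the following class: " + cs_source},
    ]
```

You would then pass the result as the messages argument to the same ChatCompletion.create call.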


Thank you very much for the response; I will try those tips.