How to do repeatable testing for ChatGPT prompts?

I am going in circles with gpt-3.5-turbo. Every day I make some modification to the system prompt to prevent one of the following behaviours:

  1. adding greetings and filler before the answer (“Sure, I can tell you how to do X”) instead of outputting the answer directly
  2. wrapping the answer in quotation marks (“hello”) instead of outputting it bare (hello)
  3. adding unrequested explanations (“This is how X works”) that pad out the answer

After making the modification, I test it on a couple of sample inputs, it seems to work fine, and I commit it. Then the next day someone comes up with a new sample input for which the system prompt doesn’t work - ChatGPT doesn’t follow the instruction I gave. Now I have to modify the prompt and re-test all my sample inputs all over again.

So, my question: is there a repeatable testing harness for ChatGPT - one where I can add my sample inputs, run the prompt on all of them, and check that the outputs satisfy some requirements (like the three requirements above)?

I can write this myself, but I’m hoping someone else has thought through this first :slightly_smiling_face:
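To be concrete, here is a rough sketch of what I imagine (Python; `ask` stands in for whatever function actually calls the model, and the regexes are just my first guesses at detecting the three behaviours):

```python
import re


def violations(output: str) -> list[str]:
    """Check one model output against my three requirements."""
    problems = []
    s = output.strip()
    # 1. Filler/greeting before the answer ("Sure, ...", "Certainly, ...")
    if re.match(r"^(sure|certainly|of course|here('s| is))\b", s, re.IGNORECASE):
        problems.append("starts with filler")
    # 2. Whole answer wrapped in quotation marks (straight or curly)
    if len(s) >= 2 and s[0] in '"\u201c' and s[-1] in '"\u201d':
        problems.append("wrapped in quotes")
    # 3. Unrequested explanation tacked on ("This is how X works" style)
    if re.search(r"\bThis is (how|because)\b", s):
        problems.append("contains explanation")
    return problems


def run_suite(ask, system_prompt: str, cases: list[str]) -> dict[str, list[str]]:
    """Run every sample input through `ask` and collect violations per case.

    `ask(system_prompt, user_input)` is a placeholder for whatever thin
    wrapper you have around the chat completions call.
    """
    return {case: violations(ask(system_prompt, case)) for case in cases}
```

A passing run would be one where every case maps to an empty violation list; any non-empty list tells me which requirement the new prompt broke.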

Follow-up question: ChatGPT answers are non-deterministic, so sometimes an underlying issue is only revealed on the fifth or sixth submission of the same prompt. Is there any way to surface these issues quickly, on the first or second test run? By issues, I mean instances where ChatGPT doesn’t follow the system prompt’s instructions (the three examples above).