Best methods for regression testing our own OpenAI Assistant throughout development?

We’re developing our own company-internal OpenAI Assistant using the API, including file search and function calling. Everything’s going great, but as our expectations and use cases expand, it’s becoming tougher to verify that the Assistant still gives the desired responses to the common prompts we expect from end users. We don’t need exact matches, obviously; we just want to confirm that the Assistant calls the right functions with the right parameters, that responses include certain expected keywords, that they’re framed in the desired format, and so on.
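For the function-calling side, I assume we can get part of the way with plain assertions on the tool calls a run pauses on. Something like this rough sketch, where the assistant ID, the prompt, and the expected function name/arguments are made-up placeholders for our own tools:

```python
import json

from openai import OpenAI

client = OpenAI()
ASSISTANT_ID = "asst_..."  # placeholder for our assistant's ID


def run_prompt(prompt: str):
    """Start a fresh thread, send one user prompt, and poll the run until it finishes or pauses."""
    thread = client.beta.threads.create()
    client.beta.threads.messages.create(thread_id=thread.id, role="user", content=prompt)
    run = client.beta.threads.runs.create_and_poll(thread_id=thread.id, assistant_id=ASSISTANT_ID)
    return thread, run


def assert_function_call(run, expected_name: str, expected_args: dict) -> None:
    """Check that the run paused to call the expected function with the expected arguments."""
    assert run.status == "requires_action", f"run ended with status {run.status!r}"
    tool_calls = run.required_action.submit_tool_outputs.tool_calls
    names = [tc.function.name for tc in tool_calls]
    assert expected_name in names, f"expected a call to {expected_name}, got {names}"
    call = next(tc for tc in tool_calls if tc.function.name == expected_name)
    assert json.loads(call.function.arguments) == expected_args, call.function.arguments


# Hypothetical example against one of our own tools:
thread, run = run_prompt("What's the current headcount in the Denver office?")
assert_function_call(run, "get_headcount", {"office": "denver"})
```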

Normally this is the kind of thing we’d cover with unit tests, but exact assertions don’t quite fit the free-text responses of an AI-based chatbot. Is there a library or technique that’s recommended for this kind of thing?

The only thing I can think of would be to design test cases and allow the chatbot itself to evaluate whether its previous response fits the expected output. For example (rough code sketch after the list):

Test case prompt: Give me a list of all 50 US states, with each item containing the name of the state and its two-letter abbreviation.
Test case evaluations:

  • Does the response contain a list of 50 entries?
  • Does the list include an item with Delaware, and the abbreviation DE?
  • Does the response contain a table? (expecting no)
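
Here’s roughly how I picture wiring that up: post the evaluation questions back into the same thread so the assistant grades its own earlier answer. This is only a sketch of the idea; the assistant ID, helper names, and test framing are placeholders I’ve made up:

```python
from openai import OpenAI

client = OpenAI()
ASSISTANT_ID = "asst_..."  # placeholder, same assistant under test

YES_NO_SUFFIX = " Answer succinctly with only 'yes' or 'no'."


def ask_in_thread(thread_id: str, question: str) -> str:
    """Post a follow-up user message in an existing thread, run it, and return the reply text."""
    client.beta.threads.messages.create(thread_id=thread_id, role="user", content=question)
    client.beta.threads.runs.create_and_poll(thread_id=thread_id, assistant_id=ASSISTANT_ID)
    latest = client.beta.threads.messages.list(thread_id=thread_id, order="desc", limit=1)
    return latest.data[0].content[0].text.value


def check(thread_id: str, question: str, expected: bool) -> None:
    """Ask the assistant to grade its own previous answer with a yes/no question."""
    answer = ask_in_thread(thread_id, question + YES_NO_SUFFIX)
    got = answer.strip().lower().startswith("yes")
    assert got == expected, f"{question!r}: expected {expected}, assistant said {answer!r}"


def test_us_state_list(thread_id: str):
    # thread_id points at a thread that already contains the "list all 50 US states" exchange
    check(thread_id, "Does your previous response contain a list of exactly 50 entries?", True)
    check(thread_id, "Does the list include Delaware with the abbreviation DE?", True)
    check(thread_id, "Is your previous response formatted as a table?", False)
```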

Something like this would probably work, with some boilerplate instructions telling the assistant to answer succinctly with a yes or no. It feels like I might be reinventing the wheel here, though. Is there a better or more commonly accepted way of doing this?