Best methods for regression testing our own OpenAI Assistant throughout development?

We’re developing our own company-internal OpenAI Assistant using the API, including file search and function calling. Everything’s going great, but as our expectations and use cases expand, it’s becoming tougher to verify that the Assistant still gives the desired responses to the common prompts we expect from end users. We don’t need exact matches, obviously; we just want to confirm that the Assistant calls the right functions with the right parameters, that responses include certain expected keywords, that they’re framed in the desired format, and so on.
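For the function-calling side, I assume we can get part of the way with plain assertions on the tool calls a run pauses on. Something like this rough sketch, where the assistant ID, the prompt, and the expected function name/arguments are made-up placeholders for our own tools:

```python
import json

from openai import OpenAI

client = OpenAI()
ASSISTANT_ID = "asst_..."  # placeholder for our assistant's ID


def run_prompt(prompt: str):
    """Start a fresh thread, send one user prompt, and poll the run until it finishes or pauses."""
    thread = client.beta.threads.create()
    client.beta.threads.messages.create(thread_id=thread.id, role="user", content=prompt)
    run = client.beta.threads.runs.create_and_poll(thread_id=thread.id, assistant_id=ASSISTANT_ID)
    return thread, run


def assert_function_call(run, expected_name: str, expected_args: dict) -> None:
    """Check that the run paused to call the expected function with the expected arguments."""
    assert run.status == "requires_action", f"run ended with status {run.status!r}"
    tool_calls = run.required_action.submit_tool_outputs.tool_calls
    names = [tc.function.name for tc in tool_calls]
    assert expected_name in names, f"expected a call to {expected_name}, got {names}"
    call = next(tc for tc in tool_calls if tc.function.name == expected_name)
    assert json.loads(call.function.arguments) == expected_args, call.function.arguments


# Hypothetical example against one of our own tools:
thread, run = run_prompt("What's the current headcount in the Denver office?")
assert_function_call(run, "get_headcount", {"office": "denver"})
```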

Normally this is the kind of thing we’d cover with unit tests, but exact assertions don’t quite fit the free-text responses of an AI-based chatbot. Is there a library or technique that’s recommended for this kind of thing?

The only thing I can think of would be to design test cases and allow the chatbot itself to evaluate whether its previous response fits the expected output. For example (rough code sketch after the list):

Test case prompt: Give me a list of all 50 US states, with each item containing the name of the state and its two-letter abbreviation.
Test case evaluations:

  • Does the response contain a list of 50 entries?
  • Does the list include an item with Delaware, and the abbreviation DE?
  • Does the response contain a table? (expecting no)
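
Here’s roughly how I picture wiring that up: post the evaluation questions back into the same thread so the assistant grades its own earlier answer. This is only a sketch of the idea; the assistant ID, helper names, and test framing are placeholders I’ve made up:

```python
from openai import OpenAI

client = OpenAI()
ASSISTANT_ID = "asst_..."  # placeholder, same assistant under test

YES_NO_SUFFIX = " Answer succinctly with only 'yes' or 'no'."


def ask_in_thread(thread_id: str, question: str) -> str:
    """Post a follow-up user message in an existing thread, run it, and return the reply text."""
    client.beta.threads.messages.create(thread_id=thread_id, role="user", content=question)
    client.beta.threads.runs.create_and_poll(thread_id=thread_id, assistant_id=ASSISTANT_ID)
    latest = client.beta.threads.messages.list(thread_id=thread_id, order="desc", limit=1)
    return latest.data[0].content[0].text.value


def check(thread_id: str, question: str, expected: bool) -> None:
    """Ask the assistant to grade its own previous answer with a yes/no question."""
    answer = ask_in_thread(thread_id, question + YES_NO_SUFFIX)
    got = answer.strip().lower().startswith("yes")
    assert got == expected, f"{question!r}: expected {expected}, assistant said {answer!r}"


def test_us_state_list(thread_id: str):
    # thread_id points at a thread that already contains the "list all 50 US states" exchange
    check(thread_id, "Does your previous response contain a list of exactly 50 entries?", True)
    check(thread_id, "Does the list include Delaware with the abbreviation DE?", True)
    check(thread_id, "Is your previous response formatted as a table?", False)
```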

Something like this would probably work, with some boilerplate instructions telling the assistant to answer succinctly with a yes or no. It feels like I might be reinventing the wheel here, though. Is there a better or more commonly accepted way of doing this?