What's your process for automated testing for AI agents?

I’m working on an AI agent capable of reasoning, using external tools, and providing a final answer based on the tool output. The agent uses function calling as well.

I’ve gotten to the point where I’ve developed an initial automated testing procedure:

  1. Iterate over a question bank (questions I expect the agent to answer), asking each question 3 times (I want to judge the similarity of the responses).

  2. Count how many questions get answers on all 3 runs (100% answers), answers on more than half of the runs (> 50% answers), or errors on all 3 runs (100% errors).

  3. Calculate a similarity score (using GPT-4) for each set of responses to a question. I want this to be high so that answers are consistent. A rough sketch of the whole loop is below.
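
Roughly, the loop looks like this (ask_agent() and gpt4_similarity() are stand-ins for the real agent call and the GPT-4 similarity-scoring prompt):

```python
# Rough sketch of the test loop; ask_agent() and gpt4_similarity() stand in
# for the real agent call and the GPT-4 similarity-scoring prompt.
from collections import Counter

RUNS_PER_QUESTION = 3

def run_eval(question_bank, ask_agent, gpt4_similarity):
    summary = Counter()
    similarity_scores = {}
    for question in question_bank:
        responses, errors = [], 0
        for _ in range(RUNS_PER_QUESTION):
            try:
                responses.append(ask_agent(question))
            except Exception:
                errors += 1
        answered = len(responses)
        if answered == RUNS_PER_QUESTION:
            summary["100% answers"] += 1
        if answered > RUNS_PER_QUESTION / 2:
            summary["> 50% answers"] += 1
        if errors == RUNS_PER_QUESTION:
            summary["100% errors"] += 1
        if answered >= 2:
            # Ask GPT-4 to rate how similar the set of responses is (0-1).
            similarity_scores[question] = gpt4_similarity(responses)
    return summary, similarity_scores
```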

Curious how others are testing their agents to ensure end users experience some consistency?

Below is some example output for a single test run:

2 Likes

I have a much more unit-test-driven approach. I usually send my AI through several queries before the user gets a final answer: does the user’s question violate policy, what strategy should we use to deal with the customer, and so on. Then I aggregate those responses into a prompt for the main agent. I only run checks on the steps up until the final aggregated response (rough sketch below).

I haven’t really found a way to test the final aggregated response, but I like your approach and I’ll probably incorporate at least part of it in some way. In general, though, my models have been pretty stable if you test and stabilize the originating responses.
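
A minimal sketch of the shape of it, assuming a call_llm() helper that wraps the chat API; only the intermediate steps get asserted on:

```python
# Minimal sketch, assuming a call_llm(prompt) helper that wraps the chat API.
# Only the intermediate steps are asserted on; the final answer is not tested.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your chat completion call")

def violates_policy(user_message: str) -> bool:
    verdict = call_llm(f"Does this message violate policy? Answer yes or no.\n{user_message}")
    return verdict.strip().lower().startswith("yes")

def suggest_strategy(user_message: str) -> str:
    return call_llm(f"Suggest a strategy for handling this customer message:\n{user_message}")

def build_main_prompt(user_message: str) -> str:
    # Aggregate the intermediate results into the prompt for the main agent.
    return (
        f"Policy violation: {violates_policy(user_message)}\n"
        f"Strategy: {suggest_strategy(user_message)}\n"
        f"Customer message: {user_message}"
    )

# pytest-style checks on the steps up to (but not including) the final answer
def test_policy_step_allows_refund_question():
    assert violates_policy("How do I get a refund?") is False

def test_strategy_step_mentions_refund():
    assert "refund" in suggest_strategy("How do I get a refund?").lower()
```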

2 Likes

Thanks - that’s an interesting approach! I need to think through whether unit tests make sense in my scenario. Obviously pass/fail would be easier.

1 Like

For a chatbot, I have a long Jupyter notebook with chat examples that are analogous to your question bank.

Asserts are there to test for regressions, and this works really well to test function calling. It’s also helpful to scan through the actual chats to see if any language looks awkward.
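
A typical cell looks roughly like this (run_chat() is a stand-in for my wrapper around the agent, and the expected tool call comes from that particular example):

```python
# Typical notebook cell: assert that a chat example triggers the expected
# function call. run_chat() is a stand-in for my wrapper around the agent
# (defined in an earlier cell); result is whatever that wrapper returns.
result = run_chat("What's the weather in Paris tomorrow?")

tool_call = result.tool_calls[0]
assert tool_call.name == "get_weather"
assert tool_call.arguments["city"] == "Paris"

# Then eyeball the final text for awkward wording.
print(result.final_text)
```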

But this is just now starting to get unwieldy. I’ll need to make some changes for proper CI soon.

I also have a few unit tests with the LLM calls mocked. Not super valuable, but they were easy to write with ChatGPT.
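
They look something like this, with agent.call_llm() and agent.answer_question() standing in for my actual wrapper module:

```python
# One of the mocked unit tests: the LLM call is patched out, so this only
# exercises my own plumbing. agent.answer_question() and agent.call_llm()
# are hypothetical names for my wrapper module.
from unittest.mock import patch

from agent import answer_question

def test_answer_question_passes_tool_output_through():
    with patch("agent.call_llm", return_value="Your balance is $42."):
        answer = answer_question("What is my balance?")
    assert "$42" in answer
```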

It is a good idea to use GPT-4 to calculate similarity among the response set. I would save the Q&A pairs in a database, so the agent searches the Q&A DB first. However, making sure the correct answer is saved is another challenge; I don’t think similar responses necessarily mean they are correct answers.
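
As a sketch of the caching idea (SQLite as an example store, exact-match lookup only; in practice you would probably want embedding or fuzzy matching):

```python
# Sketch of the Q&A cache idea using SQLite; exact-match lookup shown here,
# though fuzzy or embedding-based matching would be more realistic.
import sqlite3

conn = sqlite3.connect("qa_cache.db")
conn.execute("CREATE TABLE IF NOT EXISTS qa (question TEXT PRIMARY KEY, answer TEXT)")

def answer_with_cache(question: str, ask_agent) -> str:
    row = conn.execute(
        "SELECT answer FROM qa WHERE question = ?", (question,)
    ).fetchone()
    if row:
        return row[0]                      # serve the saved (hopefully correct) answer
    answer = ask_agent(question)
    conn.execute("INSERT OR REPLACE INTO qa VALUES (?, ?)", (question, answer))
    conn.commit()
    return answer
```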

Great point. I’m testing against real-time data right now, so the responses could change. I could record the API calls the agent makes (there are quite a few) so the function call results are deterministic, though maintaining that could be a large burden.
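
What I have in mind is a thin record/replay layer around the tool calls, something like this (fetch_live() is a stand-in for the real API call):

```python
# Sketch of record/replay for the agent's API calls so function-call results
# stay deterministic between test runs. fetch_live() is a stand-in for the
# real API call; recordings are keyed by endpoint + params.
import json
from pathlib import Path

RECORDINGS = Path("recordings.json")

def call_api(endpoint: str, params: dict, fetch_live, record: bool = False) -> dict:
    cache = json.loads(RECORDINGS.read_text()) if RECORDINGS.exists() else {}
    key = f"{endpoint}:{json.dumps(params, sort_keys=True)}"
    if not record and key in cache:
        return cache[key]                  # replay the recorded response
    response = fetch_live(endpoint, params)
    cache[key] = response
    RECORDINGS.write_text(json.dumps(cache, indent=2))
    return response
```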

Right now, it feels like I’m looking for:

  1. Response type consistency. Does the agent sometimes hit a fatal error and sometimes reach a correct final answer for the same question? That’s frustrating for an end user.
  2. Answer factual accuracy. Are the answers to the same question providing roughly the same facts? If not, that erodes trust. (Rough sketch of both checks below.)
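
In code terms, that is roughly two checks per question; each run is recorded as ("answer", text) or ("error", message) by the harness, and gpt4_similarity() is the same GPT-4 scoring as above:

```python
# Per-question checks corresponding to the two criteria above. Each run is
# recorded as ("answer", text) or ("error", message) by the test harness.

def response_type_consistent(runs) -> bool:
    # All runs should end the same way: all answers or all errors,
    # never a mix that surprises the end user.
    kinds = {kind for kind, _ in runs}
    return len(kinds) == 1

def factually_consistent(runs, gpt4_similarity, threshold: float = 0.8) -> bool:
    answers = [text for kind, text in runs if kind == "answer"]
    if len(answers) < 2:
        return True                        # nothing to compare
    return gpt4_similarity(answers) >= threshold
```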
1 Like

My approach is to set temperature to 0, then run a question bank with the full set of prompts (system, user, agent).
The response is then run against a set of manually coded regular expressions of “must include all of,” “must include one of,” and “may not include any of” flavors.
Creating the regexes to make sure that they accept good answers but reject bad answers is hard work and somewhat error-prone, but for my application it works well.
(FWIW, the June model releases regressed in a few cases that we caught here, so at least that’s proof that the method delivers some value…)
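
In outline, each test case looks something like this (the patterns here are purely illustrative; the real ones are application-specific):

```python
# Sketch of the regex-based checks: every "must include all of" pattern has to
# match, at least one "must include one of" pattern has to match, and no
# "may not include any of" pattern may match. Patterns are illustrative only.
import re

CASE = {
    "question": "How do I reset my password?",
    "must_include_all": [r"reset", r"password"],
    "must_include_one": [r"settings", r"account page"],
    "may_not_include": [r"I('| a)m not sure", r"as an AI"],
}

def check_response(case: dict, response: str) -> bool:
    ok = all(re.search(p, response, re.IGNORECASE) for p in case["must_include_all"])
    ok = ok and any(re.search(p, response, re.IGNORECASE) for p in case["must_include_one"])
    ok = ok and not any(re.search(p, response, re.IGNORECASE) for p in case["may_not_include"])
    return ok
```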

2 Likes

Thanks @jwatte!

My approach is to set temperature to 0, then run a question bank with the full set of prompts (system, user, agent).

Did you find significant differences between temperature values?

The response is then run against a set of manually coded regular expressions of “must include all of,” “must include one of,” and “may not include any of” flavors.

Got it. Did you try using an LLM to compare the ground truth vs. the answer and return whether the answer is correct?

Creating the regexes to make sure that they accept good answers but reject bad answers is hard work and somewhat error-prone, but for my application it works well.

Agreed. My long-term concern is that the effort to write these tests is too significant, resulting in less breadth in the eval suite.

Yes.

No – the point is to get validation of expectations as expressed in a “ground truth” manner, without separate interpretation by a model.

We get everything the model can do for free, so the total cost of the end solution can be much cheaper than if you start from scratch, but engineering is engineering – there will always be a significant evaluation and testing cost. Gathering requirements, and then ensuring those requirements are met, is the essence.

1 Like