What's your process for automated testing for AI agents?

I’m working on an AI agent capable of reasoning, using external tools, and providing a final answer based on the tool output. The agent uses function calling as well.

I’ve gotten to the point where I’ve developed an initial automated testing procedure:

  1. Iterate over a question bank (questions I expect the agent to answer), asking each question 3 times (I want to judge the similarity of the responses).

  2. Count how many questions get answers on all 3 runs (100% answers), answers on more than half of the runs (> 50% answers), or errors on all 3 runs (100% errors).

  3. Calculate a similarity score (using GPT-4) for each set of responses to a question. I want this to be high so that answers are consistent. A rough sketch of the whole loop is below.
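
Roughly, the loop looks like this (ask_agent() and gpt4_similarity() are stand-ins for the real agent call and the GPT-4 similarity-scoring prompt):

```python
# Rough sketch of the test loop; ask_agent() and gpt4_similarity() stand in
# for the real agent call and the GPT-4 similarity-scoring prompt.
from collections import Counter

RUNS_PER_QUESTION = 3

def run_eval(question_bank, ask_agent, gpt4_similarity):
    summary = Counter()
    similarity_scores = {}
    for question in question_bank:
        responses, errors = [], 0
        for _ in range(RUNS_PER_QUESTION):
            try:
                responses.append(ask_agent(question))
            except Exception:
                errors += 1
        answered = len(responses)
        if answered == RUNS_PER_QUESTION:
            summary["100% answers"] += 1
        if answered > RUNS_PER_QUESTION / 2:
            summary["> 50% answers"] += 1
        if errors == RUNS_PER_QUESTION:
            summary["100% errors"] += 1
        if answered >= 2:
            # Ask GPT-4 to rate how similar the set of responses is (0-1).
            similarity_scores[question] = gpt4_similarity(responses)
    return summary, similarity_scores
```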

Curious how others are testing their agents to ensure end users experience some consistency?

Below is some example output for a single test run:

2 Likes

I have a much more unit-test-driven approach. I usually send my AI through several queries before the user gets a final answer: does the user’s question violate policy, what strategy should we use to deal with the customer, and so on. Then I aggregate those responses into a prompt for the main agent. I only run checks on the steps up until the final aggregated response (rough sketch below).

I haven’t really found a way to test the final aggregated response, but I like your approach and I’ll probably incorporate at least part of it in some way. In general, though, my models have been pretty stable if you test and stabilize the originating responses.
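
A minimal sketch of the shape of it, assuming a call_llm() helper that wraps the chat API; only the intermediate steps get asserted on:

```python
# Minimal sketch, assuming a call_llm(prompt) helper that wraps the chat API.
# Only the intermediate steps are asserted on; the final answer is not tested.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your chat completion call")

def violates_policy(user_message: str) -> bool:
    verdict = call_llm(f"Does this message violate policy? Answer yes or no.\n{user_message}")
    return verdict.strip().lower().startswith("yes")

def suggest_strategy(user_message: str) -> str:
    return call_llm(f"Suggest a strategy for handling this customer message:\n{user_message}")

def build_main_prompt(user_message: str) -> str:
    # Aggregate the intermediate results into the prompt for the main agent.
    return (
        f"Policy violation: {violates_policy(user_message)}\n"
        f"Strategy: {suggest_strategy(user_message)}\n"
        f"Customer message: {user_message}"
    )

# pytest-style checks on the steps up to (but not including) the final answer
def test_policy_step_allows_refund_question():
    assert violates_policy("How do I get a refund?") is False

def test_strategy_step_mentions_refund():
    assert "refund" in suggest_strategy("How do I get a refund?").lower()
```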

2 Likes

Thanks - that’s an interesting approach! I need to think through whether unit tests make sense in my scenario. Obviously pass/fail would be easier.

1 Like

For a chatbot, I have a long Jupyter notebook with chat examples that are analogous to your question bank.

Asserts are there to test for regressions, and this works really well to test function calling. It’s also helpful to scan through the actual chats to see if any language looks awkward.
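
A typical cell looks roughly like this (run_chat() is a stand-in for my wrapper around the agent, and the expected tool call comes from that particular example):

```python
# Typical notebook cell: assert that a chat example triggers the expected
# function call. run_chat() is a stand-in for my wrapper around the agent
# (defined in an earlier cell); result is whatever that wrapper returns.
result = run_chat("What's the weather in Paris tomorrow?")

tool_call = result.tool_calls[0]
assert tool_call.name == "get_weather"
assert tool_call.arguments["city"] == "Paris"

# Then eyeball the final text for awkward wording.
print(result.final_text)
```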

But this is just now starting to get unwieldy. I’ll need to make some changes for proper CI soon.

I also have a few unit tests with the LLM calls mocked. Not super valuable, but they were easy to write with ChatGPT.
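
They look something like this, with agent.call_llm() and agent.answer_question() standing in for my actual wrapper module:

```python
# One of the mocked unit tests: the LLM call is patched out, so this only
# exercises my own plumbing. agent.answer_question() and agent.call_llm()
# are hypothetical names for my wrapper module.
from unittest.mock import patch

from agent import answer_question

def test_answer_question_passes_tool_output_through():
    with patch("agent.call_llm", return_value="Your balance is $42."):
        answer = answer_question("What is my balance?")
    assert "$42" in answer
```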

It is a good idea to use GPT-4 to calculate similarity among the response set. I would save the Q&A pairs in a database, so the agent searches the Q&A DB first. However, making sure the correct answer is saved is another challenge; I don’t think similar responses necessarily mean they are correct answers.
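
As a sketch of the caching idea (SQLite as an example store, exact-match lookup only; in practice you would probably want embedding or fuzzy matching):

```python
# Sketch of the Q&A cache idea using SQLite; exact-match lookup shown here,
# though fuzzy or embedding-based matching would be more realistic.
import sqlite3

conn = sqlite3.connect("qa_cache.db")
conn.execute("CREATE TABLE IF NOT EXISTS qa (question TEXT PRIMARY KEY, answer TEXT)")

def answer_with_cache(question: str, ask_agent) -> str:
    row = conn.execute(
        "SELECT answer FROM qa WHERE question = ?", (question,)
    ).fetchone()
    if row:
        return row[0]                      # serve the saved (hopefully correct) answer
    answer = ask_agent(question)
    conn.execute("INSERT OR REPLACE INTO qa VALUES (?, ?)", (question, answer))
    conn.commit()
    return answer
```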

Great point. I’m testing against real-time data right now, so the responses could change. I could record the API calls the agent makes (there are quite a few) so the function call results are deterministic, though maintaining that could be a large burden.
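
What I have in mind is a thin record/replay layer around the tool calls, something like this (fetch_live() is a stand-in for the real API call):

```python
# Sketch of record/replay for the agent's API calls so function-call results
# stay deterministic between test runs. fetch_live() is a stand-in for the
# real API call; recordings are keyed by endpoint + params.
import json
from pathlib import Path

RECORDINGS = Path("recordings.json")

def call_api(endpoint: str, params: dict, fetch_live, record: bool = False) -> dict:
    cache = json.loads(RECORDINGS.read_text()) if RECORDINGS.exists() else {}
    key = f"{endpoint}:{json.dumps(params, sort_keys=True)}"
    if not record and key in cache:
        return cache[key]                  # replay the recorded response
    response = fetch_live(endpoint, params)
    cache[key] = response
    RECORDINGS.write_text(json.dumps(cache, indent=2))
    return response
```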

Right now, it feels like I’m looking for:

  1. Response type consistency. Does the agent sometimes hit a fatal error and sometimes reach a correct final answer for the same question? That’s frustrating for an end user.
  2. Answer factual accuracy. Are the answers to the same question providing roughly the same facts? If not, that erodes trust. (Rough sketch of both checks below.)
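
In code terms, that is roughly two checks per question; each run is recorded as ("answer", text) or ("error", message) by the harness, and gpt4_similarity() is the same GPT-4 scoring as above:

```python
# Per-question checks corresponding to the two criteria above. Each run is
# recorded as ("answer", text) or ("error", message) by the test harness.

def response_type_consistent(runs) -> bool:
    # All runs should end the same way: all answers or all errors,
    # never a mix that surprises the end user.
    kinds = {kind for kind, _ in runs}
    return len(kinds) == 1

def factually_consistent(runs, gpt4_similarity, threshold: float = 0.8) -> bool:
    answers = [text for kind, text in runs if kind == "answer"]
    if len(answers) < 2:
        return True                        # nothing to compare
    return gpt4_similarity(answers) >= threshold
```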
1 Like

My approach is to set temperature to 0, then run a question bank with the full set of prompts (system, user, agent).
The response is then run against a set of manually coded regular expressions of “must include all of,” “must include one of,” and “may not include any of” flavors.
Creating the regexes to make sure that they accept good answers but reject bad answers is hard work and somewhat error-prone, but for my application it works well.
(FWIW, the June model releases regressed in a few cases that we caught here, so at least that’s proof that the method delivers some value…)
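
In outline, each test case looks something like this (the patterns here are purely illustrative; the real ones are application-specific):

```python
# Sketch of the regex-based checks: every "must include all of" pattern has to
# match, at least one "must include one of" pattern has to match, and no
# "may not include any of" pattern may match. Patterns are illustrative only.
import re

CASE = {
    "question": "How do I reset my password?",
    "must_include_all": [r"reset", r"password"],
    "must_include_one": [r"settings", r"account page"],
    "may_not_include": [r"I('| a)m not sure", r"as an AI"],
}

def check_response(case: dict, response: str) -> bool:
    ok = all(re.search(p, response, re.IGNORECASE) for p in case["must_include_all"])
    ok = ok and any(re.search(p, response, re.IGNORECASE) for p in case["must_include_one"])
    ok = ok and not any(re.search(p, response, re.IGNORECASE) for p in case["may_not_include"])
    return ok
```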

2 Likes

Thanks @jwatte!

My approach is to set temperature to 0, then run a question bank with the full set of prompts (system, user, agent).

Did you find significant differences between temperature values?

The response is then run against a set of manually coded regular expressions of “must include all of,” “must include one of,” and “may not include any of” flavors.

Got it. Did you try using an LLM to compare the ground truth vs. the answer and return whether the answer is correct?

Creating the regexes to make sure that they accept good answers but reject bad answers is hard work and somewhat error-prone, but for my application it works well.

Agreed. My long-term concern is that the effort to write these tests is too significant, resulting in less breadth in the eval suite.

Yes.

No – the point is to get validation of expectations as expressed in a “ground truth” manner, without separate interpretation by a model.

We get everything the model can do for free, so the total cost of the end solution can be much cheaper than if you start from scratch, but engineering is engineering – there will always be a significant evaluation and testing cost. Gathering requirements, and then ensuring those requirements are met, is the essence.

1 Like