Evaluating AI Agents - thoughts on this flow?

I’m getting closer to productizing an AI Agent (OpsTower.ai auto-troubleshoots incidents on AWS). Below I share my evaluation process and the results - would love to hear how others are doing the same and any suggestions to make this better/easier!

Question Dataset

I created a set of 40 questions that fetch information on an account’s AWS resources and ask for statistics based on Cloudwatch metrics. To generate the ground truth for each question, I ran each question through the agent and either verified the answer was correct or adjusted the generated AWS SDK code and reasoning to generate the correct answer.

The questions can be separated into 3 categories:

  1. Information gathering (63%) - fetch information on an account’s AWS resources. For example, “What are the names of our ec2 instances?”

  2. Calculations (25%) - perform calculations, generally using Cloudwatch metrics. For example, “What is the average CPU utilization for each of our ec2 instances?”

  3. Reasoning (12%) - require some thought, usually combining calculations and information gathering. For example, “Is one ec2 instance doing significantly more work than the others?”
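
As a rough sketch of how such a dataset could be laid out, each question can carry its category and an account-specific ground truth. The `Question` struct, the category symbols, and the placeholder ground-truth strings below are assumptions for illustration; only the three example questions come from the list above.

```ruby
# Hypothetical record layout for one entry in the question dataset.
Question = Struct.new(:text, :category, :ground_truth, keyword_init: true)

CATEGORIES = %i[information_gathering calculation reasoning].freeze

# The question text comes from the examples above; the ground-truth strings
# are placeholders, since real answers are specific to one AWS account.
DATASET = [
  Question.new(text: "What are the names of our ec2 instances?",
               category: :information_gathering,
               ground_truth: "<account-specific answer>"),
  Question.new(text: "What is the average CPU utilization for each of our ec2 instances?",
               category: :calculation,
               ground_truth: "<account-specific answer>"),
  Question.new(text: "Is one ec2 instance doing significantly more work than the others?",
               category: :reasoning,
               ground_truth: "<account-specific answer>")
].freeze

# Share of questions per category, rounded to whole percentages.
def category_breakdown(questions)
  counts = questions.group_by(&:category).transform_values(&:size)
  CATEGORIES.to_h do |category|
    [category, (100.0 * counts.fetch(category, 0) / questions.size).round]
  end
end
```

Calling `category_breakdown` on the full dataset would give the per-category split quoted above.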

Here are the questions I used.

Reasoning is the smallest category as we’re focusing on creating a baseline of knowledge.

Evaluation metrics

My evaluation metrics for the agent are heavily inspired by those used in ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models. I used the following metrics to evaluate our agent:

  1. Accuracy (%) - the percentage of questions answered correctly. I used a “Human in the Loop” process to measure accuracy. See Quantifying accuracy below.

  2. Duration (mean time to answer a question in seconds) - the time it took the agent to answer the question, from when it first started responding until its last message.

  3. Tokens (mean per question) - the total number of tokens required to answer a question.

  4. Messages (mean per question) - the total number of messages (API calls for completions or function calls).

  5. Cost (mean per question) - the total OpenAI GPT-4 cost to answer the question.

Accuracy and duration give us a parallel to how an actual SRE might perform over the same set of questions. Counting tokens, messages, and cost gives us an understanding of the efficiency of the agent.
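
A minimal sketch of how the five metrics could be aggregated from per-question results. The `Session` struct and its field names are assumptions, not the agent's actual data model.

```ruby
# Hypothetical per-question result record; the field names are assumptions.
Session = Struct.new(:correct, :duration_s, :tokens, :messages, :cost_usd,
                     keyword_init: true)

# Arithmetic mean as a float.
def mean(values)
  values.sum.to_f / values.size
end

# Roll a list of sessions up into the five metrics described above.
def summarize(sessions)
  raise ArgumentError, "no sessions to summarize" if sessions.empty?

  {
    accuracy_pct: 100.0 * sessions.count(&:correct) / sessions.size,
    mean_duration_s: mean(sessions.map(&:duration_s)),
    mean_tokens: mean(sessions.map(&:tokens)),
    mean_messages: mean(sessions.map(&:messages)),
    mean_cost_usd: mean(sessions.map(&:cost_usd))
  }
end
```

Keeping the raw per-question records around, rather than only the means, makes it possible to slice the same numbers by question or by category later.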

I used a representative AWS account that was using most of the popular AWS services, including heavy usage of EC2. I then iterated over the question dataset, reviewing the results of each agent evaluation session.

Quantifying Accuracy

It’s difficult to rely on character-level F1 and exact match scores for two reasons:

  1. LLMs may provide multiple variations of correct answers.

  2. Our ground truth results were generated prior to executing the question and some realtime data values can change significantly.

To judge accuracy, I used a “Human in the Loop” process:

  1. Provide an AWS-account-specific text ground truth answer to each question.

  2. Run the question through the agent.

  3. Execute this prompt, which asks GPT-4 to return true or false depending on whether the question is correctly answered based on the ground truth. For additional context when it decides the result is false, I ask it to explain why.

  4. Have a human review the false results.

  5. The human categorizes false results (see Most common causes of inaccuracy), including categorizing false negatives.
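
The steps above can be sketched roughly as follows. This is not the prompt or code used for the evaluation: the prompt wording, the `Verdict` struct, and the `judge` callable (anything that takes a prompt string and returns the model's raw reply) are assumptions, and no real OpenAI client call is shown.

```ruby
# Hypothetical result of judging one question.
Verdict = Struct.new(:question, :correct, :explanation, keyword_init: true)

# Build the grading prompt. The wording is illustrative only.
def judge_prompt(question, ground_truth, agent_answer)
  <<~PROMPT
    Question: #{question}
    Ground truth: #{ground_truth}
    Agent answer: #{agent_answer}
    Reply "true" if the agent answer agrees with the ground truth.
    Otherwise reply "false: " followed by a short explanation.
  PROMPT
end

# Anything that does not start with "true" is treated as false, so an
# unparseable reply ends up in front of a human instead of being trusted.
def parse_verdict(question, reply)
  text = reply.strip
  if text.downcase.start_with?("true")
    Verdict.new(question: question, correct: true, explanation: nil)
  else
    Verdict.new(question: question, correct: false,
                explanation: text.sub(/\Afalse[:\s]*/i, ""))
  end
end

# Judge every item, then return all verdicts plus the ones needing human review.
def evaluate(items, judge)
  verdicts = items.map do |item|
    prompt = judge_prompt(item[:question], item[:ground_truth], item[:agent_answer])
    parse_verdict(item[:question], judge.call(prompt))
  end
  [verdicts, verdicts.reject(&:correct)]
end
```

Passing the judge in as a callable keeps the flow testable with a stub, and only the second return value (the false results) needs to reach the human reviewer.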

Below is a screenshot of the UI I used to review the result. This shows the question, accuracy (true or false), the ground truth answer, why the answer was evaluated as false, the human’s categorization of the false result, and the agent’s answer.

I did not quantify false positives, as the positive results I sampled were correct and these were time-consuming to review. See Accounting for incorrect logic but correct answers below for more details.


The gpt-4-backed agent improves an SRE’s investigation workflow. With accuracy over 90% on the question dataset and less than a minute to answer each question, the agent allows human SREs to focus on more complex and strategic tasks while it retrieves information and performs calculations.

gpt-3.5-turbo is not a good fit for our agent. The accuracy is far too low and the variety of failure reasons is too wide to consider using this model. Using gpt-3.5-turbo, I saw many hallucinations, completions that didn’t adhere to provided examples, malformed function calls, and function calls to functions that don’t exist.


Most common causes of inaccuracy

I categorized incorrect answers into the following categories:

  1. Bad code - the agent returned code that was syntactically incorrect or had a logic error.

  2. Hallucination - the agent returned an answer based on hallucinated information.

  3. Poor AWS knowledge - the agent returned code that was syntactically correct but did not return the correct answer due to improper understanding of the AWS SDK. For example, misunderstanding the meaning of a Cloudwatch metric.

Below is a chart showing the most common causes of inaccuracy in gpt-4-backed agents:

Lack of AWS SDK knowledge is by far the most common cause of inaccuracy. I did not provide any additional tools to help the agent better understand the AWS SDK, including not providing a definitive list of Cloudwatch metrics and their descriptions. It’s likely this can be improved significantly by augmenting GPT-4’s knowledge of the AWS SDK.

GPT-4 generates syntactically correct code. I did not see any fatal errors due to GPT-4 being unable to generate syntactically correct Ruby code. However, it does make occasional logic errors.

Questions with the lowest answer accuracy

When the agent was incorrect, the errors were concentrated in a few questions. 80% of questions had answer accuracy scores above 90%.

These were the most frequently incorrect questions:

  1. “What is our largest s3 object in MB?” - 33% accuracy. The cause: a logic error. The generated code stores the largest object size in MB, then compares each object size (which is in bytes) to the stored largest size (in MB).

  2. “Were there any Lambda invocations that lasted over 30 seconds in the last day?” - 33% accuracy. The cause: assumed that the metric was in seconds when it was in milliseconds.

  3. “What is the percent Database capacity usage for each elasticcache instance?” - 67% accuracy. The cause: struggled to pick the correct Cloudwatch metric.
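
The first two failures are both unit mix-ups. Below is a minimal sketch of unit-safe versions of those two checks, working on plain arrays rather than real AWS SDK responses. The helper names are hypothetical, and treating 1 MB as 1024 * 1024 bytes is an assumption.

```ruby
# Assumption: 1 MB is taken as 1024 * 1024 bytes.
BYTES_PER_MB = 1024.0 * 1024.0

# Find the maximum while everything is still in bytes, and convert to MB
# only once at the end, so bytes are never compared against MB.
def largest_object_mb(sizes_in_bytes)
  return nil if sizes_in_bytes.empty?

  sizes_in_bytes.max / BYTES_PER_MB
end

# The Lambda duration values are in milliseconds, so convert the threshold
# from seconds to milliseconds before comparing.
def any_over_threshold?(durations_ms, threshold_seconds)
  threshold_ms = threshold_seconds * 1000
  durations_ms.any? { |duration| duration > threshold_ms }
end
```

Putting the unit in the parameter name (`sizes_in_bytes`, `durations_ms`) is a cheap way to make this kind of mismatch visible when reading generated code.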

Most common intermittent errors

The agent is able to recover from most errors, including situations like the model returning invalid function call arguments, OpenAI API errors, and code interpreter errors. The recovery generally repeats the prior prompt but with information about the previous error.
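
A rough sketch of that recovery pattern, assuming the step being retried is wrapped in a callable. `run_with_recovery`, `attempt`, and the wording appended to the prompt are all hypothetical and not the agent's real orchestration code.

```ruby
# Retry a step, feeding the previous error back into the prompt.
# `attempt` is any callable that takes a prompt and either returns a result
# or raises. After `max_attempts` failures the last error is re-raised.
def run_with_recovery(prompt, attempt, max_attempts: 3)
  current_prompt = prompt
  tries = 0
  begin
    tries += 1
    attempt.call(current_prompt)
  rescue StandardError => e
    raise if tries >= max_attempts

    # Repeat the original prompt, plus information about what went wrong.
    current_prompt = "#{prompt}\n\nThe previous attempt failed with: #{e.class}: #{e.message}"
    retry
  end
end
```

Capping the number of attempts matters here, since every retry adds to the token, message, and cost metrics for that question.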

The most common intermittent error category was AWS SDK Code Interpreter evaluation errors. This accounted for 67% of the errors. The code was generally syntactically correct, but was susceptible to runtime errors such as calls to guessed method names that don’t exist and missing safeguards for unexpected nil objects.

Limitations and future work

While I believe the agent performs better than a human SRE for the information retrieval and calculation questions in the dataset, there are limitations in the existing foundational architecture that hinder its ability to do more advanced work:

  1. Lack of AWS SDK knowledge - while GPT-4 clearly has knowledge of the AWS SDK, it appears to lack a definitive list of Cloudwatch metrics and their descriptions. Augmenting the agent with a vector database of Cloudwatch metrics, their descriptions, and their units would likely improve accuracy.

  2. Exposure to more analysis tools - the current agent has access to only a single tool, the AWS SDK. Adding more tools to enhance analysis (see the ChatGPT code interpreter and its use of data science libraries) could significantly improve the agent’s ability to explore and reason about data.

  3. Exposure to more data sources - the current agent has access to only a single data source, the AWS SDK. Adding more data sources - like monitoring and logging product APIs - could allow the agent to see more of the picture and reason about it. However, as covered in the ReWOO paper, adding more tools can actually reduce accuracy. This needs to be done carefully with continual evaluation.

  4. Easier and more robust automated evaluation - our current evaluation required a human to review each question and determine if the answer was correct for the evaluation account. It’s time-consuming to evaluate against more accounts and add questions with ground truth for each account. Automating this process where possible will make the agent more robust and allow for more rapid iteration.

  5. Accounting for incorrect logic but correct answers - it’s possible for the agent to generate code that has incorrect logic but returns the correct answer. For example, if you ask “how many s3 buckets have cloudtrail enabled?”, but you have no S3 buckets, the agent may generate code that checks if server access logging (and not CloudTrail logging) is enabled and it would return the same answer. I did not account for this in the evaluation.
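
The last limitation can be illustrated with a small sketch. The `Bucket` struct and its two boolean flags are hypothetical stand-ins for the two different AWS checks, not real SDK objects.

```ruby
# Hypothetical bucket record; each flag stands in for a separate AWS check.
Bucket = Struct.new(:name, :cloudtrail_enabled, :access_logging_enabled,
                    keyword_init: true)

# Intended logic: count buckets with CloudTrail logging enabled.
def count_with_cloudtrail(buckets)
  buckets.count(&:cloudtrail_enabled)
end

# Incorrect logic: counts buckets with server access logging enabled instead.
def count_with_access_logging(buckets)
  buckets.count(&:access_logging_enabled)
end

# True when this set of buckets makes the two implementations disagree,
# which is what an evaluation account needs in order to expose the wrong one.
def distinguishes?(buckets)
  count_with_cloudtrail(buckets) != count_with_access_logging(buckets)
end
```

With no buckets, both functions return 0 and `distinguishes?` is false, so the wrong logic passes. One way to catch it, assuming the evaluation account can be seeded, would be to include at least one bucket where the two settings differ.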


Using two prompts, a customized agent orchestration layer, and a local code interpreter, I was able to answer AWS-related information retrieval and calculation questions with high accuracy and in a timely manner. This frees up an SRE to focus on more complex and strategic tasks.