The goal is to evaluate the system's ability to plan multi-step workflows dynamically. The optimal sequence of actions cannot always be predefined. There are simply too many cases to consider.
Specifically, we want to measure how well the system-generated workflow approximates an ideal or optimal workflow and how effectively it enables the final desired outcome.
Use Case Example:
Consider the scenario of responding to a customer inquiry: "Where is my delivery?"
This involves more than just a straightforward answer. While we may know that "the delivery is late," the actual goal is to provide the sales agent with sufficient, actionable information to appropriately reply to the customer. To achieve this, the system must predict a workflow that gathers and organizes the relevant information.
For instance, addressing "why is the delivery late?" may require several substeps:
Define delivery expectations: Retrieve the agreed delivery terms from the contract.
Check delivery status: Query the ERP system for the relevant data.
Assess context: Review internal and external communications (e.g., between Customer, Sales, Purchasing, Management) for any additional details.
Prioritize information: Identify what is essential, useful, and optional to include in the response.
Structure findings: Organize the data into a coherent and actionable format.
Deliver the information package.
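The substeps above can be written down as an ordered plan, which is the object we would later want to evaluate. The following is only a minimal sketch: the `Step` class, the step names, and the source labels are hypothetical and chosen for illustration, not taken from any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One planned action. All field values below are illustrative assumptions."""
    name: str    # short identifier for the action
    source: str  # where the information is expected to come from


# Hypothetical reference plan for "why is the delivery late?"
IDEAL_PLAN = [
    Step("define_delivery_expectations", "contract"),
    Step("check_delivery_status", "erp"),
    Step("assess_context", "communications"),
    Step("prioritize_information", "collected_findings"),
    Step("structure_findings", "collected_findings"),
    Step("deliver_information_package", "collected_findings"),
]


def step_names(plan):
    """Return only the ordered step names, which is what a plan comparison needs."""
    return [step.name for step in plan]
```

Reducing a plan to an ordered list of names keeps the comparison simple: a predicted plan and a reference plan become two sequences.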
The final "information package" is certainly critical, but to evaluate, debug, and improve the system, it is equally important to assess the workflow it planned.
Did the system identify the correct steps?
Were the steps ordered logically?
Is the planned order of execution efficient?
Did it miss any crucial information?
Could it deliver an acceptable result based on the chosen path?
Yet for this type of evaluation, which compares a predicted sequence of actions to an ideal sequence of actions, we don't have any established eval functions.
But this will be a big thing in the "year of agents".
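As a rough illustration of what such an eval function could look like, here is a minimal sketch that covers some of the questions above. It is not an established metric. It assumes plans are given as lists of step names (the names used here are hypothetical) and uses three simple signals: recall for missed steps, precision for unnecessary steps, and a longest-common-subsequence ratio for ordering.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two step lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def score_plan(predicted, ideal):
    """Compare a predicted plan to a reference plan (both lists of step names)."""
    pred_set, ideal_set = set(predicted), set(ideal)
    common = pred_set & ideal_set
    return {
        # Share of reference steps the system identified
        "recall": len(common) / len(ideal_set) if ideal_set else 1.0,
        # Share of predicted steps that are in the reference plan
        "precision": len(common) / len(pred_set) if pred_set else 0.0,
        # Share of the reference plan reproduced in the right relative order
        "order": lcs_length(predicted, ideal) / len(ideal) if ideal else 1.0,
        # Reference steps the system left out
        "missing": sorted(ideal_set - pred_set),
    }


# Hypothetical plans for the delivery example
ideal = ["contract_terms", "erp_status", "communications",
         "prioritize", "structure", "deliver"]
predicted = ["erp_status", "contract_terms", "prioritize", "structure", "deliver"]

result = score_plan(predicted, ideal)
print(result["missing"])  # ['communications']
```

In this example the predicted plan skips the communications review and swaps the first two steps, so recall is 5/6, precision is 1.0, and the order score is 4/6. Whether the chosen path could still deliver an acceptable result is not captured by these numbers and would need a separate check on the final information package.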