The goal is to evaluate the system's ability to plan multi-step workflows dynamically. The optimal sequence of actions cannot always be predefined; there are simply too many cases to consider.
Specifically, we want to measure how well the system-generated workflow approximates an ideal or optimal workflow and how effectively it enables the final desired outcome.
Use Case Example:
Consider the scenario of responding to a customer inquiry: "Where is my delivery?"
This involves more than just a straightforward answer. While we may know that "the delivery is late," the actual goal is to provide the sales agent with sufficient, actionable information to appropriately reply to the customer. To achieve this, the system must predict a workflow that gathers and organizes the relevant information.
For instance, addressing "why is the delivery late?" may require several substeps (see the sketch after this list):
Define delivery expectations: Retrieve the agreed delivery terms from the contract.
Check delivery status: Query the ERP system for the relevant data.
Assess context: Review internal and external communications (e.g., between Customer, Sales, Purchasing, Management) for any additional details.
Prioritize information: Identify what is essential, useful, and optional to include in the response.
Structure findings: Organize the data into a coherent and actionable format.
Deliver the information package.
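To make the comparison concrete, the ideal workflow can be written down as a small reference structure that a predicted plan is later scored against. The sketch below is plain Python; the step identifiers and the depends_on field are hypothetical, chosen only to mirror the substeps above.

```python
# Hypothetical reference ("ideal") workflow for the late-delivery inquiry.
# Each entry names a step and the steps that must be completed before it.
REFERENCE_WORKFLOW = [
    {"id": "define_expectations",    "depends_on": []},  # contract terms
    {"id": "check_delivery_status",  "depends_on": []},  # ERP query
    {"id": "assess_context",         "depends_on": []},  # communications review
    {"id": "prioritize_information", "depends_on": ["define_expectations",
                                                    "check_delivery_status",
                                                    "assess_context"]},
    {"id": "structure_findings",     "depends_on": ["prioritize_information"]},
    {"id": "deliver_package",        "depends_on": ["structure_findings"]},
]
```

Encoding dependencies rather than one fixed order leaves room for equally valid plans that interleave the three information-gathering steps differently.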
The final "information package" is certainly critical, but to evaluate, debug, and improve the system, it is equally important to assess the workflow it planned.
Did the system identify the correct steps?
Were the steps ordered logically?
Is the planned order of execution efficient?
Did it miss any crucial information?
Could it deliver an acceptable result based on the chosen path?
And for this type of evaluation, comparing a predicted sequence of actions to an ideal sequence of actions, we don't have any eval functions yet.
But this will be a big thing in the "year of agents".
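Until such a function exists, one can be sketched by hand. Below is a minimal Python sketch, not tied to any existing eval framework, that scores a predicted step sequence against a reference sequence on two hypothetical metrics: step recall (did it identify the right steps?) and order agreement (were they in a sensible order?). The metric definitions and step names are assumptions for illustration.

```python
from difflib import SequenceMatcher

def workflow_eval(predicted: list[str], reference: list[str]) -> dict:
    """Score a predicted step sequence against a reference (ideal) sequence.

    Hypothetical metrics, for illustration only:
      - step_recall: fraction of reference steps that appear in the plan at all
      - order_score: size of the largest in-order overlap, normalized, so that
        missing or out-of-order steps both lower the score
    """
    predicted_set = set(predicted)
    step_recall = sum(1 for step in reference if step in predicted_set) / len(reference)

    # difflib finds non-crossing matches between the two sequences; summing the
    # matched block sizes gives an in-order overlap (a sketch, not a formal LCS).
    matcher = SequenceMatcher(None, predicted, reference)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    order_score = matched / len(reference)

    return {"step_recall": step_recall, "order_score": order_score}


# Example: a plan that skips the contract check and the prioritization step.
reference = ["define_expectations", "check_delivery_status", "assess_context",
             "prioritize_information", "structure_findings", "deliver_package"]
predicted = ["check_delivery_status", "structure_findings", "deliver_package"]
print(workflow_eval(predicted, reference))  # {'step_recall': 0.5, 'order_score': 0.5}
```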
add custom function calling to see the impact on the eval
add operator (chain of action models) to the evals
add sora to the evals
add search to evals
add video (multiple base64 frames) to evals
add a non-structured data-to-JSONL (data2jsonl) feature (rough sketch after this list)
add text model automation of the entire evals workflow that lets the user see the draft and make edits before running the eval
add sharing and public evals, with a rating system of upvotes and downvotes and a moderation endpoint to make sure each eval follows the guidelines
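On the data2jsonl point mentioned above, here is a rough sketch of what such a helper could look like: wrapping free-form text snippets into JSONL records for an eval. The record schema (input/ideal) and the function name are assumptions, not the platform's actual format.

```python
import json

def data_to_jsonl(snippets: list[str], ideal_answers: list[str], path: str) -> None:
    # Hypothetical helper: turn unstructured text plus expected answers into
    # one JSON object per line, the shape many eval datasets use.
    with open(path, "w", encoding="utf-8") as f:
        for text, ideal in zip(snippets, ideal_answers):
            record = {"input": text.strip(), "ideal": ideal.strip()}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```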
My rationale here is that it would be nice to be able to use other models and to see how function calling impacts an eval. And, in my humble opinion, to get more people to use it, the process has to be almost effortless, frictionless, and not overwhelming.
I think the evals flow, starting at the GitHub repo and then moving to the platform dashboard, is great, and the UI/UX is amazing. Great job, and I hope to see more improvements there. Good luck!