Evals product in Playground - Announcement and feedback

OpenAI has released/updated Evals in the Playground.

https://platform.openai.com/docs/guides/evals

Watch the video; it is worth the 5 minutes.

As Kevin noted at the end, they are asking for feedback. Please post it here unless an OpenAI staff member creates an official topic.

9 Likes

I’ll monitor this thread 🙂
There’s a very good chance feedback here makes it into the product, so please share it!

6 Likes

Thanks for sharing the video and the link!

We have had recurring requests to evaluate multi-turn conversations and agentic workflows from an end-to-end perspective.

If this could make it into the final product, that would be fantastic.

5 Likes

@vb, can you give me a sketch of what you’d want to evaluate here?
It would be best to describe your use case and what you would want to evaluate.

1 Like

The goal is to evaluate the system’s ability to plan multi-step workflows dynamically. The optimal sequence of actions cannot always be predefined. There are simply too many cases to consider.
Specifically, we want to measure how well the system-generated workflow approximates an ideal or optimal workflow and how effectively it enables the final desired outcome.

Use Case Example:

Consider the scenario of responding to a customer inquiry: “Where is my delivery?”
This involves more than just a straightforward answer. While we may know that “the delivery is late,” the actual goal is to provide the sales agent with sufficient, actionable information to appropriately reply to the customer. To achieve this, the system must predict a workflow that gathers and organizes the relevant information.

For instance, addressing “why is the delivery late?” may require several substeps (a rough data sketch follows the list):

  1. Define delivery expectations: Retrieve the agreed delivery terms from the contract.

  2. Check delivery status: Query the ERP system for the relevant data.

  3. Assess context: Review internal and external communications (e.g., between Customer, Sales, Purchasing, Management) for any additional details.

  4. Prioritize information: Identify what is essential, useful, and optional to include in the response.

  5. Structure findings: Organize the data into a coherent and actionable format.

  6. Deliver the information package.
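
To make that concrete, an ideal workflow like the one above could be written down as ordered reference data that an eval can compare the system’s plan against. This is only a rough sketch; the step identifiers below are hypothetical and not part of any existing Evals API.

```python
# Hypothetical reference (ideal) workflow for "Where is my delivery?".
# The step identifiers are illustrative only, not an existing Evals schema.
REFERENCE_WORKFLOW = [
    "retrieve_contract_delivery_terms",   # 1. define delivery expectations
    "query_erp_delivery_status",          # 2. check delivery status
    "review_internal_external_comms",     # 3. assess context
    "prioritize_information",             # 4. prioritize information
    "structure_findings",                 # 5. structure findings
    "deliver_information_package",        # 6. deliver the information package
]
```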

The final “information package” is certainly critical, but to evaluate, debug and improve the system, it is equally important to assess the workflow it planned.

  • Did the system identify the correct steps?
  • Were the steps ordered logically?
  • Is the planned order of execution efficient?
  • Did it miss any crucial information?
  • Could it deliver an acceptable result based on the chosen path?

For this type of evaluation, comparing a predicted sequence of actions to an ideal sequence of actions, we don’t have any eval functions today.
But this will be a big thing in the ‘year of agents’.
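
To illustrate what such an eval function might look like, here is a minimal sketch that grades a predicted step sequence against an ideal one on two of the questions above: step coverage (did it miss anything?) and order agreement (are the steps in a sensible relative order?), using a simple sequence-similarity ratio from Python’s difflib. This is a hand-rolled assumption of how workflow grading could work, not a feature of the current Evals product; the step identifiers are the same hypothetical ones as in the sketch above.

```python
from difflib import SequenceMatcher


def grade_workflow(predicted: list[str], reference: list[str]) -> dict:
    """Score a predicted step sequence against an ideal reference sequence.

    coverage:        fraction of reference steps that appear in the prediction
    missed_steps:    reference steps the system never planned
    order_agreement: SequenceMatcher ratio in [0, 1]; rewards shared steps
                     appearing in the same relative order
    """
    coverage = (
        sum(1 for step in reference if step in predicted) / len(reference)
        if reference else 1.0
    )
    order_agreement = SequenceMatcher(None, predicted, reference).ratio()
    return {
        "coverage": coverage,
        "missed_steps": [s for s in reference if s not in predicted],
        "order_agreement": order_agreement,
    }


# Hypothetical example: the system skipped the communications review
# and swapped the first two steps.
reference = [
    "retrieve_contract_delivery_terms",
    "query_erp_delivery_status",
    "review_internal_external_comms",
    "prioritize_information",
    "structure_findings",
    "deliver_information_package",
]
predicted = [
    "query_erp_delivery_status",
    "retrieve_contract_delivery_terms",
    "prioritize_information",
    "structure_findings",
    "deliver_information_package",
]
print(grade_workflow(predicted, reference))
```

A custom grader along these lines could also weight steps by importance or score partial credit for acceptable alternative orderings; the point is simply that the reference is a sequence, not a single expected answer.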