The goal is to evaluate the system's ability to plan multi-step workflows dynamically. The optimal sequence of actions cannot always be predefined; there are simply too many cases to consider.
Specifically, we want to measure how well the system-generated workflow approximates an ideal or optimal workflow and how effectively it enables the final desired outcome.
Use Case Example:
Consider the scenario of responding to a customer inquiry: "Where is my delivery?"
This involves more than just a straightforward answer. While we may know that "the delivery is late," the actual goal is to provide the sales agent with sufficient, actionable information to appropriately reply to the customer. To achieve this, the system must predict a workflow that gathers and organizes the relevant information.
For instance, addressing "why is the delivery late?" may require several substeps (see the sketch after this list):
Define delivery expectations: Retrieve the agreed delivery terms from the contract.
Check delivery status: Query the ERP system for the relevant data.
Assess context: Review internal and external communications (e.g., between Customer, Sales, Purchasing, Management) for any additional details.
Prioritize information: Identify what is essential, useful, and optional to include in the response.
Structure findings: Organize the data into a coherent and actionable format.
Deliver the information package.
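To make the comparison concrete, the ideal workflow can be written down as a small reference structure that a predicted plan is later scored against. The sketch below is plain Python; the step identifiers and the depends_on field are hypothetical, chosen only to mirror the substeps above.

```python
# Hypothetical reference ("ideal") workflow for the late-delivery inquiry.
# Each entry names a step and the steps that must be completed before it.
REFERENCE_WORKFLOW = [
    {"id": "define_expectations",    "depends_on": []},  # contract terms
    {"id": "check_delivery_status",  "depends_on": []},  # ERP query
    {"id": "assess_context",         "depends_on": []},  # communications review
    {"id": "prioritize_information", "depends_on": ["define_expectations",
                                                    "check_delivery_status",
                                                    "assess_context"]},
    {"id": "structure_findings",     "depends_on": ["prioritize_information"]},
    {"id": "deliver_package",        "depends_on": ["structure_findings"]},
]
```

Encoding dependencies rather than one fixed order leaves room for equally valid plans that interleave the three information-gathering steps differently.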
The final "information package" is certainly critical, but to evaluate, debug, and improve the system, it is equally important to assess the workflow it planned.
Did the system identify the correct steps?
Were the steps ordered logically?
Is the planned order of execution efficient?
Did it miss any crucial information?
Could it deliver an acceptable result based on the chosen path?
And for this type of evaluation, comparing a predicted sequence of actions to an ideal sequence of actions, we don't have any eval functions yet.
But this will be a big thing in the "year of agents".
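Until such a function exists, one can be sketched by hand. Below is a minimal Python sketch, not tied to any existing eval framework, that scores a predicted step sequence against a reference sequence on two hypothetical metrics: step recall (did it identify the right steps?) and order agreement (were they in a sensible order?). The metric definitions and step names are assumptions for illustration.

```python
from difflib import SequenceMatcher

def workflow_eval(predicted: list[str], reference: list[str]) -> dict:
    """Score a predicted step sequence against a reference (ideal) sequence.

    Hypothetical metrics, for illustration only:
      - step_recall: fraction of reference steps that appear in the plan at all
      - order_score: size of the largest in-order overlap, normalized, so that
        missing or out-of-order steps both lower the score
    """
    predicted_set = set(predicted)
    step_recall = sum(1 for step in reference if step in predicted_set) / len(reference)

    # difflib finds non-crossing matches between the two sequences; summing the
    # matched block sizes gives an in-order overlap (a sketch, not a formal LCS).
    matcher = SequenceMatcher(None, predicted, reference)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    order_score = matched / len(reference)

    return {"step_recall": step_recall, "order_score": order_score}


# Example: a plan that skips the contract check and the prioritization step.
reference = ["define_expectations", "check_delivery_status", "assess_context",
             "prioritize_information", "structure_findings", "deliver_package"]
predicted = ["check_delivery_status", "structure_findings", "deliver_package"]
print(workflow_eval(predicted, reference))  # {'step_recall': 0.5, 'order_score': 0.5}
```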
add custom function calling to see the impact on the eval
add operator (chain of action models) to the evals
add sora to the evals
add search to evals
add video (multiple base64 frames) to evals
add a non-structured data-to-JSONL (data2jsonl) feature (rough sketch after this list)
add text model automation of the entire evals workflow that lets the user see the draft and make edits before running the eval
add sharing and public evals, with a rating system of upvotes and downvotes and a moderation endpoint to make sure each eval follows the guidelines
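On the data2jsonl point mentioned above, here is a rough sketch of what such a helper could look like: wrapping free-form text snippets into JSONL records for an eval. The record schema (input/ideal) and the function name are assumptions, not the platform's actual format.

```python
import json

def data_to_jsonl(snippets: list[str], ideal_answers: list[str], path: str) -> None:
    # Hypothetical helper: turn unstructured text plus expected answers into
    # one JSON object per line, the shape many eval datasets use.
    with open(path, "w", encoding="utf-8") as f:
        for text, ideal in zip(snippets, ideal_answers):
            record = {"input": text.strip(), "ideal": ideal.strip()}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```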
My rationale here is that it would be nice to be able to use other models and to see how function calling impacts an eval. And, in my humble opinion, to get more people to use it, the process has to be almost effortless, frictionless, and not overwhelming.
I think the evals flow, starting at the GitHub repo and then moving to the platform dashboard, is great, and the UI/UX is amazing. Great job, and I hope to see more improvements there. Good luck!