When I create an eval using the responses data source, it generates a new output for every item.
In other words, the grader sees the original input and output plus a newly generated output appended to the conversation. This happens whether or not I'm testing a different model. Sometimes the new output is just the original output regurgitated (because that's what the model decided to produce); other times it's a plain-text description of my original structured-output JSON.
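For reference, this is roughly how I'm creating the run via the API (trimmed down; the eval id is a placeholder and the exact data_source/filter field names are from memory, so treat it as a sketch rather than my literal code):

```python
from openai import OpenAI

client = OpenAI()

# Placeholder eval id; the eval itself is configured with a
# responses data_source_config and a single grader.
EVAL_ID = "eval_abc123"

# Kick off a run against already-stored responses. I expected this
# to grade the stored outputs rather than regenerate them.
run = client.evals.runs.create(
    EVAL_ID,
    name="grade-existing-responses",
    data_source={
        "type": "responses",
        "source": {
            "type": "responses",
            # Filter fields are approximate; I'm selecting a batch
            # of stored responses produced by one model.
            "model": "gpt-4.1",
        },
    },
)
print(run.id, run.status)
```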
Looking at the data:
- datasource_item.output has my original JSON
- sample.output contains a regenerated plain-text or JSON version of it
- sample.usage shows 8,295 completion_tokens for each item
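Those field names come from listing the run's output items via the API; this is a trimmed sketch of what I'm looking at, with placeholder ids and approximate attribute access:

```python
from openai import OpenAI

client = OpenAI()

# Placeholder ids; trimmed to just the fields I'm comparing.
items = client.evals.runs.output_items.list(
    "evalrun_abc123",
    eval_id="eval_abc123",
)

for item in items:
    print(item.datasource_item["output"])       # my original stored JSON
    print(item.sample.output)                   # the newly generated output
    print(item.sample.usage.completion_tokens)  # ~8,295 on every item
```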
Is this expected behavior? If so, is there a way to evaluate existing responses without regeneration?
This happens via both the UI and the API.