When I create an eval using the responses data source, it generates a new output for every item.
In other words, the grader sees the original input and output plus a newly generated output appended to the conversation. This happens whether or not I'm testing a different model. Sometimes the new output is just the original output regurgitated (because that's what the model decided to produce); other times it's a plain-text description of my original structured-output JSON.
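For reference, this is roughly how I'm creating the run via the API (trimmed down; the eval id is a placeholder and the exact data_source/filter field names are from memory, so treat it as a sketch rather than my literal code):

```python
from openai import OpenAI

client = OpenAI()

# Placeholder eval id; the eval itself is configured with a
# responses data_source_config and a single grader.
EVAL_ID = "eval_abc123"

# Kick off a run against already-stored responses. I expected this
# to grade the stored outputs rather than regenerate them.
run = client.evals.runs.create(
    EVAL_ID,
    name="grade-existing-responses",
    data_source={
        "type": "responses",
        "source": {
            "type": "responses",
            # Filter fields are approximate; I'm selecting a batch
            # of stored responses produced by one model.
            "model": "gpt-4.1",
        },
    },
)
print(run.id, run.status)
```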
Looking at the data:
- datasource_item.output has my original JSON
- sample.output contains a regenerated plain-text or JSON version of it
- sample.usage shows 8,295 completion_tokens for each item
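Those field names come from listing the run's output items via the API; this is a trimmed sketch of what I'm looking at, with placeholder ids and approximate attribute access:

```python
from openai import OpenAI

client = OpenAI()

# Placeholder ids; trimmed to just the fields I'm comparing.
items = client.evals.runs.output_items.list(
    "evalrun_abc123",
    eval_id="eval_abc123",
)

for item in items:
    print(item.datasource_item["output"])       # my original stored JSON
    print(item.sample.output)                   # the newly generated output
    print(item.sample.usage.completion_tokens)  # ~8,295 on every item
```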
Is this expected behavior? If so, is there a way to evaluate existing responses without regeneration?
This happens via both the UI and the API.