Overview
I’m running evals with file_search using the Responses data source type. The return_value on every file_search tool call is always "File search results: " (the 21-character prefix with nothing appended). This holds for every item and every model (tested gpt-5.4 and gpt-4.1), whether retrieval clearly succeeded (10K+ prompt tokens, file citations present) or clearly failed (3K prompt tokens, no citations). I checked all three surfaces: API output items, dashboard JSONL export, and dashboard UI. All are identical.
The Responses API supports include=["output[*].file_search_call.search_results"] to get this data. I tried passing it through the Evals API at sampling_params.include, data_source.include, and top-level include; all were rejected with “Unknown parameter.” The available_includes field on output items is always [].
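For contrast, this is the shape of a direct Responses API call where include is accepted (a minimal sketch: the vector store ID and question are placeholders, and the name of the populated results field on the output item is my assumption):

```python
import os
import requests

# Sketch: a direct Responses API call with the include parameter that the
# Evals API rejects. Vector store ID and question are placeholders.
resp = requests.post(
    "https://api.openai.com/v1/responses",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-4.1",
        "input": "Where is rate limiting documented?",
        "tools": [{"type": "file_search", "vector_store_ids": ["vs_PLACEHOLDER"]}],
        "include": ["output[*].file_search_call.search_results"],
    },
)
resp.raise_for_status()
for item in resp.json()["output"]:
    if item["type"] == "file_search_call":
        # With include set, the call item carries the retrieved chunks and
        # scores; the exact field name ("results" here) is an assumption.
        print(item.get("results"))
```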
This makes eval failures very difficult to debug. I can’t tell whether file_search executed, what it returned, or whether a given question failed because of a retrieval issue or a model issue.
Environment
- Models tested: gpt-5.4, gpt-4.1
- Eval data source type: responses with the file_search tool
- Vector store: 100+ indexed markdown files, all with status completed
- Reproduced via direct API calls (not SDK-specific)
Reproduction
- Create an eval with a responses data source that includes file_search as a tool
- Run the eval
- Check the results through any of these three methods:
Method 1 — API:
Fetch output items via GET /evals/{eval_id}/runs/{run_id}/output_items. Check sample.output[].tool_calls[].function.return_value. (A fetch sketch follows the three methods.)
Method 2 — Dashboard JSONL export:
Export eval items from the dashboard. Check sample.outputs[].tool_calls[].function.return_value.
Method 3 — Dashboard UI:
View the eval run results in the browser.
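Here’s a minimal sketch of the Method 1 check (the endpoint and field paths are the ones above; the data list envelope and placeholder IDs are my assumptions):

```python
import os
import requests

# Sketch for Method 1: list a run's output items and print each tool call's
# return_value plus the item's available_includes. IDs are placeholders.
eval_id, run_id = "eval_PLACEHOLDER", "evalrun_PLACEHOLDER"
r = requests.get(
    f"https://api.openai.com/v1/evals/{eval_id}/runs/{run_id}/output_items",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
)
r.raise_for_status()
for item in r.json()["data"]:  # assumes the usual list envelope
    for message in item["sample"]["output"]:
        for call in message.get("tool_calls") or []:
            # Always prints 'File search results: ' with nothing appended.
            print(repr(call["function"]["return_value"]))
    print("available_includes:", item.get("available_includes"))
```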
All three methods return the same thing for every item:
"return_value": "File search results: "
This is identical for:
- Items that clearly retrieved documents (10,000+ prompt tokens, file citations in the response)
- Items that retrieved nothing (3,000 prompt tokens, no citations, model says “I don’t have documentation”)
- Both gpt-5.4 and gpt-4.1
What I expected
I expected the return_value field to contain the file search results (filenames, chunks, scores) — similar to how the Responses API returns them when you set include=["output[*].file_search_call.search_results"].
What I tried
Passing include through the Evals API:
I attempted to pass the include parameter at three levels when creating the eval run. All three were explicitly rejected:
"data_source.sampling_params.include": ["output[*].file_search_call.search_results"]
→ "Unknown parameter: 'data_source.sampling_params.include'"
"data_source.include": ["output[*].file_search_call.search_results"]
→ "Unknown parameter: 'data_source.include'"
"include": ["output[*].file_search_call.search_results"]
→ "Unknown parameter: 'include'"
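In payload form, the three placements looked like this (a skeleton of the run-creation body sent to POST /evals/{eval_id}/runs; the model, input, and file_search tool config are elided):

```python
import copy

include = ["output[*].file_search_call.search_results"]

# Skeleton run-creation body; model, input_messages, and the file_search
# tool configuration are elided for brevity.
base = {"data_source": {"type": "responses"}}

# Attempt 1: under sampling_params -> "Unknown parameter"
attempt_1 = copy.deepcopy(base)
attempt_1["data_source"]["sampling_params"] = {"include": include}

# Attempt 2: directly on the data source -> "Unknown parameter"
attempt_2 = copy.deepcopy(base)
attempt_2["data_source"]["include"] = include

# Attempt 3: top level -> "Unknown parameter"
attempt_3 = copy.deepcopy(base)
attempt_3["include"] = include
```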
The Evals API does not accept the include parameter that the Responses API supports.
Checking the dashboard export:
The dashboard JSONL export uses a different structure (trajectory/outputs vs the API’s output) but contains the same empty return_value on every item. No additional file_search data is present anywhere in the export.
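A quick sketch of how I scanned the export (the filename is a placeholder; the field paths follow the export structure described above):

```python
import json

# Scan the dashboard JSONL export for any non-empty return_value.
with open("eval_export.jsonl") as f:
    for line in f:
        item = json.loads(line)
        for output in item.get("sample", {}).get("outputs", []):
            for call in output.get("tool_calls") or []:
                value = call["function"]["return_value"]
                if value != "File search results: ":
                    print("non-empty result:", repr(value))
```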
Checking the available_includes field:
Each output item returned from the API has an available_includes field. It is always an empty array: "available_includes": [].
Why this matters
Without seeing file_search results, I cannot determine:
- Whether file_search actually executed or silently failed
- Whether it returned relevant documents that the model ignored
- Whether the score threshold filtered out results that were close matches
- What similarity scores the retrieved chunks had
- Whether a test failure was caused by bad retrieval vs bad model reasoning
The only indirect signal available is the prompt_tokens count: items with ~10,000 tokens likely had chunks injected into context, while items with ~3,000 tokens (just the system prompt) likely received nothing. But this is a token-count heuristic, not a direct observation of retrieval behavior.
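For completeness, the heuristic I’m reduced to (a sketch; the threshold and the sample.usage.prompt_tokens location are my assumptions):

```python
# Token-count heuristic only: guess whether retrieval injected chunks into
# context. Not a substitute for seeing the actual file_search results.
def likely_retrieved(output_item, threshold=5_000):
    usage = output_item["sample"].get("usage") or {}
    return usage.get("prompt_tokens", 0) >= threshold
```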