Evals API: file_search return_value always empty — no way to see retrieval results

Overview

I’m running evals with file_search using the Responses data source type. The return_value on every file_search tool call is always the literal string "File search results: " (the 21-character prefix with nothing after it). This holds for every item and every model tested (gpt-5.4 and gpt-4.1), whether retrieval clearly succeeded (10K+ prompt tokens, file citations present) or clearly failed (~3K prompt tokens, no citations). I checked all three surfaces: API output items, dashboard JSONL export, and dashboard UI. All identical.

The Responses API supports include=["output[*].file_search_call.search_results"] to get this data. I tried passing it through the Evals API at sampling_params.include, data_source.include, and the top level; all three were rejected with “Unknown parameter.” The available_includes field on output items is always [].

This makes eval failures very hard to debug: I can’t tell whether file_search executed, what it returned, or whether a question failed because of bad retrieval or bad model reasoning.

Environment

  • Models tested: gpt-5.4, gpt-4.1

  • Eval data source type: responses with file_search tool

  • Vector store: 100+ indexed markdown files, all status completed

  • Reproduced via direct API calls (not SDK-specific)

Reproduction

  1. Create an eval with a responses data source that includes file_search as a tool

  2. Run the eval

  3. Check the results through any of these three methods:

Method 1 — API:

Fetch output items via GET /evals/{eval_id}/runs/{run_id}/output_items. Check sample.output[].tool_calls[].function.return_value.
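A minimal parsing sketch of that check (the endpoint and field path are as above; the helper name is mine, the fetch/auth step is omitted, and the sample dict simply mirrors the observed shape):

```python
def file_search_returns(output_item: dict) -> list[str]:
    """Collect return_value from every tool call in one eval output item,
    walking sample.output[].tool_calls[].function.return_value."""
    values = []
    for message in output_item.get("sample", {}).get("output", []):
        for call in message.get("tool_calls", []):
            values.append(call.get("function", {}).get("return_value", ""))
    return values

# Shape observed on every item, regardless of whether retrieval succeeded:
item = {"sample": {"output": [{"tool_calls": [
    {"function": {"name": "file_search", "return_value": "File search results: "}}
]}]}}
print(file_search_returns(item))  # → ['File search results: ']
```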

Method 2 — Dashboard JSONL export:

Export eval items from the dashboard. Check sample.outputs[].tool_calls[].function.return_value.
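The same check against the export, as a sketch (field path as above; the one-line export string below is a hypothetical stand-in for a real exported file):

```python
import json

def scan_export(jsonl_text: str) -> list[str]:
    """Collect every return_value from a dashboard JSONL export,
    walking sample.outputs[].tool_calls[].function.return_value."""
    found = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        item = json.loads(line)
        for output in item.get("sample", {}).get("outputs", []):
            for call in output.get("tool_calls", []):
                found.append(call.get("function", {}).get("return_value", ""))
    return found

export_line = ('{"sample": {"outputs": [{"tool_calls": '
               '[{"function": {"return_value": "File search results: "}}]}]}}')
print(scan_export(export_line))  # → ['File search results: ']
```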

Method 3 — Dashboard UI:

View the eval run results in the browser.

All three methods return the same thing for every item:

```
"return_value": "File search results: "
```

This is identical for:

  • Items that clearly retrieved documents (10,000+ prompt tokens, file citations in response)

  • Items that retrieved nothing (3,000 prompt tokens, no citations, model says “I don’t have documentation”)

  • Both gpt-5.4 and gpt-4.1

What I expected

I expected the return_value field to contain the file search results (filenames, chunks, scores) — similar to how the Responses API returns them when you set include=["output[*].file_search_call.search_results"].
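For contrast, this is roughly the request body of a direct Responses API call where include is accepted (the model and vector store ID are placeholders; only the include key matters here):

```python
# Request body for a plain Responses API call, where `include` is a
# documented parameter. The Evals API rejects this same key everywhere.
body = {
    "model": "gpt-4.1",
    "input": "Where is retry behavior documented?",
    "tools": [{"type": "file_search", "vector_store_ids": ["vs_PLACEHOLDER"]}],
    "include": ["output[*].file_search_call.search_results"],
}
print(body["include"])
```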

What I tried

Passing include through the Evals API:

I attempted to pass the include parameter at three levels when creating the eval run. All three were explicitly rejected:


```
"data_source.sampling_params.include": ["output[*].file_search_call.search_results"]
  → "Unknown parameter: 'data_source.sampling_params.include'"

"data_source.include": ["output[*].file_search_call.search_results"]
  → "Unknown parameter: 'data_source.include'"

"include": ["output[*].file_search_call.search_results"]
  → "Unknown parameter: 'include'"
```

The Evals API does not accept the include parameter that the Responses API supports.

Checking the dashboard export:

The dashboard JSONL export uses a different structure (trajectory/outputs vs the API’s output) but contains the same empty return_value on every item. No additional file_search data is present anywhere in the export.

Checking the available_includes field:

Each output item returned from the API has an available_includes field. It is always an empty array: "available_includes": [].

Why this matters

Without seeing file_search results, I cannot determine:

  • Whether file_search actually executed or silently failed

  • Whether it returned relevant documents that the model ignored

  • Whether the score threshold filtered out results that were close matches

  • What similarity scores the retrieved chunks had

  • Whether a test failure was caused by bad retrieval vs bad model reasoning

The only indirect signal available is the prompt_tokens count: items with ~10,000 tokens likely had chunks injected into context, while items with ~3,000 tokens (just the system prompt) likely received nothing. But this is a token-count heuristic, not a direct observation of retrieval behavior.
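For what it’s worth, the heuristic I’m reduced to looks like this (the ~3,000-token baseline and the margin are assumptions drawn from my runs, not anything the API reports):

```python
def retrieval_heuristic(prompt_tokens: int,
                        baseline: int = 3_000,
                        margin: int = 2_000) -> str:
    """Rough proxy only: guess whether file_search injected chunks into
    context, based on how far prompt_tokens exceeds the bare system prompt."""
    if prompt_tokens > baseline + margin:
        return "likely retrieved"
    return "likely empty"

print(retrieval_heuristic(10_500))  # → likely retrieved
print(retrieval_heuristic(3_100))   # → likely empty
```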