Responses API file_search: later tool-call ranking/score seems biased by earlier topic even with “one-topic-per-call”

I’m building a hotel guest-support (concierge) system using the Responses API with a Vector Store + built-in file_search tool.

What I’m trying to do

Guests often ask multiple unrelated questions in a single message (e.g., kids co-sleeping + parking).

To avoid retrieval failures, I updated my prompt so that the model:

• identifies topics
• runs one file_search tool call per topic
• never mixes multiple topics into the same file_search call/query

This part works: the logs confirm that the 2nd tool call's query is parking-only.

The issue

Even though the 2nd file_search call query is only about parking, the returned hits + scoring/ranking appear to be biased toward the earlier topic (kids co-sleeping). In the logs:

• Tool call #1 (kids) returns correct hits with a normal-ish score (e.g., ~0.6+)
• Tool call #2 (parking-only query) often returns co-sleeping-related QA as higher-ranked hits, and the correct parking QA appears only with a very low score (e.g., ~0.10) — sometimes barely passing the score_threshold, sometimes not.

Because of this, my assistant sometimes fails to answer the parking question and escalates to staff, even though the FAQ contains the correct parking information.

This looks like the ranking/score for later tool calls is not purely based on the explicit query string of that call, or the “score” is not comparable/meaningful in the way I assumed.
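For context, here is roughly how I'm pulling per-call scores out of the logs (a hedged sketch: field names follow what I see with `include=["file_search_call.results"]` and may not match the current API shape exactly; the sample data below is illustrative, not a real log):

```python
# Hedged sketch: summarize (queries, top hits with scores) per file_search
# tool call from the response's output items. Field names assumed from my logs.
def summarize_file_search_calls(output_items):
    summaries = []
    for item in output_items:
        if item.get("type") != "file_search_call":
            continue
        hits = [(r["filename"], round(r["score"], 2))
                for r in item.get("results") or []]
        summaries.append({"queries": item.get("queries"), "hits": hits})
    return summaries

# Illustrative data shaped like my logs: call #2's query is parking-only,
# yet the co-sleeping QA still outranks the parking QA.
sample = [{
    "type": "file_search_call",
    "queries": ["parking before check-in"],
    "results": [
        {"filename": "faq_cosleeping.md", "score": 0.52},
        {"filename": "faq_parking.md", "score": 0.10},
    ],
}]
print(summarize_file_search_calls(sample))
```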

Why this matters

My goal is:

If I split topics into separate file_search calls, each topic should be retrieved/scored independently.

Right now, multi-topic messages are still fragile because later-topic retrieval seems to “inherit” relevance from earlier-topic content.

Current setup (simplified)

• Model: gpt-5.2 (also tried the dated variant shown in logs)
• tool_choice: “auto”
• max_num_results: 5
• Prompt forces one topic per tool call (no mixing)
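For reference, the request I'm sending looks roughly like this (sketched as a plain payload; the vector store ID is a placeholder and the guest message is shortened):

```python
# Simplified Responses API request payload with the built-in file_search tool.
# "vs_XXXX" is a placeholder vector store ID.
request = {
    "model": "gpt-5.2",
    "tool_choice": "auto",
    "tools": [{
        "type": "file_search",
        "vector_store_ids": ["vs_XXXX"],  # placeholder
        "max_num_results": 5,
    }],
    "input": "Guest message containing multiple unrelated topics...",
}
print(request["tools"][0])
```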

Example guest message (originally in Japanese; translated):

“I forgot to select two kids in the reservation (they will co-sleep). Also, can I use the parking lot before check-in on the day of arrival?”

Prompt snippet (tool policy)

• Mixed topics (STRICT): If the guest message contains multiple topics/questions, handle them one topic at a time. Run file_search as a separate tool call per topic (do not mix multiple topics in the same file_search call).

What I expected

• Tool call #2 query = parking-only
• Tool call #2 results should be parking QAs at the top, with a score that reflects that query only.

What I observe

• Tool call #2 query is parking-only (confirmed in logs), but results/ranking often show co-sleeping QA above the parking QA, and parking appears with very low score.

Questions

1. Is file_search ranking/score supposed to be independent per tool call?
Or can the tool’s query rewriting / reranking incorporate prior tool calls or broader conversation context?

2. What exactly does the score shown in the Responses logs represent?
Is it a per-call similarity score, a post-rerank score, or something not meant to be compared across calls?

3. Is there a way (with built-in file_search) to disable query rewriting or reranking so that each call behaves more like a “pure” query→retrieval ranking?
(I’m aware the lower-level vector_stores.search endpoint exposes more explicit ranking controls, but I’d like to keep the simpler built-in tool if possible.)
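For completeness, the fallback I'm considering, sketched as a request body for the lower-level search endpoint (parameter names are my reading of the API reference; in particular I'm assuming `rewrite_query: false` disables server-side query rewriting — please correct me if not):

```python
# Hedged sketch of a vector_stores.search request body (fallback option).
# Assumption: rewrite_query=False disables server-side query rewriting.
search_body = {
    "query": "Can guests use the parking lot before check-in?",
    "max_num_results": 5,
    "rewrite_query": False,
    "ranking_options": {
        "ranker": "auto",
        "score_threshold": 0.1,
    },
}
# Would be sent as POST /v1/vector_stores/{vector_store_id}/search,
# i.e. client.vector_stores.search(vector_store_id=..., **search_body)
# in the Python SDK (as I understand it).
print(search_body)
```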