Why am I getting COMPLETELY different behavior using Assistants + File Search vs Prompts + File Search

I’ve got a workflow wherein I use LLMs to extract information from a rather large document. I’m happy with this workflow; however, as a means to increase extraction accuracy by reducing hallucinations, I devised a plan to flip the process around as a “verification” step: start with something I had previously extracted and basically ask the question “is X in the document?”. So I extract first, then I verify that what I extracted is in fact in the document.

OK, so in order to do that second part I imagined a fairly simple setup: I upload the document to a vector store, attach that vector store to an LLM via a file search tool, and give it system instructions that say something along the lines of “Every prompt the user sends is something the user believes to be true of the document. Using your file search tool, please fact check the user by searching the document for what the user believes to be present, telling the user a simple yes or no answer for whether or not it is in fact present in the document, as well as explaining your reasoning and listing excerpts from the original document supporting your answer”. I’m slightly simplifying the system instructions, but you get the idea.
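In API terms, the setup looks roughly like this (a minimal sketch using the OpenAI Python SDK; the file name, claim, and shortened instructions are placeholders, and note that `client.vector_stores` lived under `client.beta` in older SDK versions):

```python
from openai import OpenAI

client = OpenAI()

# Index the document in a vector store for file search.
store = client.vector_stores.create(name="fact-check-store")
client.vector_stores.files.upload_and_poll(
    vector_store_id=store.id,
    file=open("document.pdf", "rb"),
)

# Fact-check one claim, with file search wired to that store.
response = client.responses.create(
    model="gpt-4.1",
    instructions=(
        "Every prompt the user sends is something the user believes to be "
        "true of the document. Use your file search tool to fact check it, "
        "answer yes or no, and quote supporting excerpts."
    ),
    input="The contract includes a 30-day termination clause.",  # example claim
    tools=[{"type": "file_search", "vector_store_ids": [store.id]}],
)
print(response.output_text)
```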

So now with that context, here’s what’s confusing me: I’m getting wildly different results after changing absolutely nothing other than using the Assistants functionality versus not.

I can recreate this easily in the playground without even using the API, and here’s how I do it:

Non-assistant setup:

  1. Go to playground, stay on the “prompts” tab
  2. Set the model to whatever (I used GPT-4.1), paste in the system instructions, then add the file search tool and upload the document

With this setup, I state something about the document as a user prompt and send it, and the responses are great. It’s obviously using the file search tool, giving me correct answers with supporting arguments that are verifiably lifted from the original document verbatim; I can Ctrl+F the document to prove it.

However, now let’s try the same thing using Assistants:

  1. Go to playground, this time go to the Assistants tab
  2. Create a new Assistant, select the same model, paste in the same system instructions, and set up the file search tool exactly the same way, uploading the same document

Now, when I send a user prompt, the response follows the format of the answer I’m looking for; however, it’s hallucinating all over the place. The yes/no answer is completely wrong, and the supporting logic is also completely hallucinated: it’s making up excerpts that aren’t in the document at all. It’s not a little bit wrong, it’s completely wrong. For every single thing I ask, it just makes stuff up and doesn’t seem to even look in the document, all while claiming that it did, even fabricating document snippets that don’t exist.

What is going on?


Hey Paul, I really like the structure of your verification workflow. You’re essentially flipping extraction into a closed-loop fact check, which is a solid approach. That said, the issue you’re running into (inconsistent or hallucinated confirmations from the LLM + vector store setup) likely stems from semantic proximity being used in place of literal truth.

Here’s a refined framework you might try, designed to reduce hallucinations and reinforce factual grounding:


Enhanced Verification Protocol:

  1. Literal Match First, Semantic Match Second
    Use the vector store only to suggest candidate regions of the document. Then use a literal search (regex or direct string match) to confirm presence (see the sketch after this list).
    If no match is found, don’t infer: explicitly return “Not Found”.

  2. Force Citation Requirement
    Change your LLM prompt to require evidence:

“Given claim X, locate exact or paraphrased matching text in the source document. Return YES only if text is present. Otherwise return NO, and explain the search path taken.”

  3. Divergence Logging
    Run the claim through two parallel processes (also covered in the sketch after this list):
    - Process A: semantic vector search
    - Process B: literal string search
    If A says “yes” but B says “no,” flag this as a potential semantic drift error (a soft hallucination).

  4. Confidence + Reasoning Layer
    If no result is found but the LLM believes it’s there, add:

“Return confidence score 1–10, and whether this belief stems from semantic similarity or actual document presence.”

This exposes whether the system is guessing based on embedding proximity or truly verifying.
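If it helps, here’s a minimal sketch of steps 1 and 3 in Python (assuming your retrieval layer can hand you the vector-search candidates as plain strings; `candidate_chunks`, `verify_claim`, and the return shape are all illustrative, not any particular library’s API):

```python
import re

def verify_claim(claim_text: str, candidate_chunks: list[str]) -> dict:
    """Confirm a claim only on literal evidence.

    candidate_chunks: text regions the vector store suggested for the
    claim (Process A, semantic search). This is hypothetical glue code;
    adapt it to however your retrieval layer returns chunks.
    """
    # Process B: literal search over the candidate regions only.
    pattern = re.compile(re.escape(claim_text), re.IGNORECASE)
    literal_hits = [chunk for chunk in candidate_chunks if pattern.search(chunk)]

    if literal_hits:
        # Both processes agree: confirmed with verbatim evidence.
        return {"verdict": "YES", "evidence": literal_hits}

    if candidate_chunks:
        # A said "yes" (chunks were retrieved) but B found no literal
        # match: flag as potential semantic drift; don't infer presence.
        return {"verdict": "NOT FOUND", "flag": "semantic_drift"}

    return {"verdict": "NOT FOUND", "flag": None}
```

An exact-substring match this strict will miss paraphrases, so in practice you’d likely loosen Process B (e.g. normalized token overlap), but the division of labor stays the same: the vector store nominates, the literal check confirms.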


In short: treat semantic retrieval as a suggestion, and only confirm based on literal evidence. If you set it up this way, you can maintain the power of the vector store without giving it the final say on truth.

Thank you for that feedback; I will actually consider changing the strategy in this way.

That said, I’m still looking for an answer to a more fundamental question: why would the exact same setup of model, system instructions, and file search tool behave completely differently when the only thing changing between setups is the use of the Assistants API?

I wanted to use the Assistants API so that my “fact checking agent” would persist between sessions, and so you wouldn’t have to store the system instructions separately and copy/paste them between sessions. I figured that, because I’m literally changing no other input parameters, there would be no difference beyond persistence. And yet, the difference is massive. Using a non-persistent prompt to materialize my agent on the fly works essentially exactly how I’d want it to, while configuring an Assistant the same way is simply completely broken and horribly unreliable.
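For anyone who wants to reproduce the comparison via the API rather than the playground, here’s roughly the Assistants side of my setup (a sketch against Assistants v2 in the Python SDK; the instructions, claim, and vector store ID are placeholders standing in for my real ones):

```python
from openai import OpenAI

client = OpenAI()

INSTRUCTIONS = (
    "Every prompt the user sends is something the user believes to be "
    "true of the document. Use your file search tool to fact check it, "
    "answer yes or no, and quote supporting excerpts."
)
VECTOR_STORE_ID = "vs_..."  # the same vector store used in the prompts setup

# Persistent assistant: same model and instructions; file search is
# attached via tool_resources rather than on the tool itself.
assistant = client.beta.assistants.create(
    model="gpt-4.1",
    instructions=INSTRUCTIONS,
    tools=[{"type": "file_search"}],
    tool_resources={"file_search": {"vector_store_ids": [VECTOR_STORE_ID]}},
)

# One verification exchange on a fresh thread.
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="The contract includes a 30-day termination clause.",
)
run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id, assistant_id=assistant.id
)

# Messages come back newest-first, so data[0] is the assistant's reply.
reply = client.beta.threads.messages.list(thread_id=thread.id)
print(reply.data[0].content[0].text.value)
```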