Improving File Search specificity w/ Assistant for accurate document processing

Context:
I’m using an OpenAI Assistant in an automation workflow to analyze and extract key information from various PDF documents which are chronologically correlated. The assistant’s knowledge compounds over time, with each document representing a new development being added to the vector store. However, it primarily needs to focus on processing and extracting information from the latest uploaded document while incorporating relevant information from previous documents in the latest document’s analysis.

Problem:
When asked to analyze and process the latest uploaded file (using file_search tool), the assistant sometimes retrieves information from a previous document instead of the correct, most recently uploaded file. This disrupts the workflow and leads to inaccurate analyses, as it seemingly ignores the latest document and processes a similar previous one.

Current Workflow:
before automation is deployed, a new thread is created which is linked to my assistant with detailed custom instructions. A vector store is also created which gets attached to the thread.

  1. PDF file is uploaded

  2. File is then added to the thread’s vector store with a specific file_id.

  3. New message is created - the file_id is attached to the user message and the message specifies the name of the latest file to retrieve, as seen below:
    Mode: Automation. Please carry out your roles and duties for the latest document: "007. Final XXX letter 3.20.24.pdf". Ensure that you reference and incorporate any relevant information from previous documents in your analysis, if applicable.

  4. create and poll a run, while also requiring file_search tool.

  5. The response is then received in JSON (as per assistant instructions) and then used in subsequent steps of my automation.

Requirement:
The assistant needs to ensure it retrieves and processes the correct file as specified in the user message.

Questions:
My first guess is that it’s not retrieving the correct file due the the fileSearch tool:

“Rewrites user queries to optimize them for search.” (docs)

and some how the filename is getting confused with another file perhaps?

  • Can I specify the file_id in the message (instead of filename) for improved file retrieval precision instead? would that work?

  • What is the best way to ensure accurate file retrieval for a workflow such as this?

  • Is there a better workflow/method to handle such a use case as this?

Any advice or solutions to ensure accurate file retrieval or to improve this workflow would be greatly appreciated.

thank you for your time!

3 Likes

Did you find answers to any of your questions? I’d like to know what you learned as I’m also interested in specifying specific documents for file_search to use. If that’s not possible with the vanilla API I figured I might create one off vector stores with just the file(s) I care about and attach just that store to the assistant when submitting my file-specific prompt

Similar/duplicate question here: Search only a specific file within an attached vector store