Improving File Search specificity w/ Assistant for accurate document processing

Context:
I’m using an OpenAI Assistant in an automation workflow to analyze and extract key information from various PDF documents which are chronologically correlated. The assistant’s knowledge compounds over time, with each document representing a new development being added to the vector store. However, it primarily needs to focus on processing and extracting information from the latest uploaded document while incorporating relevant information from previous documents in the latest document’s analysis.

Problem:
When asked to analyze and process the latest uploaded file (using file_search tool), the assistant sometimes retrieves information from a previous document instead of the correct, most recently uploaded file. This disrupts the workflow and leads to inaccurate analyses, as it seemingly ignores the latest document and processes a similar previous one.

Current Workflow:
before automation is deployed, a new thread is created which is linked to my assistant with detailed custom instructions. A vector store is also created which gets attached to the thread.

  1. PDF file is uploaded

  2. File is then added to the thread’s vector store with a specific file_id.

  3. New message is created - the file_id is attached to the user message and the message specifies the name of the latest file to retrieve, as seen below:
    Mode: Automation. Please carry out your roles and duties for the latest document: "007. Final XXX letter 3.20.24.pdf". Ensure that you reference and incorporate any relevant information from previous documents in your analysis, if applicable.

  4. create and poll a run, while also requiring file_search tool.

  5. The response is then received in JSON (as per assistant instructions) and then used in subsequent steps of my automation.

Requirement:
The assistant needs to ensure it retrieves and processes the correct file as specified in the user message.

Questions:
My first guess is that it’s not retrieving the correct file due the the fileSearch tool:

“Rewrites user queries to optimize them for search.” (docs)

and some how the filename is getting confused with another file perhaps?

  • Can I specify the file_id in the message (instead of filename) for improved file retrieval precision instead? would that work?

  • What is the best way to ensure accurate file retrieval for a workflow such as this?

  • Is there a better workflow/method to handle such a use case as this?

Any advice or solutions to ensure accurate file retrieval or to improve this workflow would be greatly appreciated.

thank you for your time!

3 Likes

Did you find answers to any of your questions? I’d like to know what you learned as I’m also interested in specifying specific documents for file_search to use. If that’s not possible with the vanilla API I figured I might create one off vector stores with just the file(s) I care about and attach just that store to the assistant when submitting my file-specific prompt

Similar/duplicate question here: Search only a specific file within an attached vector store

Yes correct it is still open and unresolved thread.

@lyon.lay and @gmf can you guys tell me if you found any solution or approach to handle this issue.

The problem I’m facing is that my assistant can have multiple json files and I expect that my assistant categories and generate some metadata from that files based on test, sections and question stem.

but it is getting confused when similar tests come in picture like praxis core and praxis mathematics content so updated system instruction to select based on test provided in user prompt but still not working.

even thought of generating multiple vector stores each for each file and assigning at thread level but thread specific tool resources has restrict to attach maximum 1 thread only…

spent enough time on this issue but still not getting any positive approach can you help or suggest something in it?

with regards,
dharmesh