I’m building an LLM-based application where I already have a custom search engine that returns a list of relevant file IDs based on the user’s query.
What I’d like to do is:
Upload those files to OpenAI, and have the assistant answer based only on that exact list of file IDs: no general retrieval, no full index, just scoped to those specific files.
These files can change dynamically for each query, depending on what my search engine finds.
What I’ve tried:
I explored the file_search and vector_store tools in the Assistants API.
However:
The file_ids argument is no longer supported.
Using the attachments + file_search method creates a new unnamed vector store every time I ask a question.
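Roughly the pattern I mean, as a minimal sketch with the Python SDK (the file IDs are placeholders for what my search engine returns):

```python
from openai import OpenAI

client = OpenAI()

# Attaching files per message with the file_search tool makes the API create
# an unnamed, thread-scoped vector store for them behind the scenes.
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="Answer using only the attached documents.",
    attachments=[
        {"file_id": fid, "tools": [{"type": "file_search"}]}
        for fid in ["file-abc123", "file-def456"]
    ],
)
```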
All items in an attached vector store are searched when using the built-in tool in Responses or Assistants, with no additional filtering parameter available.
But there is a dedicated search endpoint, and that is another way you can go about this: use the vector store search endpoint, which is pay-per-use at the same rate as Responses tool calls.
The facility the standalone endpoint offers is metadata filtering. You can add attributes to a file when adding it to a vector store, or update them on a file already in the store. You might add the file's own ID, or your own database key that you are tracking.
You can then specify filters using those attributes, to exclude certain files or to allow only certain files. You could construct a whole series of "or" clauses, at least until you find out whether there is a limit on the size of the filter you can send.
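A rough sketch of that with the Python SDK; the doc_id key, the store and file IDs, and the values are placeholders for whatever keys you track:

```python
from openai import OpenAI

client = OpenAI()

# Attach your own key as an attribute when adding the file to the store
# (or later, with client.vector_stores.files.update).
client.vector_stores.files.create(
    vector_store_id="vs_123",
    file_id="file-abc123",
    attributes={"doc_id": "my-db-key-42"},
)

# Search only the files your own engine selected, via an "or" of equality
# filters on that attribute.
allowed = ["my-db-key-42", "my-db-key-97"]
results = client.vector_stores.search(
    vector_store_id="vs_123",
    query="the user's question",
    filters={
        "type": "or",
        "filters": [{"type": "eq", "key": "doc_id", "value": v} for v in allowed],
    },
    max_num_results=10,
)
for hit in results:
    print(hit.filename, hit.score)
```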
You can then present those search results to the model automatically via chat messages, or by giving the AI a developer-defined function to call.
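If you take the function route, here is a sketch of a Responses-style tool definition (the names are illustrative, not anything the API requires):

```python
from openai import OpenAI

client = OpenAI()

# The model calls this function; your code handles the call by running the
# vector store search with your doc_id filter and returning the top chunks.
search_tool = {
    "type": "function",
    "name": "search_selected_docs",
    "description": "Search only the documents pre-selected for this query.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search text"},
        },
        "required": ["query"],
    },
}

response = client.responses.create(
    model="gpt-4.1",  # any current model
    input="What do the selected files say about the topic?",
    tools=[search_tool],
)
```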
An advantage is that your application is not hurt by OpenAI jamming up the context with the "the user has uploaded files" messages that come along with file_search.
Yep, this is a solution I had thought of (the vector store with metadata). The problem is that each vector store has a limit of 10k files, and I want to upload 100k. Maybe I can force the request to search across multiple vector stores, filtered by the specific metadata I want.
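Something like this fan-out is what I have in mind (just a sketch; the store IDs and the doc_id attribute are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# One store per partition (each capped at 10k files) plus the keys my
# search engine selected for this query.
store_ids = ["vs_part_1", "vs_part_2", "vs_part_3"]
allowed = ["doc-42", "doc-97"]
user_query = "the user's question"

hits = []
for vs in store_ids:
    page = client.vector_stores.search(
        vector_store_id=vs,
        query=user_query,
        filters={
            "type": "or",
            "filters": [{"type": "eq", "key": "doc_id", "value": v} for v in allowed],
        },
        max_num_results=10,
    )
    hits.extend(page.data)

# Merge the partitions and keep the best-scoring chunks overall.
hits.sort(key=lambda h: h.score, reverse=True)
top_chunks = hits[:10]
```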
Moreover, is it necessary to use threads in this case?
Threads are part of Assistants; previous_response_id is part of the Responses endpoint if you opt for server-side chat state. They only concern how you maintain a user's chat history, and don't inform data retrieval techniques (except by the inconvenience of persisting what should expire).
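For reference, server-side chat state on Responses is just this (the model name is a placeholder):

```python
from openai import OpenAI

client = OpenAI()

first = client.responses.create(model="gpt-4.1", input="First question")
followup = client.responses.create(
    model="gpt-4.1",
    input="Follow-up question",
    previous_response_id=first.id,  # carries the prior turns server-side
)
```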
If you continue to find ways that OpenAI's vector store doesn't fit your needs (its main conveniences being that it doesn't bill upfront for running embeddings on documents and that it does naive extraction and splitting for you), you might want to investigate your own vector store solution.
Getting a query embedding from an embeddings model and running cosine similarity locally with TensorFlow can give you rankings and a top-k faster than a remote API call. Then, considering the massive bulk you might be searching against with little semantic differentiation, you can run a high-quality reranker on the top results.
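A rough sketch of that local ranking step, assuming you already hold precomputed document embeddings in memory (the shapes and k are placeholders):

```python
import tensorflow as tf

# doc_embeddings: (num_docs, dim) precomputed document embeddings
# query_embedding: (dim,) vector from the same embeddings model
doc_embeddings = tf.random.normal((100_000, 1536))  # stand-in data
query_embedding = tf.random.normal((1536,))

# Cosine similarity is the dot product of L2-normalized vectors.
docs_n = tf.math.l2_normalize(doc_embeddings, axis=-1)
query_n = tf.math.l2_normalize(query_embedding, axis=-1)
scores = tf.linalg.matvec(docs_n, query_n)  # (num_docs,)

# Top-k candidates to hand to a high-quality reranker.
top = tf.math.top_k(scores, k=50)
candidate_indices = top.indices.numpy()
```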
Then you can think about your own partitioning techniques.
You want to keep language generation models out of the equation, because they are orders of magnitude more expensive and slower.