Good question: the file_search tool and its vector store back-end are offered without much guidance about the applications they are actually useful for.
That said, the application you are describing is a bit unclear, and it sounds like this might not be the right solution for you.
The way it works:
- Documents are uploaded to the file storage.
- Then you create a vector store. One of its parameters is the chunk size: how many tokens each piece of the extracted document text is split into. The default is 800 tokens, roughly 500 words per chunk.
- Including the file_search tool in an Assistants or Responses endpoint API call makes that search available for the AI to invoke with its own search queries.
- The AI doesn’t know what kind of knowledge is actually in the vector store unless you explain when file search is useful and what results to expect.
- OpenAI uses language in multiple places indicating that it is the user who “uploaded files”, and that it is the user’s knowledge in file search, not the developer’s.
- The AI can call the tool with a search query, such as “company address”.
- A semantic search ranks the chunks by similarity and returns the top results as a block of text in the tool result.
- The AI can then treat this like retrieved knowledge and answer from information found across documents.
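The pipeline above can be sketched in plain Python. This is a toy model only: real file search uses OpenAI's embedding model for semantic similarity, while the word-overlap score and function names here are made up for illustration.

```python
def chunk_text(text, max_words=500):
    """Split extracted document text into fixed-size chunks,
    standing in for the vector store's token-based chunking
    (default 800 tokens, roughly 500 words)."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def score(query, chunk):
    """Toy relevance score: fraction of query words present in the chunk.
    A real vector store ranks by embedding similarity instead."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

def file_search(query, documents, top_k=3):
    """Rank all chunks across all documents and return the top results
    as one block of text, the way the tool result reaches the model."""
    chunks = [c for doc in documents for c in chunk_text(doc)]
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    return "\n---\n".join(ranked[:top_k])

print(file_search("company address",
                  ["Our company address is 1 Main St. We ship worldwide."]))
```

Note that ranking happens across *all* chunks of *all* documents in the store: nothing ties a query to one particular file.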
Here are the limitations:
- a maximum of 16000 tokens will be returned, regardless of how large you make the chunks or how many top results you specify
- the parameter for maximum results simply doesn’t work: you get 20 chunks even if you only wanted to pay for 5 per call
- the results are raw sections of documents, drawn from every document in the vector store; only similarity ranking brings any one document to prominence
- tabular data, such as JSON or Excel files, is not a supported file type, and it simply will not work well. A slice from the middle of a JSON file, or Excel data extracted without its headings, makes a poor chunk, and there is no “search-like” quality when a chunk holds 20-50 key/value pairs
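The 16000-token cap means the tool result is effectively assembled like this sketch: ranked chunks are appended until the budget is spent, and everything past the cap is dropped. The 4-characters-per-token estimate is a common rule of thumb, not OpenAI's actual tokenizer.

```python
def build_tool_result(ranked_chunks, budget_tokens=16000):
    """Append ranked chunks until the token budget is spent;
    later results are dropped no matter how many you asked for."""
    used, kept = 0, []
    for chunk in ranked_chunks:
        est_tokens = len(chunk) // 4  # rough 4-chars-per-token estimate
        if used + est_tokens > budget_tokens:
            break
        used += est_tokens
        kept.append(chunk)
    return kept

# Three ~10000-token chunks: only the first fits under the 16k budget.
print(len(build_tool_result(["a" * 40000, "b" * 40000, "c" * 40000])))
```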
So: if you have small, discrete pieces of knowledge, like customer data or fine-grained records that must be returned individually, you would need to build your own function that acts more like a SQL query, with parameters that drill down into the kind of data you offer.
With your own search function, you can provide a good description of how to use it and what it returns, and expose multiple query fields for the AI to fill in: names, date ranges, and other metadata. You can also budget how much you want to return to the AI.
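A minimal sketch of such a function is below. The tool name, the field names like `name` and `date_from`, and the record shape are all hypothetical, invented for this example; the schema dict is the kind of JSON-schema tool definition you would hand to the model.

```python
import json

# Hypothetical tool definition the AI sees: a description plus typed query fields.
SEARCH_TOOL = {
    "name": "search_customers",
    "description": ("Look up customer records by name and/or signup date range. "
                    "Returns at most max_results matching records as JSON."),
    "parameters": {
        "type": "object",
        "properties": {
            "name": {"type": "string", "description": "Partial customer name"},
            "date_from": {"type": "string", "description": "ISO date, inclusive"},
            "date_to": {"type": "string", "description": "ISO date, inclusive"},
            "max_results": {"type": "integer", "default": 5},
        },
    },
}

def search_customers(records, name=None, date_from=None, date_to=None,
                     max_results=5):
    """SQL-like filtering over your own data, with a result budget
    you control (unlike file search's fixed chunk return)."""
    hits = []
    for r in records:
        if name and name.lower() not in r["name"].lower():
            continue
        if date_from and r["signup"] < date_from:  # ISO dates sort as strings
            continue
        if date_to and r["signup"] > date_to:
            continue
        hits.append(r)
    return json.dumps(hits[:max_results])
```

The payoff is precision: the AI fills in typed parameters instead of a fuzzy query, individual records come back whole, and `max_results` actually limits what you pay to place in context.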
Although your question about the file search tool likely has a bigger question behind it, such as “how do I build this application?”, I hope this clarifies what OpenAI offers: a generic knowledge base built from readable documents.