OpenAI Assistants V2 - File_Search tool

Hi. With the arrival of OpenAI Assistants V2, I started implementing a use case where my vector store contains 100+ PDFs. I attached the vector store to the assistant and tried a query that spans 3 files, for example: “compare the earnings of Google, Microsoft and Apple for the year 2023”. The file_search tool always brings back a response for the first company (in this case Google) but says there are no files for Microsoft and Apple, even though all the files are available in the vector store. I tried adding instructions to search all the files thoroughly, but it doesn’t work and the responses are not consistent. Is this related to the top_k results it is trying to fetch, or is there a parameter I can provide to improve the search results? Also, in most cases the citations are incorrect. Can anyone help me with this?

Can someone help with this? The issue is still open and I have been unable to solve it.

You have relatively few ways of affecting the results of file_search. The AI writes a query, it gets back what it gets back.

100 files on the same topic is basically going to be mass confusion. Semantic search on large non-contextual chunks of overlapping text with no additional metadata is going to be rolling the dice.

Additionally, the document extraction only gets searchable text. There is no attempt to extract from images or image-based PDF documents.

Use a smart model. Perhaps the best one that supports retrieval, and thus file_search, would be gpt-4-1106-preview, although all the “turbos” pale compared to real GPT-4. That lets you see whether the problem is the model or the search results.

You can now affect the chunking of newly-added documents, giving a token count per chunk, and the amount of overlap. You can also alter the count of returns from the default of 20.

That would take careful consideration of the files: how much data per chunk is needed for a chunk to be “semantic” while still focused. Mixing documents with different chunk sizes in one search could also have applications, but it could go significantly wrong, especially because the chunk count is fixed and there is no relevance threshold to adjust, so the amount of context that gets loaded is unpredictable.
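To make the chunking and result-count settings concrete, here is a minimal sketch that only builds the request payloads. The helper names `static_chunking` and `file_search_tool` are mine, not part of any SDK, and the limits they check (chunk size 100 to 4096 tokens, overlap at most half the chunk size, 1 to 50 results) are what I understand the API reference to say, so verify them against the current docs before relying on them.

```python
from typing import Any


def static_chunking(max_chunk_size_tokens: int = 800,
                    chunk_overlap_tokens: int = 400) -> dict[str, Any]:
    """Build a static chunking_strategy payload (hypothetical helper).

    The range checks mirror the limits as I understand them from the docs.
    """
    if not 100 <= max_chunk_size_tokens <= 4096:
        raise ValueError("max_chunk_size_tokens must be between 100 and 4096")
    if chunk_overlap_tokens < 0 or chunk_overlap_tokens * 2 > max_chunk_size_tokens:
        raise ValueError("chunk_overlap_tokens must be at most half the chunk size")
    return {
        "type": "static",
        "static": {
            "max_chunk_size_tokens": max_chunk_size_tokens,
            "chunk_overlap_tokens": chunk_overlap_tokens,
        },
    }


def file_search_tool(max_num_results: int = 20) -> dict[str, Any]:
    """Build a file_search tool entry with a custom result count (hypothetical helper)."""
    if not 1 <= max_num_results <= 50:
        raise ValueError("max_num_results must be between 1 and 50")
    return {"type": "file_search", "file_search": {"max_num_results": max_num_results}}


# Assumed usage with the openai Python SDK; the exact namespace
# (client.beta.vector_stores vs. client.vector_stores) depends on the SDK version:
#
#   client.beta.vector_stores.files.create(
#       vector_store_id=vector_store_id,
#       file_id=file_id,
#       chunking_strategy=static_chunking(400, 100),
#   )
#   client.beta.assistants.update(assistant_id, tools=[file_search_tool(8)])
```

Note that the chunking strategy only applies to files added after you set it; files already in the vector store keep the chunks they were given.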

TL;DR: you have a limited and generic search feature the AI must want to use and for which you pay twice for the AI context. Maybe OpenAI’s acquisition of a RAG company will trickle-down.

Hi,
Thank you for the response. I am using gpt-4o as the model, and there are 100+ files in the vector store, all text-based PDFs. My understanding is that OpenAI’s file_search can search across 10,000 files and bring back good results. So out of 100+ files, I am trying to search only 3 or 4 by providing the file names in single quotes and asking to compare earnings, income, etc. It brings back results for 1 or 2 companies, but not across all 3 or 4 files. I added the instruction at the run. At the same time, today I tried the same thing in the OpenAI playground and got results for all 5 files mentioned. The same assistant is being used in the code as well. What could be the issue?

The file search does not accept file names as a search parameter. The AI can only write a query. This is performed against both assistant and thread vector stores, including user message attachments.

We can only assume that this is logically an exhaustive search across all chunks, using the semantic similarity of text-embedding-3-large embeddings at 256 dimensions as documented, and returning the top results, perhaps ordered by rank. Logic, however, would dictate not returning 20000 tokens of Aunt Millie’s pickle recipes that you uploaded when you are talking about penguins, but that’s currently how it works.
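To illustrate why a fixed result count with no threshold pulls in unrelated text, here is a toy sketch of top-k retrieval by cosine similarity. It is not OpenAI's implementation; the 3-dimensional vectors are made up for the example, and the `min_score` cutoff is something you could add in your own search, not a parameter described above.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunks, k=20, min_score=None):
    """Return up to k (score, text) pairs, best first.

    With min_score=None this always fills the k slots when enough chunks
    exist, however irrelevant the tail is.
    """
    scored = [(cosine(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    if min_score is not None:
        scored = [pair for pair in scored if pair[0] >= min_score]
    return scored[:k]


# Made-up embeddings: two penguin chunks and one unrelated recipe chunk.
chunks = [
    ("penguin habitat", [0.9, 0.1, 0.0]),
    ("penguin diet", [0.8, 0.2, 0.1]),
    ("pickle recipe", [0.0, 0.1, 0.9]),
]
query = [1.0, 0.0, 0.0]  # pretend this is the embedded penguin question

no_threshold = top_k(query, chunks, k=3)
with_threshold = top_k(query, chunks, k=3, min_score=0.5)
print([text for _, text in no_threshold])
print([text for _, text in with_threshold])
```

Without the cutoff the recipe chunk is returned simply because there is a slot left to fill; with it, only the two penguin chunks survive.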

OK. What could be an alternative to this? Would custom embeddings in a separate vector store with custom metadata work?

Also, I chose OpenAI Assistants V2 because it is stated that it can perform file_search across 10,000 files. The responses were inconsistent here as well.

Certainly you can develop your own solution.

Retrieval-augmented generation (RAG) is instead typically an automatic injection of knowledge based on the user input and its context, not something your chat AI writes. AI can transform that input context to be more like the documentation by writing hypothetical answers to embed for search, or you can AI-augment each chunk with more information, like total summaries, or typical questions that could be answered from that section.
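As a rough sketch of the chunk-augmentation idea, the text you send for embedding can carry the document title, section, a summary, and sample questions in front of the raw chunk. Everything here is hypothetical: the `Chunk` fields, the `embedding_input` helper, and the "ExampleCo" placeholder data. The summary and questions would come from whatever AI pass you run over each chunk.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """One chunk plus the context you want the embedding to capture."""
    doc_title: str
    section: str
    text: str
    summary: str = ""
    questions: list[str] = field(default_factory=list)


def embedding_input(chunk: Chunk) -> str:
    """Compose the string to embed: metadata first, raw chunk text last."""
    parts = [f"Document: {chunk.doc_title}", f"Section: {chunk.section}"]
    if chunk.summary:
        parts.append(f"Summary: {chunk.summary}")
    if chunk.questions:
        parts.append("Answers questions such as: " + " | ".join(chunk.questions))
    parts.append(chunk.text)
    return "\n".join(parts)


# Placeholder data only; no real figures are implied.
chunk = Chunk(
    doc_title="ExampleCo 2023 annual report",
    section="Consolidated statements of income",
    text="(raw extracted text of this section goes here)",
    summary="Income statement section of the ExampleCo 2023 report.",
    questions=["What were ExampleCo's 2023 earnings?"],
)
print(embedding_input(chunk))
```

Because the company name and year are now part of every embedded chunk, a query about one company is less likely to match a near-identical section from another company's report.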

You can start with enhancing how documents are chunked, what kind of searchable hierarchy they have included as metadata when sent for embeddings, etc. You can even just give an AI a menu of documents to explore by tool functions, and let it read them, as you or your user seems to have knowledge of the file repository to direct the AI.
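The "menu of documents" approach could look something like the sketch below: two function tools the AI can call, plus a local dispatcher that answers them. The tool names `list_documents` and `read_document`, the in-memory `DOCUMENTS` dict, and the file names are all made up for illustration; in practice the dict would be your own extracted text.

```python
import json

# Hypothetical repository: file name -> extracted plain text.
DOCUMENTS = {
    "exampleco_2023.txt": "ExampleCo 2023 annual report text (placeholder)",
    "samplecorp_2023.txt": "SampleCorp 2023 annual report text (placeholder)",
}

# Function-tool definitions in the JSON-schema shape used for function calling.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "list_documents",
            "description": "List the names of all available documents.",
            "parameters": {"type": "object", "properties": {}},
        },
    },
    {
        "type": "function",
        "function": {
            "name": "read_document",
            "description": "Read part of one document by exact name.",
            "parameters": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "offset": {"type": "integer"},
                    "length": {"type": "integer"},
                },
                "required": ["name"],
            },
        },
    },
]


def dispatch(name: str, arguments_json: str) -> str:
    """Run one tool call locally and return its result as a JSON string."""
    args = json.loads(arguments_json or "{}")
    if name == "list_documents":
        return json.dumps(sorted(DOCUMENTS))
    if name == "read_document":
        doc = DOCUMENTS.get(args["name"])
        if doc is None:
            return json.dumps({"error": f"unknown document {args['name']!r}"})
        offset = int(args.get("offset", 0))
        length = int(args.get("length", 2000))
        return json.dumps({"name": args["name"], "text": doc[offset:offset + length]})
    return json.dumps({"error": f"unknown tool {name!r}"})


print(dispatch("list_documents", "{}"))
print(dispatch("read_document", '{"name": "exampleco_2023.txt", "length": 9}'))
```

Unlike file_search, this lets the AI ask for a specific file by name, which is what a "compare these three companies" request actually needs.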

PDF files are a very poor repository of knowledge to start from, in my opinion, and it would be better to find where they originate and get text into a form the AI can understand, or do your own document extraction with which you have the opportunity to examine quality.

Hmm… I will give you a sample of how providing file names against a vector store fetched results with OpenAI:
image
The response covering all 5 companies looks like the following in the playground. I am trying some POCs to check how OpenAI Assistants V2 works.
The only thing is that it does not behave like this when tried from code. I will look for alternatives.
image

I will give you a verbatim reproduction of the v2 file_search tool text the AI is provided with, containing the single method the AI has, so you have just as much understanding as the AI does.

## Tools

## myfiles_browser

You have the tool `myfiles_browser` with these functions:
`msearch(queries: list[str])` Issues multiple queries to a search over the file(s) uploaded in the current conversation and displays the results.
please render in this format: `【{message idx}†{link text}】`

Tool for browsing the files uploaded by the user.

Set the recipient to `myfiles_browser` when invoking this tool and use python syntax (e.g. msearch(['query'])). "Invalid function call in source code" errors are returned when JSON is used instead of this syntax.

Parts of the documents uploaded by users will be automatically included in the conversation. Only use this tool, when the relevant parts don't contain the necessary information to fulfill the user's request.

Issue multiple queries to the msearch command only when the user's question needs to be decomposed to find different facts. In other scenarios, prefer providing a single query. Avoid single word queries that are extremely broad and will return unrelated results.


Here are some examples of how to use the msearch command:
User: What was the GDP of France and Italy in the 1970s? => msearch(["france gdp 1970", "italy gdp 1970"])
User: What does the report say about the GPT4 performance on MMLU? => msearch(["GPT4 MMLU performance"])
User: How can I integrate customer relationship management system with third-party email marketing tools? => msearch(["customer management system marketing integration"])
User: What are the best practices for data security and privacy for our cloud storage services? => msearch(["cloud storage security and privacy"])



Please provide citations for your answers and render them in the following format: `【{message idx}:{search idx}†{link text}】`.

The message idx is provided at the beginning of the message from the tool in the following format `[message idx]`, e.g. [3].
The search index should be extracted from the search results, e.g. # 【13†Paris†4f4915f6-2a0b-4eb5-85d1-352e00c125bb】refers to the 13th search result, which comes from a document titled "Paris" with ID 4f4915f6-2a0b-4eb5-85d1-352e00c125bb.
For this example, a valid citation would be `【3:13†Paris】`.

All 3 parts of the citation are REQUIRED.
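Since incorrect citations were part of the original complaint, here is a small sketch for pulling raw citation markers of the form described above out of a response so you can check them against what was actually retrieved. The `parse_citations` helper is mine. As far as I know, the Assistants API normally delivers citations as message annotations, so this only matters if you see the raw markers in the text.

```python
import re
from typing import NamedTuple

# Matches markers like 【3:13†Paris】 (message idx : search idx † link text).
CITATION = re.compile(r"【(\d+):(\d+)†([^】]+)】")


class Citation(NamedTuple):
    message_idx: int
    search_idx: int
    link_text: str


def parse_citations(text: str) -> list[Citation]:
    """Extract every complete three-part citation marker from the text."""
    return [Citation(int(m), int(s), t) for m, s, t in CITATION.findall(text)]


sample = "The capital is described in the report【3:13†Paris】."
print(parse_citations(sample))
```

A marker missing any of the three parts, such as the two-part `【13†Paris】` form, is not matched, which makes malformed citations easy to spot.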

This was captured immediately after the AI issued a search and then responded, to ensure that if there were any dynamic version of the tool text, it would be loaded (a trick OpenAI uses on other tools, like ChatGPT’s retrieval and browser, to dump rules on the AI that they try to hide from you).

The AI receives the name of the file each chunk came from, but it cannot “ask” for a particular file.

Another important aspect: The AI initially has no knowledge of what the search has behind it and doesn’t know when it should use the search.

OK. I will try to find alternatives.

It would help if you provided the script you want help with.