Assistant cannot search the whole file using file search

I am using Assistants v2 from both the playground and the APIs. I am using the File Search tool by uploading a file with some sample customer and order data. When I ask the assistant to find information about customers, it retrieves only partial results and seems unable to read the whole file. For example, I uploaded a JSON file with 10K customers. When I ask it to retrieve grouped information, analyze the data, or return the total number of records, it answers with a wrong number or a partial result. Is anyone else encountering this issue?


Hi!

That is somewhat to be expected, given the way semantic search works.

The text is broken into quite large pieces, 800 tokens each, about a full page of information. Then the AI writes a search query. It is not a traditional keyword search; instead it looks for meaning and similarity. The information is no longer a document, it is chunks, possibly without even a table header to make sense of the values.
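
To make the failure mode concrete, here is a rough sketch of what chunked retrieval means for an aggregate question (the 800-token figure matches the documented default chunk size; the toy data, the word-based token estimate, and the fixed top-k cutoff are simplifications of my own):

```python
import json

# Toy stand-in for the uploaded file: 10K customer records as one JSON blob.
customers = [{"id": i, "name": f"Customer {i}"} for i in range(10_000)]
text = json.dumps(customers)

CHUNK_TOKENS = 800       # default file_search chunk size
TOKENS_PER_WORD = 1.3    # rough heuristic; real chunking uses a tokenizer

words = text.split()
per_chunk = int(CHUNK_TOKENS / TOKENS_PER_WORD)
chunks = [" ".join(words[i:i + per_chunk])
          for i in range(0, len(words), per_chunk)]
print(f"{len(chunks)} chunks for this one file")

# Only a top-scoring subset of chunks (on the order of 20) is placed in
# context per query, so "how many customers are there?" gets answered
# from a small slice of the data, hence the wrong or partial counts.
TOP_K = 20
print(f"the model sees roughly {TOP_K / len(chunks):.0%} of the file per query")
```

Counting or grouping across the whole file would need every chunk in context at once, which retrieval never provides.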

I would suggest, instead, creating a function-calling tool for the AI and having it powered by a database. Then you can have the AI query for specific parameters or fields in the database and return “all that are xxx” from the search.

https://platform.openai.com/docs/guides/function-calling
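
A minimal sketch of that pattern, assuming a local SQLite table of orders; the query_orders name, its parameters, and the schema are invented for illustration:

```python
import json
import sqlite3

# Hypothetical table: orders(customer_id INTEGER, year INTEGER, amount REAL)
db = sqlite3.connect("orders.db")

# The function tool the assistant can call instead of file search.
tools = [{
    "type": "function",
    "function": {
        "name": "query_orders",  # hypothetical name
        "description": "Aggregate order records, optionally filtered by year.",
        "parameters": {
            "type": "object",
            "properties": {
                "metric": {"type": "string", "enum": ["count", "total_amount"]},
                "year": {"type": "integer",
                         "description": "Restrict to orders from this year."},
            },
            "required": ["metric"],
        },
    },
}]

def query_orders(metric: str, year: int | None = None) -> str:
    # The aggregation runs in SQL, where it is exact; only the small
    # result is handed back to the model.
    column = "COUNT(*)" if metric == "count" else "SUM(amount)"
    sql, params = f"SELECT {column} FROM orders", ()
    if year is not None:
        sql, params = sql + " WHERE year = ?", (year,)
    (value,) = db.execute(sql, params).fetchone()
    return json.dumps({"metric": metric, "year": year, "value": value})

# When a run ends in requires_action, execute the call and submit the output:
#   args = json.loads(tool_call.function.arguments)
#   output = query_orders(**args)
```

The model only decides which parameters to pass; the counting and grouping happen in the database, so totals over 10K or 300K records come back exact.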

Then there is simply the AI’s comprehension of what was returned. The better focused the search results, the better the AI will answer.


I can use function calling with my assistant, but then I have to implement all the logic for the queries and groupings myself. If the AI cannot quickly analyze my data and answer based on the files and data I have uploaded, I wonder what the real usefulness of assistants is. For example, if I want to analyze older data to extract a pattern, or compare orders made in 2019 with those made in 2023, I have to implement all the queries and logic for that analysis as well. Don’t you agree?

The usefulness is that it lowers the skill level required for a human to make the same query, when the query is based on knowledge contained in natural language.

If you made thousands of your own files, each with highly relevant data, such as small tables with sorted customer names or individual invoices, all under the chunk threshold, the semantic similarity might work better. But then you may get back only customer-list results, because a question about customers looks a lot more like a list of customers than like any invoice.
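
If you want to experiment with that layout, the preparation step is small. A sketch with made-up records that writes one self-describing file per customer, each far under the chunk threshold:

```python
from pathlib import Path

# Illustrative records; in practice these come from your own export.
customers = [
    {"id": 1, "name": "Acme Travel", "country": "IT", "orders": 42},
    {"id": 2, "name": "Globex Tours", "country": "DE", "orders": 7},
]

out = Path("kb_files")
out.mkdir(exist_ok=True)

for c in customers:
    # One small, self-contained file per record, labelled in natural
    # language so a chunk still makes sense without a table header.
    body = "\n".join(f"{key}: {value}" for key, value in c.items())
    (out / f"customer_{c['id']}.txt").write_text(f"Customer record\n{body}\n")
```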

I agree with @_j. Having worked with both structured and unstructured text data in assistants with retrieval, for structured data it is most effective to use traditional approaches through function calling, as he outlined above. However, in my testing I have found that both structured and unstructured text data will always perform better in file search when broken down as small as possible without losing relevant context.


Hi @_j :wave: I agree with you, but I have also tried uploading smaller files (10 JSON files with customer and order data), and the result was not what I expected. I had put both customer and order information in a single record, thinking the retrieval would work better that way. In your opinion, how can I use the vector store and files to enhance my assistant’s knowledge? For example, I am implementing a sort of copilot on the Salesforce platform using the Assistants API. I want the assistant to know both the context of the record I am viewing on Salesforce and a set of historical and behavioural data not related to the current record. So my need is to give the assistant this kind of extra context for its responses. How can I accomplish this?

Hi @carl.g.brown :wave: your approach is good, but in my use case the assistant needs to know a lot of historical data. For example, I have 300K order records from customers who purchased travel from a tour operator (this is one of the contexts I am working on). My questions are:

  1. What is the best format for my files so that they are indexed as well as possible?
  2. What is the optimal size (how many records per file)?
  3. In what way can function calling help me perform queries on a database, given that the Salesforce platform has its own series of limitations and language restrictions?

That is a large data set. If you are set on using RAG for this, a graph-based approach may make more sense. Unfortunately, you do not have that control with OpenAI’s current vector store implementation.

  1. I typically find that for file search with structured data, passing it as .csv works best.
  2. Size has become less important for me, and I focus more on categorization. Say some customers in your data purchased only once and some purchased more than once; that would be a good category separation to divide into separate files (see the sketch after this list). If you keep compounding more differences, the data will naturally break down into smaller files. The more categories, the easier this is, but it can be hard if there is not much variance in the data.
  3. I don’t have a good answer for this question, as I typically don’t use AI to query my structured data through function calling; instead I pass the relevant data as context through a user-selection process. This helps refine the type of data needed and allows for traditional data retrieval, which is much faster.
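
For point 2, a sketch of that category split, assuming a flat export named orders.csv with an order_count column (both names are hypothetical):

```python
import pandas as pd

# Hypothetical export: one row per customer with an order count.
df = pd.read_csv("orders.csv")

# Derive a category (single vs. repeat purchasers), then write one .csv
# per category, so each vector-store file is smaller and more internally
# consistent.
df["segment"] = df["order_count"].apply(
    lambda n: "single_purchase" if n == 1 else "repeat_purchase")

for segment, group in df.groupby("segment"):
    group.drop(columns="segment").to_csv(f"{segment}.csv", index=False)
```

Each extra distinguishing attribute you compound into the segment key splits the data into smaller, more homogeneous files.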