Design advice - one or many Assistants?

Hi, I am wondering about the optimal knowledge scope of an Assistant that uses vector stores to perform file search.

My assistant will extract data from roughly 200 files that document large, capital-intensive construction projects, with the aim of populating a database.

Is it OK if all the documentation for the different projects (say 100 of them) goes in the same vector store, or should I create a separate Assistant for each construction project?

I would like to build the information in the vector store over time and maybe open it up for chat-bots, hence the reason for not just deleting and recreating a vector store per construction project for each run.

I thought that providing document metadata to the vector store would solve this problem (guiding the Assistant as to which construction project the content belongs to), but unfortunately this is not yet supported. I can add my own metadata to text files (although this will be destroyed by chunking), but I cannot add my own metadata to PDFs, so the custom-metadata route won't work either.

Any advice gratefully received, thank you A :slight_smile:


Did you make a decision regarding this? What approach would you suggest? I have a similar project.

Custom metadata won't work. I am also building a utility and had to convert all my files to .txt. This was the only way I could make it work.
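For what it's worth, a minimal sketch of that idea: once files are plain text, you can write the metadata into the text itself and repeat it at intervals, so that most chunks still carry the project label after splitting. The helper, header format, and block size below are assumptions, not an official feature:

```python
def tag_text_with_metadata(text: str, project_id: str, block_chars: int = 2000) -> str:
    # Repeat a short metadata header throughout the document so that most
    # chunks produced by the splitter still carry the project label.
    # block_chars should be at or below the chunker's chunk size (assumption).
    header = f"[PROJECT: {project_id}]\n"
    blocks = [header + text[i:i + block_chars] for i in range(0, len(text), block_chars)]
    return "\n".join(blocks)
```

This doesn't give you real metadata filtering, but it makes the project label visible to the model inside every retrieved chunk.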

My advice is to create specialized Assistants.

A vector store with 200 files in it is massive. As long as those files have descriptive names, you could create a workaround for the lack of custom metadata; but this doesn't help with finding information within those files if they are also long.
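A minimal sketch of that descriptive-name workaround; the delimiter and the fields chosen are assumptions:

```python
def metadata_filename(project_id: str, doc_type: str, date: str, title: str) -> str:
    # Pack the key metadata into the file name itself, so file-search results
    # reveal which project and document type a chunk came from.
    safe_title = "-".join(title.lower().split())
    return f"{project_id}__{doc_type}__{date}__{safe_title}.txt"
```

For example, a steel-supply contract for project PRJ042 would be uploaded as `PRJ042__contract__2023-11-01__steel-supply.txt`.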

Ultimately, if you have everything in one vector store, then a model would potentially have to search all of those files, which GREATLY balloons inbound tokens, making an accurate answer less likely.

But this all depends on your purpose, and what you want to draw out of the vector store.

If you’re asking questions about how one project is related to another, then one Store is probably what you need. But if you’re interested in accurate information about each individual project then create a specialized Assistant for each project. Depending on your needs, this specialized assistant could be built in 4o mini for routine search and retrieval.

Then, once retrieved, give the collected data from the many mini Assistants to one smart Assistant for analysis and insight.
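The fan-out-then-synthesize pattern described above, sketched in pure Python. The retrievers and synthesizer are injected callables standing in for per-project 4o-mini Assistants and one stronger model; the function names are hypothetical:

```python
def fan_out_and_synthesize(question, project_retrievers, synthesizer):
    # Ask every per-project retriever (e.g. a cheap 4o-mini Assistant) the
    # same question, then hand the labelled snippets to one stronger model.
    # Retrievers and synthesizer are injected, so the control flow here is
    # testable without any API calls.
    snippets = {pid: retrieve(question) for pid, retrieve in project_retrievers.items()}
    combined = "\n\n".join(f"[{pid}]\n{text}" for pid, text in snippets.items())
    return synthesizer(question, combined)
```

In production each retriever would wrap a Threads/Runs call against that project's Assistant, and the synthesizer would be a single call to the larger model with all snippets in context.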

Hi, nice project here.

Don’t use the built-in vector store; it will explode your budget.

Something like Weaviate is better suited to storing your data.

Separate per project: yes (use a project id field in your data items).
Isolate data per project: it depends. If you might need to search both across and inside projects later, then no (performance may be the extra price to pay for that type of search). On the other hand, if the projects are huge (yours are big, but not huge) and independent of each other, then definitely isolate (Weaviate multi-tenancy).
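The project-id idea in miniature. A toy keyword match stands in for the actual vector query, and the field names are assumptions; the point is that one optional filter gives you both scoped and cross-project search from a single store:

```python
def search_items(items, terms, project_id=None):
    # Toy keyword match standing in for a vector query. The optional
    # project_id filter scopes a shared store to one project; omitting it
    # allows cross-project search over the same data.
    hits = []
    for item in items:
        if project_id is not None and item["project_id"] != project_id:
            continue
        if any(t in item["text"].lower() for t in terms):
            hits.append(item)
    return hits
```

With Weaviate multi-tenancy the isolation happens at the storage layer instead, and cross-project search is no longer possible within one query.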

General approach:

  1. Pre-process your data (search my posts on RAG) before creating vectors
  2. Store data items in Weaviate to be used by the RAG engine
  3. Prepare your analysis/data-extraction workflows and operations to build batches of “data transformers”
  4. Run your data transformers on your RAG engine to build the dataset for the project to insert into your regular database
  5. See how you can connect Weaviate with your database built on top of it
  6. Write an API gateway to access both your database and Weaviate instances (make sure your OpenAPI specs are simple and in order)
  7. Give those specs to a custom GPT as an action
  8. Et voilà, you have a pretty damn good assistant for your projects, with simultaneous access to your regular and vector databases, that costs you truly peanuts compared with what it would cost using Assistants or vector stores.
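To make steps 6 and 7 concrete, here is a hypothetical minimal OpenAPI spec for such a gateway, built as a Python dict. The path, operation id, and parameter names are placeholders, not a real service:

```python
import json

# Hypothetical minimal OpenAPI spec for the gateway in steps 6-7.
spec = {
    "openapi": "3.1.0",
    "info": {"title": "Projects gateway", "version": "0.1.0"},
    "paths": {
        "/search": {
            "get": {
                "operationId": "searchDocs",
                "parameters": [
                    {"name": "q", "in": "query", "required": True,
                     "schema": {"type": "string"}},
                    {"name": "project_id", "in": "query", "required": False,
                     "schema": {"type": "string"}},
                ],
                "responses": {"200": {"description": "Matching data items"}},
            }
        }
    },
}
spec_json = json.dumps(spec, indent=2)
```

The resulting JSON is what you would paste into the custom GPT's action configuration; keeping it to a handful of simple operations makes the model far more reliable at calling it.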

Oh, I forgot to mention: you can use the same API specs with different user keys to build as many custom GPTs as you like (one for each project/client/vendor) with different instructions, and the final bill will be the same…

@nikitas Hi, at the moment I am still using a single vector store. It is working OK most of the time. I am using 4o, but it is quite expensive. I have about 2,000 files in there and 50 projects at the moment. This will multiply much more, so I will likely need more vector stores; I may end up with a vector store per country. I will be trying to shift as much of the workflow as possible to 4o-mini, starting with some of the less complex tasks.

Thank you @thinktank for your advice. I didn't think the model actually does the searching; isn't it the retrieval engine (whatever OpenAI use) that finds the closest semantic matches within the vector store, with the model just working on the top-k results that are returned? Anyway, at the moment I still have one vector store, with about 2,000 files in it and about 50 projects. It's working OK for now, but a lot of time is spent working on prompts.
Funnily enough, my biggest problem is not finding information, but the model inventing info that doesn't exist.
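For reference, the usual shape of that retrieval step, sketched below (OpenAI's actual engine and scoring are not public, so this is the generic textbook version): chunks are ranked by cosine similarity to the query embedding, and only the top k reach the model.

```python
import math

def top_k(query_vec, docs, k=3):
    # Rank every stored chunk by cosine similarity to the query embedding
    # and return only the k closest; the model reasons over just these.
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))
    return sorted(docs, key=lambda d: cos(query_vec, d["vec"]), reverse=True)[:k]
```

Model choice still matters because the model writes the search queries and interprets the returned chunks, which may explain quality differences between 4o and 4o-mini even when retrieval itself is the same.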
@sergeliatko I originally went down a route similar to what you are suggesting, but I found it just too time-consuming to build the full RAG system when OpenAI has the Assistants out of the box. Yes, they are expensive, but they are getting cheaper fairly quickly. I may come back to your approach, and I am pleased to have learned about Weaviate from you, thanks.
@MrFriday you say you had to convert all your files to .txt to make it work. It's not clear to me how this helped. I have converted all the HTML etc. into .txt files, but not my PDFs. What difference would this make, please?


I needed a different type of metadata to be added to the vector store, which was not supported by the Assistant's File Search tool, so I converted the files to .txt. If the file type is supported by OpenAI, then you don't need to convert anything.


I also don’t know, but I think the model must be involved, because of some variations I’m experiencing with search-and-retrieval between 4o-mini and 4o for the same information.

This is definitely one of the biggest problems I’m experiencing.

It’s more pronounced on 4o with Temperature=1. It is also more pronounced when a default model looks through the whole 400-page document in question, vs the subsection that I’m working with.

Use temp 0.4 and make sure the system prompt ends with an instruction to use ONLY information from the “database”. Also, on each user query, append this in caps: USE ONLY INFORMATION FROM OUR DATABASE TO REPLY. Make sure the retrieved context is actually added to the system prompt between ### START DATABASE ### and ### END OF DATABASE ### markers (inject the content retrieved from the vector DB there).
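That prompt layout as a small helper, sketched below; the marker strings follow the suggestion above, and the function name is hypothetical:

```python
def build_system_prompt(base_instructions: str, retrieved: str) -> str:
    # Wrap the retrieved chunks in explicit markers and close with the
    # all-caps grounding instruction, so the model can tell retrieved
    # context apart from its own knowledge.
    return (
        f"{base_instructions}\n\n"
        "### START DATABASE ###\n"
        f"{retrieved}\n"
        "### END OF DATABASE ###\n\n"
        "USE ONLY INFORMATION FROM OUR DATABASE TO REPLY."
    )
```

Combined with a low temperature, clearly delimiting the retrieved context like this tends to reduce (though not eliminate) invented answers.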