Increasing the number of files in the file search vector store

According to the OpenAI documentation, the maximum number of files we can have in a file search vector store is 10,000. I have a lot more documents than that and want to raise the limit. How should I do it? Is there any way to increase it, or should I use another framework such as LlamaIndex?

You cannot add more files than that to a vector store.

You might consider just combining files if you really need to get more information into the model than is allowable in 10,000 files.

If you’re dealing with that much data, though, I strongly suggest you curate it to eliminate as much unnecessary or redundant information as possible, rather than just uploading whole files with no regard for what extraneous bits might be included.

I really respect your opinion, but I am currently working on a project in the legal industry, and the documents are highly sensitive, each one carrying information relevant to a specific case. So combining them or discarding some of them is not an option for now. We have already curated the documents and organized them well. Currently there are more than 10K of them, and the number will keep increasing. What other solutions do you suggest? Are there any resources you can point me to?

Putting them in the vector store combines them.

It’s just one big vector store of embeddings.

For most use cases it doesn’t really matter if you put in 50 1-page documents or 1 50-page document.

If you absolutely need to keep the documents separate for some reason, you may need to go with a different vector database, either run locally or as a service.

Probably the best, first place to look is at Pinecone.

I’ll see if I can get any information about whether the limit here can be increased, but I’m not expecting it will be.

Per the latest update, we cannot raise the 10,000-file limit because it is fixed by OpenAI. My next concern is that if I use an external vector store such as Weaviate, the quality of the RAG pipeline may drop, since we would no longer be using the default OpenAI vector store, whose configuration is not well documented. How should I set up Weaviate to get performance similar to the OpenAI vector store?

I reached out to OpenAI and unfortunately the hard limit for number of files in a vector store is 10,000 without any announced plans to increase that limit.

So, if you absolutely need to keep the files separate you’ll need to use a different RAG solution.

But if you have already done the work of converting all of the documents to plain text, it would be trivial to concatenate them to bring the file count down to 10,000 or fewer.

Each file can contain up to 5 million tokens, which in English is roughly 3.75 million words (10-20 university textbooks’ worth of text). Multiply that by 10,000 files and you’ve got plenty of space for just about any amount of text.
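To make that arithmetic concrete, here is a quick back-of-the-envelope calculation. The 0.75 words-per-token ratio is only a common rule of thumb for English text, not an exact figure, so treat the word counts as rough estimates.

```python
# Limits quoted in this thread.
TOKENS_PER_FILE = 5_000_000
MAX_FILES = 10_000

# Assumption: roughly 0.75 English words per token (rule of thumb only).
WORDS_PER_TOKEN = 0.75

words_per_file = TOKENS_PER_FILE * WORDS_PER_TOKEN
total_tokens = TOKENS_PER_FILE * MAX_FILES
total_words = total_tokens * WORDS_PER_TOKEN

print(f"Words per file (approx.): {words_per_file:,.0f}")
print(f"Total tokens across the store: {total_tokens:,}")
print(f"Total words across the store (approx.): {total_words:,.0f}")
```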

The only thing you lose with concatenating the files is the ability for the model to directly cite which file a piece of information was retrieved from. But there are ways to ameliorate that issue (e.g. peppering the document with comment tags identifying the document and any other metadata you want the model to access).
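As a rough illustration of the tagging idea, here is a minimal sketch that wraps each document in identifying comment tags and greedily packs the tagged documents into bundles under a token budget. Everything in it is an assumption made for illustration: the tag format, the function names, and the crude word-based token estimate (a real tokenizer would be more accurate). Nothing here guarantees that file search treats the tags specially; they are just text the model can read alongside the retrieved chunk.

```python
from typing import Dict, Iterable, List, Tuple

# A document is (doc_id, metadata, text). This shape is hypothetical.
Document = Tuple[str, Dict[str, str], str]


def estimate_tokens(text: str) -> int:
    """Crude token estimate, assuming about 0.75 words per token."""
    return int(len(text.split()) / 0.75) + 1


def tag_document(doc_id: str, metadata: Dict[str, str], text: str) -> str:
    """Wrap a document in comment tags naming its id and metadata."""
    parts = [f'id="{doc_id}"'] + [f'{k}="{v}"' for k, v in sorted(metadata.items())]
    header = "<!-- BEGIN DOCUMENT " + " ".join(parts) + " -->"
    footer = f'<!-- END DOCUMENT id="{doc_id}" -->'
    return f"{header}\n{text}\n{footer}"


def bundle_documents(docs: Iterable[Document], max_tokens: int = 5_000_000) -> List[str]:
    """Greedily concatenate tagged documents into bundles under max_tokens.

    A single document larger than max_tokens still gets its own bundle;
    splitting oversized documents is left out of this sketch.
    """
    bundles: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for doc_id, metadata, text in docs:
        tagged = tag_document(doc_id, metadata, text)
        n = estimate_tokens(tagged)
        # Start a new bundle if adding this document would exceed the budget.
        if current and current_tokens + n > max_tokens:
            bundles.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(tagged)
        current_tokens += n
    if current:
        bundles.append("\n\n".join(current))
    return bundles


if __name__ == "__main__":
    sample = [
        ("case-001-brief", {"case": "001"}, "Text of the first document."),
        ("case-001-exhibit", {"case": "001"}, "Text of the second document."),
        ("case-002-brief", {"case": "002"}, "Text of the third document."),
    ]
    for i, bundle in enumerate(bundle_documents(sample)):
        print(f"--- bundle {i} ---\n{bundle}\n")
```

Grouping related documents (for example, everything from one case) into the same bundle before packing would keep the tags easier to reason about, though that choice depends on how the files are organized.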

Anyway, good luck with your project no matter which way you decide to go!