How to build an AI system that can search over 50,000 documents with high accuracy?

Hi everyone,

I’m working on a project that involves building an AI system capable of answering natural language questions based on a large collection of documents—more than 50,000 legal and regulatory files (laws, contracts, resolutions, etc.).

My main goal is accuracy and reliability. The system should:

  • Search across a large and growing corpus of structured documents
  • Return precise answers grounded in the content
  • Avoid hallucinations or fabricating answers
  • Scale to support tens of thousands of documents
  • Provide a way to trace or cite the source used in the response

My main question is:
Should I build a Retrieval-Augmented Generation (RAG) pipeline from scratch, or is there an existing OpenAI-native solution (or another tool) better suited for this use case?

I’d greatly appreciate recommendations on:

  • Best practices for managing and chunking such a large volume of documents
  • When to use OpenAI’s file search tool, vector databases, or external search systems
  • Ways to optimize grounding, context size, and response relevance

Thanks in advance for your guidance!


Well, I’ve been working on an in-house system to do something similar, in a sense (or at least, that would be one of the possible use cases of the system).

What you have to consider are context-window limitations.

So what you would have is a multi-turn agentic system that (a rough code sketch follows the list):

  1. Takes as input the initial task/query/intention
  2. Has capacity to access a list/directory tree of all files/folders relevant to the query.
  3. Turn-by-turn, pulls sets of documents up to a maximum defined context window size (say, 200k tokens, or whatever; you could test various sizes)
  4. In a given turn, analyzes those files to the extent relevant to the task and saves the output to a structured database, i.e. giving you the “result” for that chunk of files.
  5. Automatically clears its own context window, keeps a checklist of the files that have already been checked and had output saved, and continues with the next set of files.
  6. After structured output has been saved for all files, continues by reviewing the total output set and doing meta-analysis.
  7. If necessary, re-pulls certain documents for accurate quotation/matching (prefer that the original output contain explicit line numbers/section headers/semantic references so that the output structure is explicitly matched to the relevant content in the original documents for easy retrieval later).
  8. Completes the meta-analysis and provides the user with final output regarding what was discovered.
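Here's a rough sketch of that loop in Python, just to make the shape concrete. It assumes the OpenAI Python SDK; the 200k-token budget, the model name, and the helpers load_files(), save_result(), and load_all_results() are illustrative placeholders for your own file loading and database layer, not real library calls.

```python
# Rough sketch of the turn-by-turn loop described above.
# load_files(), save_result(), and load_all_results() are hypothetical helpers
# standing in for your own file loading and structured-output database.
from openai import OpenAI

client = OpenAI()
TOKEN_BUDGET = 200_000          # illustrative per-turn context budget

def count_tokens(text: str) -> int:
    return len(text) // 4       # crude approximation; swap in tiktoken for real counts

def batches(files, budget):
    """Group (path, text) pairs so each batch stays under the per-turn token budget."""
    batch, used = [], 0
    for path, text in files:
        cost = count_tokens(text)
        if batch and used + cost > budget:
            yield batch
            batch, used = [], 0
        batch.append((path, text))
        used += cost
    if batch:
        yield batch

def analyze(task: str, docs) -> str:
    """One turn: analyze a batch of documents and return structured notes."""
    body = "\n\n".join(f"### {path}\n{text}" for path, text in docs)
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Extract only facts relevant to the task. "
                                          "Cite file name and section for every claim."},
            {"role": "user", "content": f"Task: {task}\n\nDocuments:\n{body}"},
        ],
    )
    return resp.choices[0].message.content

processed = set()                                    # checklist of files already handled
for batch in batches(load_files(), TOKEN_BUDGET):
    notes = analyze("Extract clauses relevant to the user's question", batch)
    save_result([path for path, _ in batch], notes)  # persist per-batch structured output
    processed.update(path for path, _ in batch)

# Final turn(s): meta-analysis over the saved notes, not the raw files.
final_answer = analyze("Synthesize an answer from these notes", load_all_results())
```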

That’s how I would do it “using an LLM”. However, depending on your use case, this is something that can likely be accomplished without an LLM, or by using the LLM only to interpret the user’s query and convert it into more code-style language/API access to an application endpoint that then uses FAISS or vector searching to query the pre-processed database of files.

So you have two options:

  1. Use the LLM in a multi-turn system that, in natural language, generates the necessary review and extraction in stages and then performs meta-analysis (possibly very expensive: 50k documents could equate to tens or hundreds of millions of tokens depending on document length; for example, 50,000 documents averaging 5,000 tokens each is 250 million tokens)

  2. Use FAISS or some other language-processing/vector-store application to pre-process the documents, then use the LLM to ask questions about those documents and access the vector store (or other data store) through an API integration, which lets the LLM review the already-processed data and return results (a minimal sketch of this follows below).

In either case, it’s a large undertaking.
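For option #2 specifically, here is a minimal sketch of what the pre-processing and lookup side could look like, assuming OpenAI embeddings and FAISS. The chunk list, model names, and prompts are illustrative; real chunking, persistence, and metadata handling would be considerably more involved.

```python
# Minimal sketch of option #2: embed document chunks into a FAISS index,
# retrieve the nearest chunks for a question, and hand only those to the LLM.
import faiss
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data], dtype="float32")

# --- Pre-processing (run once, offline) ---
chunks = [
    # (source id, chunk text) pairs produced by your chunker -- illustrative only
    ("law_001.pdf#art3", "Article 3. The supplier shall retain records for five years ..."),
    ("contract_114.docx#cl12", "Clause 12. Termination requires thirty days' written notice ..."),
]
vectors = embed([text for _, text in chunks])
faiss.normalize_L2(vectors)
index = faiss.IndexFlatIP(vectors.shape[1])   # cosine similarity via normalized inner product
index.add(vectors)

# --- Query time ---
question = "What notice period applies to termination?"
q = embed([question])
faiss.normalize_L2(q)
_, hits = index.search(q, 5)                  # top-5 most similar chunks
context = "\n\n".join(f"[{chunks[i][0]}]\n{chunks[i][1]}" for i in hits[0])

answer = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Answer only from the provided excerpts and "
                                      "cite the bracketed source id for every statement."},
        {"role": "user", "content": f"{question}\n\nExcerpts:\n{context}"},
    ],
)
print(answer.choices[0].message.content)
```

At 50k+ documents you would swap IndexFlatIP for an approximate index (e.g. HNSW or IVF) and keep the chunk metadata in a proper store, but the flow stays the same.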

I’ve been pursuing option 1 in the application I’ve been building on my end, because I’m looking for semantic analysis at a conceptual level more than raw similarity/vector storage (i.e. analyzing thousands of pages from multiple complete book sources and building a “conceptual understanding” of the source material in a database in order to answer questions/have discussions about the material after it’s been processed by the LLM).

However, I think option #2 is actually the more robust approach; I was turned on to it by another user here in the forums, but it’s as yet a bit over my head and not within my wheelhouse.


@lucid.dev Thanks a lot for the detailed explanation — it really helps to clarify the challenge and the possible paths forward.

Given that my goal is to build a production-grade system capable of answering legal queries based strictly on large document collections, I’m now leaning toward Option #2 (pre-processing with vector storage + retrieval + LLM for reasoning).

Do you (or anyone in the forum) have suggestions on the most scalable architecture for this setup? For example:

  • Should I use OpenAI’s file search tools or go with an external vector store (like FAISS, Weaviate, Pinecone, etc.)?
  • What’s the best way to ensure traceable, citation-level responses from the LLM?
  • Are there any open-source RAG frameworks that handle this multi-step flow well?

Any tips or direction would be much appreciated — especially as I’m trying to avoid hallucinations and maintain legal accuracy.

Thanks again!

Based on my experience, I would absolutely NOT use the OpenAI file search tools. I’m pretty sure the file limit there is something like 50 files.

I’m not that familiar with the external vector store systems yet, nor am I familiar with open-source RAG frameworks (except perhaps the one I’m working on myself… but I haven’t publicized it yet as it’s not quite ready… though I’d be open to developing for a use case once I have the basic system stabilized).

Regarding traceable, citation-level responses from the LLM, structured prompting/instructions and structured data are your only shot. There are no guarantees, however, and I would absolutely NOT use the LLM without “checking it”. That is quite possible, though: if you get the LLM to “quote things directly”, your middleware system can actually verify the quotes by pulling the vector and checking the source document.
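A rough sketch of that middleware-side check, assuming the model is instructed to return each quote together with the source id it claims to come from (get_source_text() is a hypothetical lookup into your own document store):

```python
# Middleware-side quote checking: each quote the LLM returns is searched for
# verbatim (after light normalization) in the source document it cites.
# get_source_text() is a hypothetical lookup into your own document store.
import re

def normalize(text: str) -> str:
    """Collapse whitespace and straighten curly quotes so formatting noise doesn't cause false mismatches."""
    text = text.replace("\u201c", '"').replace("\u201d", '"').replace("\u2019", "'")
    return re.sub(r"\s+", " ", text).strip().lower()

def verify_quotes(quotes: list[dict]) -> list[dict]:
    """quotes: [{'source_id': 'law_001.pdf#art3', 'text': '...'}, ...]"""
    checked = []
    for q in quotes:
        source = normalize(get_source_text(q["source_id"]))
        checked.append({**q, "verified": normalize(q["text"]) in source})
    return checked

# Any quote that comes back verified == False gets flagged for regeneration or human review.
```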

So it would look like this:

  1. BACKEND SYSTEM: Design an application that takes as input a set of documents and outputs/adds them to either a global (all projects) or project-specific vector store/database. This is standard app development/programming. You could even create various levels of visualization, click-and-drag, and ease-of-use as a frontend application, or keep it very simple and bare-bones (even CLI style) to just “get it done”. It depends on whom you’re developing for and what the end use looks like.

  2. MIDDLEWARE SYSTEM - BASIC CALL: Design/use API calls to the application from #1 so that you (the user) can prompt the LLM, and the LLM can respond and “tool call” the backend vector database to perform RAG (you can use either the Responses API or the Chat Completions API; both come with their own caveats). A minimal tool-call sketch follows this list. This again is a whole question of what you want the interface to look like, what the user needs to see, etc.

  3. MIDDLEWARE SYSTEM - MULTI-TURN / AGENTS: For any kind of reasonable response, you are STILL likely going to need a multi-turn, automatic agentic system. This is partially where things get complicated. Steps #1-#2 are standard programming interfaces plus a single-turn call to the LLM, but for production-scale/large-scale auditing and interfacing with massive datasets (storing stages of LLM output, producing chains/loops of reasoning, quoting, back-referencing, etc.) you will NEVER get reliable, accurate, or complete results from a single-turn call against a very large database. Everything will have to be chunked and allow for multi-turn reasoning stages.

  • This is a heavy-duty prompt-engineering process as well (regarding step #3). It’s exciting and fun, but it’s a very significant amount of labor, testing, revamping, etc. Do existing systems exist? Maybe. Everyone is kind of doing their own thing right now. Everything I’ve seen promoted as polished SaaS is actually kind of BS and not really relevant to specific use-case tools or building custom systems for custom work. Everyone is still scrambling to come up with a general-use system, but your best bet is actually to either learn how to code (you already do?) or hire a developer to work with you and help you get this done.
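To make step #2 concrete, here's a hedged sketch of a basic tool-call loop with the Chat Completions API: the model is given a search_corpus function that fronts your vector store, the middleware executes the call, and the result is fed back for a grounded, cited answer. run_vector_search() is a hypothetical wrapper around whatever store you choose (FAISS, Weaviate, Pinecone, etc.); the model and prompts are illustrative.

```python
# Sketch of the basic middleware call: the model "tool calls" a search function,
# the middleware runs the retrieval, and the model answers from the results.
# run_vector_search() is a hypothetical wrapper around your vector store.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "search_corpus",
        "description": "Search the legal corpus; returns relevant excerpts with source ids.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [
    {"role": "system", "content": "Answer only from search_corpus results and cite source ids."},
    {"role": "user", "content": "Which resolutions regulate data retention for contractors?"},
]

resp = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
msg = resp.choices[0].message

while msg.tool_calls:                               # resolve tool calls until the model answers
    messages.append(msg)
    for call in msg.tool_calls:
        query = json.loads(call.function.arguments)["query"]
        excerpts = run_vector_search(query)         # hypothetical retrieval against your store
        messages.append({"role": "tool", "tool_call_id": call.id,
                         "content": json.dumps(excerpts)})
    resp = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    msg = resp.choices[0].message

print(msg.content)   # final answer grounded in retrieved excerpts
```

The multi-turn agentic layer in step #3 then wraps loops like this one, adding its own state tracking and chunked reasoning stages on top.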

The system I’m developing is intended to take as a “blueprint” a use case like this, and then actually, over the course of several hours/days and thousands of LLM calls (and tens of millions of tokens), allow the LLM to “work alone” and take a project from blueprint → fully functioning application → testing → report back to user.

It’s been about 6 months, and I’m close. I’ll need some good test cases once I get there. But you would need a several-page blueprint: a complete and totally explicit design at all levels. That’s what you start with, whether you’re coding alone, coding with the LLM, or hiring a developer to help you. A very strong conceptual blueprint/design, and then the LLM/developer can help you fill out the technical details and assess different options. As it stands, you need to turn your “one-line idea” into “2-5 pages of detailed thinking through the entire process, at a basic conceptual/technical level, about how you actually want the system to operate and be used”.

@lucid.dev Thank you so much for this incredibly thoughtful and detailed response. I can see now that building something reliable at this scale is not just a matter of plugging in a vector store and calling it a day — the middleware, multi-turn logic, and validation pipeline are essential.

I’d definitely be interested in learning more about the system you’re building, once it’s ready to test. My use case (legal queries over a large corpus of regulations, laws, and legal doctrine) would be ideal for testing structured citation, traceability, and consistency.

In the meantime, I’ll work on putting together a proper 3–5 page blueprint describing the full workflow, system requirements, expected user experience, and technical expectations. Once I have that, I’d really appreciate your thoughts or feedback — or even exploring the possibility of working together in some capacity if it aligns.

Thanks again for taking the time to share your insights.