How to build an AI system that can search over 50,000 documents with high accuracy?

francotrigo · June 12, 2025, 12:55pm

@lucid.dev Thanks a lot for the detailed explanation — it really helps to clarify the challenge and the possible paths forward.

Given that my goal is to build a production-grade system capable of answering legal queries based strictly on large document collections, I’m now leaning toward Option #2 (pre-processing with vector storage + retrieval + LLM for reasoning).
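
For reference, a minimal sketch of the Option #2 flow (pre-process into chunks, store vectors, retrieve, then hand the retrieved text to the LLM with citation tags) might look like the following. Everything here is an assumption for illustration: `embed` is a toy bag-of-words stand-in for a real embedding model, `InMemoryStore` is a brute-force stand-in for FAISS, Weaviate, or Pinecone, and the document IDs and contract text are made up. The LLM call itself is left out; the sketch stops at the prompt that would be sent.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str    # hypothetical document identifier
    chunk_id: int  # position of the chunk within its document
    text: str


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryStore:
    """Stand-in for an external vector store: brute-force search over all chunks."""

    def __init__(self) -> None:
        self._items: list[tuple[Chunk, Counter]] = []

    def add(self, chunk: Chunk) -> None:
        # Pre-processing step: embed once at ingestion time.
        self._items.append((chunk, embed(chunk.text)))

    def search(self, query: str, k: int = 3) -> list[Chunk]:
        # Retrieval step: rank every stored chunk by similarity to the query.
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query: str, chunks: list[Chunk]) -> str:
    """Tag each retrieved chunk so the LLM can cite it as [doc_id#chunk_id]."""
    sources = "\n".join(f"[{c.doc_id}#{c.chunk_id}] {c.text}" for c in chunks)
    return (
        "Answer using ONLY the sources below. Cite each claim as [doc_id#chunk_id]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}"
    )


# Made-up example data, for illustration only.
store = InMemoryStore()
store.add(Chunk("contract_a", 0, "The tenant must give 60 days written notice before termination."))
store.add(Chunk("contract_a", 1, "Rent is due on the first day of each month."))
store.add(Chunk("statute_b", 0, "A landlord may not withhold the security deposit without itemized deductions."))

question = "How much notice must the tenant give before termination?"
hits = store.search(question, k=2)
prompt = build_prompt(question, hits)
print(prompt)
```

Carrying `doc_id` and `chunk_id` through retrieval and into the prompt is what makes citation-level answers checkable afterwards: each cited tag can be looked up in the store and compared against the claim it supports.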

Do you (or anyone in the forum) have suggestions on the most scalable architecture for this setup? For example:

Should I use OpenAI’s file search tools or go with an external vector store (like FAISS, Weaviate, Pinecone, etc.)?
What’s the best way to ensure traceable, citation-level responses from the LLM?
Are there any open-source RAG frameworks that handle this multi-step flow well?

Any tips or direction would be much appreciated — especially as I’m trying to avoid hallucinations and maintain legal accuracy.

Thanks again!
