Scaling RAG chatbot system to millions of documents

curt.kennedy · February 8, 2024, 8:05pm

Beyond optimizing the search strategy, you should also consider what I call the Control Plane: Input X → Action Y.

The action may influence your search pattern.

For example, “All companies that did Z in the last year”.

So “in the last year” would create a filter (applied either before or after retrieval) that time-gates your results.

The LLM itself doesn’t have reliable time gating, so it has to be done directly in the query.

However, even “in the last year” may not be 100% reliable, since it has to be inferred, so a higher-SNR solution is to take things like this as user inputs: the user sets a cutoff date explicitly, either through a GUI or through an explicit voice or text command that the AI acknowledges back to the user.
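To make the time gate concrete, here is a minimal sketch of a pre-filter driven by an explicit cutoff date. The `Doc` class, its fields, and the `acknowledge` helper are made up for illustration; in a real system the filter would usually be a metadata filter inside your vector database query rather than a Python list comprehension.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    doc_id: str
    published: date
    score: float  # similarity score from the retriever (hypothetical field)


def acknowledge(cutoff):
    """Message the AI echoes back so the user can confirm the cutoff."""
    return f"Okay, only using documents published on or after {cutoff.isoformat()}."


def search_with_cutoff(docs, cutoff, top_k=2):
    # Pre-filter: drop out-of-range documents before ranking.
    in_range = [d for d in docs if d.published >= cutoff]
    return sorted(in_range, key=lambda d: d.score, reverse=True)[:top_k]


# The cutoff comes from an explicit user input, not from LLM inference.
cutoff = date.fromisoformat("2023-02-08")
corpus = [
    Doc("a", date(2023, 11, 1), 0.71),
    Doc("b", date(2021, 5, 20), 0.93),  # best score, but older than the cutoff
    Doc("c", date(2023, 6, 15), 0.64),
]
print(acknowledge(cutoff))
hits = search_with_cutoff(corpus, cutoff)
print([d.doc_id for d in hits])
```

Note that document "b" has the highest similarity score but never reaches the ranking step, which is the point of gating in the query instead of hoping the LLM ignores stale results.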

Also, don’t forget hybrid search: embeddings + keywords, or embeddings1 + embeddings2 + … + embeddingsN. Combine all the results and fuse them into a single ranking.
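The post above doesn't say how to fuse the lists. One common choice is reciprocal rank fusion (RRF), sketched below under that assumption. It only needs the rank positions from each retriever, so it works even when embedding scores and keyword scores are on different scales. The doc ids and the `k=60` constant are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by doc id for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


embedding_hits = ["d1", "d2", "d3"]  # e.g. from a vector index
keyword_hits = ["d3", "d1", "d4"]    # e.g. from a keyword index
fused = reciprocal_rank_fusion([embedding_hits, keyword_hits])
print(fused)
```

Documents that show up in more than one list get rewarded, so "d1" and "d3" rise above documents that only one retriever found.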

In addition, you can use the LLM to reframe the user’s original input, essentially expanding on it, and then use that as the search query. So, given “Input X”, ask the AI to produce, for example, “Rephrase Input X from the perspective of a hedge fund manager”. You can do this reframing from different perspectives, depending on your overall goal (if there is one) with the user.

It can get out of hand, but you could take the user’s initial input, Q different AI-synthesized inputs, and multiple embeddings/keywords, all running in parallel to form some interesting retrievals.
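Here is a rough sketch of that fan-out: the original input plus Q reframed inputs, each sent to every retriever in parallel. The `fake_llm` and `fake_retriever` functions are stand-ins so the sketch runs offline; swap in your own LLM call and indexes. The second perspective is just an invented example.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

PERSPECTIVES = ["a hedge fund manager", "a compliance officer"]


def reframe_prompt(user_input, perspective):
    """Build the instruction you would send to the LLM."""
    return f"Rephrase the following from the perspective of {perspective}: {user_input}"


def fake_llm(prompt):
    # Stand-in for a real LLM call.
    return f"[rewritten] {prompt}"


def fake_retriever(name):
    # Stand-in for an embedding or keyword index.
    def retrieve(query):
        return [f"{name}-doc-{len(query) % 5}"]
    return retrieve


def run_job(job):
    query, (name, retrieve) = job
    return name, query, retrieve(query)


def fan_out(user_input, retrievers, llm=fake_llm):
    # Original input plus one synthesized query per perspective.
    queries = [user_input] + [llm(reframe_prompt(user_input, p)) for p in PERSPECTIVES]
    # Every query goes to every retriever.
    jobs = list(product(queries, retrievers.items()))
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(run_job, jobs))
    return queries, results


retrievers = {"embed": fake_retriever("embed"), "keyword": fake_retriever("keyword")}
queries, results = fan_out("All companies that did Z in the last year", retrievers)
print(len(queries), "queries x", len(retrievers), "retrievers =", len(results), "retrievals")
```

The number of retrievals is (1 + Q) times the number of retrievers, so it grows multiplicatively as you add perspectives or indexes.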

So the compute explodes quickly, and you need to go back to the drawing board and do more and more pre-filtering.

So you need a computationally “cheap” initial filter, followed by a more refined expensive filter. There are many of these, but that’s the general idea.

But I will leave you with one to think about … using small vector embeddings as an upfront “cheap” pre-filter, followed by larger embedding vectors as a refinement. You can do this with the new OpenAI embedding models, as I have talked about over here.
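A minimal sketch of the small-then-large idea, assuming an embedding model whose vectors stay meaningful when you keep only the leading dimensions and re-normalize (my understanding is that this is how the shortened OpenAI text-embedding-3 vectors are meant to be used). The random vectors, dimension sizes, and shortlist size here are placeholders so the code runs without any API; in practice you would precompute and store the short vectors instead of shortening them at query time.

```python
import math
import random


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def shorten(v, dims):
    """Keep the first `dims` components and re-normalize."""
    return normalize(v[:dims])


def two_stage_search(query, corpus, small_dims=8, shortlist=20, top_k=3):
    # Stage 1 (cheap): rank the whole corpus with short vectors.
    q_small = shorten(query, small_dims)
    coarse = sorted(
        range(len(corpus)),
        key=lambda i: dot(q_small, shorten(corpus[i], small_dims)),
        reverse=True,
    )[:shortlist]
    # Stage 2 (expensive): re-rank only the shortlist with full vectors.
    return sorted(coarse, key=lambda i: dot(query, corpus[i]), reverse=True)[:top_k]


random.seed(0)
FULL_DIMS = 64
corpus = [normalize([random.gauss(0, 1) for _ in range(FULL_DIMS)]) for _ in range(200)]
query = corpus[42]  # query with a stored vector, so index 42 should come back first
top = two_stage_search(query, corpus)
print(top)
```

The full-size dot products are only computed for the shortlist (20 of 200 here), which is where the savings come from. The tradeoff is that a good document can be lost if the short vectors rank it outside the shortlist, so the shortlist size is something to tune.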
