I wanted to share with the community some work I’ve been doing around Retrieval Augmented Generation (RAG) using long form documents. I’ve developed a new technique called “Document Sections” which I feel is superior to traditional techniques being employed today.
Here’s an example of the chunks returned by a vector DB like Pinecone. In this case the output comes from Vectra, my local vector DB project, for a query over a small corpus of documents from the Teams AI Library, which I designed.
The query is “how does storage work?” and you can see that while the chunks are relevant, they’re completely out of order. Traditional RAG approaches simply add the text from the chunks to the prompt in the order they’re returned from the search engine, filling the context with as many chunks as the token budget allows. These chunks often carry 20-40 additional overlap tokens for added context, which can result in duplicated tokens being presented to the model.
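To make the problem concrete, here’s a minimal sketch (names and token counting are mine, not from any real RAG library) of that traditional approach: pack chunks into the prompt in search-score order until the budget runs out. Notice that the steps come out in score order, not document order, and overlap text gets duplicated.

```python
# A sketch of naive chunk packing: chunks are appended in score order,
# so document order is lost and overlap tokens are duplicated.

def pack_chunks(chunks, token_budget):
    """chunks: list of (score, text) sorted by score descending."""
    parts, used = [], 0
    for score, text in chunks:
        cost = len(text.split())  # crude whitespace token estimate
        if used + cost > token_budget:
            break
        parts.append(text)
        used += cost
    return "\n\n".join(parts)

# Hypothetical search results for "how does storage work?"
chunks = [
    (0.91, "Step 3: call storage.write() to persist state."),
    (0.88, "Step 1: configure a storage provider."),
    (0.85, "Step 2: attach the provider to the app. Step 3: call storage.write()"),
]
context = pack_chunks(chunks, token_budget=50)
# "Step 3" lands before "Step 1" in the prompt, and the "Step 3" overlap
# from the third chunk is repeated.
```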
Let’s look at the same query but with the returned text organized into Document Sections:
Everything is in the correct order and free of any duplicated tokens. Order doesn’t always matter to the model, but if you’re asking a question like “what are the steps to do XYZ task?”, order is super important. Showing the model the steps out of order can result in it telling the user the wrong sequence to follow.
In this example, the top document was small, so the renderer chose to just return a section containing the entire document text. For longer documents, the renderer uses the chunks to essentially find the spans of document text that most likely contain the user’s answer. Think of it as using the chunks to create a sort of heatmap of the most relevant parts of the document. It then returns these spans of text as one or more Document Sections.
The renderer is passed the desired size of each section (token budget) and the number of sections to return. Most RAG implementations send a single set of chunks to the model and ask it to answer the user’s question. This can work for simple questions, but I’m interested in having the model answer more complicated questions the way a human would. My plan is to ask for 3-5 Document Sections and then present all of these sections to the model in parallel. I’m going to ask it to draw some initial conclusions relative to the user’s question and then present all of the conclusions to the model to generate a final answer.
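That parallel plan can be sketched as a small map-reduce loop. This is just an illustration of the shape of the idea; `ask_model` is a stand-in for a real chat-completion call, and all prompt wording here is mine.

```python
# Map-reduce over Document Sections: draw an initial conclusion from each
# section in parallel, then combine the conclusions into a final answer.
from concurrent.futures import ThreadPoolExecutor


def ask_model(prompt):
    # Placeholder for an actual LLM call (e.g. a chat-completion request).
    return f"conclusion for: {prompt[:40]}"


def answer(question, sections):
    # Map: one prompt per section, run in parallel.
    map_prompts = [
        f"Given this section:\n{s}\nWhat does it say about: {question}"
        for s in sections
    ]
    with ThreadPoolExecutor() as pool:
        conclusions = list(pool.map(ask_model, map_prompts))
    # Reduce: present all initial conclusions together for a final answer.
    reduce_prompt = (
        f"Combine these initial conclusions into a final answer to "
        f"'{question}':\n" + "\n".join(conclusions)
    )
    return ask_model(reduce_prompt)
```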
So how does the algorithm work? All of the chunks contain startPos and endPos offsets and a documentId. No text is stored with the chunks, to keep them as small as possible; the original text is stored externally, so it needs to be retrieved at render time. The algorithm first sorts all of the chunks by startPos so that they’re in document order. It then groups the chunks into sections based on the token count for each chunk, and a pass is made to merge any adjacent chunks. For each section, the algorithm calculates the remaining token budget and then fetches additional text to fill in the gaps around the section’s text spans. This essentially makes the overlap text dynamic and maximizes the density of the returned sections. The scores for the chunks within a given section are then averaged, and the sections are returned sorted by their average score.
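Here’s a simplified sketch of those steps. The data shapes and names are my assumptions, not Vectra’s actual code, and the dynamic gap-filling against the remaining token budget is omitted for brevity; the sketch shows sorting by startPos, merging adjacent spans into sections, and ranking sections by average score.

```python
# Sketch of section rendering: sort chunks into document order, merge
# touching/overlapping spans into sections, fetch each span's text from
# external storage, and rank sections by their average chunk score.
from dataclasses import dataclass


@dataclass
class Chunk:
    document_id: str
    start_pos: int
    end_pos: int
    score: float


def render_sections(chunks, get_text, gap=0):
    # 1. Sort the chunks by startPos so they're in document order.
    chunks = sorted(chunks, key=lambda c: (c.document_id, c.start_pos))
    # 2. Merge chunks whose spans touch or overlap into one section.
    sections = []
    for c in chunks:
        if (sections and sections[-1]["doc"] == c.document_id
                and c.start_pos <= sections[-1]["end"] + gap):
            s = sections[-1]
            s["end"] = max(s["end"], c.end_pos)
            s["scores"].append(c.score)
        else:
            sections.append({"doc": c.document_id, "start": c.start_pos,
                             "end": c.end_pos, "scores": [c.score]})
    # 3. Fetch the text for each span (stored externally, not with the
    #    chunks) and sort sections by their average score.
    rendered = [{
        "text": get_text(s["doc"], s["start"], s["end"]),
        "score": sum(s["scores"]) / len(s["scores"]),
    } for s in sections]
    return sorted(rendered, key=lambda s: s["score"], reverse=True)
```

Because text is fetched by offset at render time, widening a span to spend the leftover token budget would just mean adjusting `start`/`end` before calling `get_text`, which is what makes the overlap text dynamic.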
The end result is that you get back one or more sections of contiguous text that are most likely to contain the user’s answer. There are a number of ways this algorithm could be further improved, which I’m happy to dive into if anyone is interested.
Here’s a link to the algorithm if you’re interested: