Document Sections: Better rendering of chunks for long documents

I wanted to share with the community some work I've been doing around Retrieval Augmented Generation (RAG) over long-form documents. I've developed a new technique called "Document Sections" which I feel is superior to the traditional techniques being employed today.

Here's an example of the chunks returned by a vector DB like Pinecone. This particular output is from Vectra, my local vector DB project, for a query over a small corpus of documents from the Teams AI Library I designed.

The query is "how does storage work?" and you can see that while the chunks are relevant, they're completely out of order. Traditional RAG approaches just add the text from the chunks to the prompt in the order they're returned from the search engine, filling the context with as many chunks as the token budget allows. Each chunk also often carries 20-40 additional tokens of overlap for added context, which can result in duplicated tokens being presented to the model.
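To make the contrast concrete, here's a rough sketch of that traditional rendering approach (the types and names here are illustrative, not Vectra's actual API):

```typescript
interface ScoredChunk {
    text: string;
    score: number;
    tokenCount: number;
}

// Traditional rendering: append chunks in the order the search engine
// returned them until the token budget is exhausted. Document order is
// ignored and overlap tokens in adjacent chunks get included twice.
function renderNaive(chunks: ScoredChunk[], tokenBudget: number): string {
    const parts: string[] = [];
    let used = 0;
    for (const chunk of chunks) {
        if (used + chunk.tokenCount > tokenBudget) break;
        parts.push(chunk.text);
        used += chunk.tokenCount;
    }
    return parts.join("\n\n");
}
```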

Let’s look at the same query but with the returned text organized into Document Sections:

Everything is in the correct order and free of any duplicated tokens. Order doesn’t always matter to the model but if you’re asking the model a question like “what are the steps to do XYZ task?”, order is super important. Showing the model the steps out of order can result in the model telling the user the wrong sequence to follow.

In this example, the top document was small so the renderer chose to just return a section containing the entire document text. For longer documents, the renderer uses the chunks to essentially find the spans of document text that most likely contain the user's answer. Think of it as using the chunks to create a sort of heat map for the most relevant parts of the document. It then returns these spans of text as one or more Document Sections.

The renderer is passed the desired size of each section (token budget) and the number of sections to return. Most RAG implementations send a single set of chunks to the model and ask it to answer the user's question. This can work for simple questions, but I'm interested in having the model answer more complicated questions the way a human would. My plan is to ask for 3-5 Document Sections and then present all of these sections to the model in parallel. I'm going to ask it to draw some initial conclusions relative to the user's question and then present all of the conclusions to the model to generate a final answer.
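Roughly, that plan could look something like this (the prompts and the `complete` parameter are placeholders for whatever completion client you use, not the actual implementation):

```typescript
interface DocumentSection {
    text: string;
    score: number;
}

async function answerFromSections(
    question: string,
    sections: DocumentSection[],
    complete: (prompt: string) => Promise<string>
): Promise<string> {
    // Map step: draw an initial conclusion from each section in parallel.
    const conclusions = await Promise.all(
        sections.map((s) =>
            complete(
                `Question: ${question}\n\nText:\n${s.text}\n\n` +
                `What initial conclusions can you draw about the question from this text?`
            )
        )
    );

    // Reduce step: generate a final answer from the collected conclusions.
    return complete(
        `Question: ${question}\n\nInitial conclusions:\n${conclusions.join("\n---\n")}\n\n` +
        `Using only these conclusions, give a final answer.`
    );
}
```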

So how does the algorithm work? All of the chunks contain startPos & endPos offsets and a documentId. No text is stored with the chunks, to keep them as small as possible; the original text is stored externally and needs to be retrieved at render time. The algorithm first sorts all of the chunks by startPos so that they're in document order. It then groups the chunks into sections based on the token count of each chunk, and a pass is made to merge any adjacent chunks. For each section the algorithm calculates the remaining token budget and fetches additional text to fill in the gaps around the section's text spans. This essentially makes the overlap text dynamic and maximizes the density of the returned sections. Finally, the scores for the chunks within a given section are averaged and the sections are returned sorted by that average score.
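Here's a stripped-down sketch of that flow for a single document's chunks (simplified; the gap-filling step that pulls extra text from the external store is only noted in a comment):

```typescript
interface Chunk {
    documentId: string;
    startPos: number;   // character offsets into the original document text
    endPos: number;
    score: number;
    tokenCount: number;
}

interface Section {
    documentId: string;
    startPos: number;
    endPos: number;
    score: number;      // average score of the chunks the section covers
}

function buildSections(chunks: Chunk[], sectionTokenBudget: number): Section[] {
    // 1. Sort the chunks into document order.
    const sorted = [...chunks].sort((a, b) => a.startPos - b.startPos);

    // 2. Group adjacent chunks into sections without exceeding the budget.
    const groups: { chunks: Chunk[]; tokens: number }[] = [];
    for (const chunk of sorted) {
        const last = groups[groups.length - 1];
        if (last && last.tokens + chunk.tokenCount <= sectionTokenBudget) {
            last.chunks.push(chunk);
            last.tokens += chunk.tokenCount;
        } else {
            groups.push({ chunks: [chunk], tokens: chunk.tokenCount });
        }
    }

    // 3. Average the chunk scores per section and return the sections sorted
    //    by that score. (Filling the gaps between chunk spans with extra text
    //    from the external store, up to the remaining budget, would go here.)
    return groups
        .map((g) => ({
            documentId: g.chunks[0].documentId,
            startPos: g.chunks[0].startPos,
            endPos: g.chunks[g.chunks.length - 1].endPos,
            score: g.chunks.reduce((sum, c) => sum + c.score, 0) / g.chunks.length,
        }))
        .sort((a, b) => b.score - a.score);
}
```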

The end result is that you get back one or more sections of contiguous text that are most likely to contain the user's answer. There are a number of ways this algorithm could be further improved, which I'm happy to dive into if anyone is interested.

Here’s a link to the algorithm if interested:

23 Likes

Great write-up … my head is spinning now!

I like the heat map and continuous chunk concepts.

If latency allows, I'm wondering about a dynamic approach where you continuously grow the text around certain high-correlation hits, re-embedding the expanding chunks centered on the initial hit to grow the correlation at runtime. Like a true heat map. And use these "grown" chunks as the representations for RAG to consume.

You could cache these “growths” and watch your embedding tree grow over time as live searches came in.

You could also shift the growth up or down in the text offsets (so non-centered), use the embedding correlation as a sort of gradient to drive the optimization, and cut off after a certain point of lessening correlation.

With this, your embeddings dynamically adjust to your users, instead of being artificially chunked across a non-semantic grid.
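Something like this, very roughly (the `embed` parameter is a placeholder for your embedding call, and this version only grows symmetrically rather than the non-centered variant):

```typescript
function cosine(a: number[], b: number[]): number {
    let dot = 0, na = 0, nb = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Grow a chunk outward from the initial hit, re-embedding at each step and
// stopping once similarity to the query stops improving.
async function growChunk(
    docText: string,
    queryVector: number[],
    start: number,
    end: number,
    embed: (text: string) => Promise<number[]>,
    step = 200              // characters added per growth step
): Promise<{ start: number; end: number }> {
    let best = cosine(await embed(docText.slice(start, end)), queryVector);
    while (start > 0 || end < docText.length) {
        const nextStart = Math.max(0, start - step);
        const nextEnd = Math.min(docText.length, end + step);
        const score = cosine(await embed(docText.slice(nextStart, nextEnd)), queryVector);
        if (score <= best) break;   // correlation stopped growing; cut off here
        best = score;
        start = nextStart;
        end = nextEnd;
    }
    return { start, end };
}
```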

6 Likes

Interesting work Steven, I'll be paying close attention to this as you move forward.

I’m currently working on an embedding project using a large number of academic papers. This topic has got me thinking about how to structure retrieved document snippets which may come from several different papers.

There’s obviously the publication date, but also the citation network which can be employed for ordering retrievals.

I see some opportunity to take this even farther, and if my suspicion is correct, could change how people use embeddings going forward.

Because now I'm also thinking one could do some clustering on the retrieved embeddings to group together parts from different documents which share a common thread of relevance. Then the within- and between-cluster organization could be optimized based on dates, document position, and the citation graph…

Then the model could run a within-cluster pass to draft cohesive pseudo-documents, which could then be merged in a second between-cluster pass to create a full reference document to inject into context.
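Very roughly, the clustering step might look like this (greedy threshold clustering and the 0.85 cutoff are just stand-ins; `cosine` is any cosine-similarity helper):

```typescript
interface Retrieved {
    text: string;
    vector: number[];
}

// Group retrieved chunks into clusters of related content: attach each item
// to the first cluster whose seed is similar enough, otherwise start a new
// cluster. Each cluster then gets its own within-cluster drafting pass.
function clusterRetrieved(
    items: Retrieved[],
    cosine: (a: number[], b: number[]) => number,
    threshold = 0.85
): Retrieved[][] {
    const clusters: Retrieved[][] = [];
    for (const item of items) {
        const home = clusters.find((c) => cosine(c[0].vector, item.vector) >= threshold);
        if (home) home.push(item);
        else clusters.push([item]);
    }
    return clusters;
}
```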

We’ve seen many times that the quality and structure of a prompt has an outsized influence on the quality of a response, so it’s not hard to imagine tossing a bunch of randomly arranged snippets into the context would yield worse results than an integrated approach.

I think you’re very much on to something here, and I would encourage you to run with it.

Find or create a benchmark for responses with naive embedding context, with intelligently ordered embeddings, and then with a pseudo-document constructed from retrievals.

I’m seriously very excited about this, it’s like a corollary to HyDE, but instead of generating a fake answer to use to find relevant embeddings you could be taking relevant embeddings to generate a fake document to put into context…

There could absolutely be a publishable paper in there for you!

7 Likes

Great ideas… These ideas are about a week old but I feel like they’re the start of something groundbreaking on the RAG side of things. There are so many directions this can be taken. Happy to share my improvement ideas as well. But please keep thoughts coming.

2 Likes

Thanks! My goal was to inspire all the super smart people on here. I feel like this is potentially a dramatic shift in RAG but I was chastised the last time I used the words “State of the Art” so I’ll let the community decide for itself :slight_smile:

My goal, like yours, is to get the model to answer complex questions across multiple documents. The heat map idea lets you find the most relevant text spans within a given document, and I think what you can do is just take the top section from the top 5 documents and ask the model to draw conclusions across those sections. You can then ask the model to generate a final answer based on those conclusions and there you go… you're reasoning over multiple documents.

Lots of room for improvements.

2 Likes

:flushed::face_with_open_eyes_and_hand_over_mouth:

I think that was me… sorry.

I think you’re absolutely right here.

Doing lots of out-of-channel work with the model, embeddings, context, and even the prompt—basically scratch-work—to get everything buttoned up nicely before sending it off to the model to generate an actual response will almost certainly lead to dramatically better results.

One of the other ideas I keep kicking around concerning embeddings and retrieval is using something like a 10:1 synthetic:authentic embedding ratio.

My original idea was that since so much of the embedding cosine similarity score is wrapped up in the structure of the text, if you were to take every text snippet you're embedding and have a model re-write it using a variety of different guidelines, then embed those as well, you would have essentially increased the footprint of each document you want to retrieve.
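As a rough sketch of what that could look like (the rewrite prompt and the `complete`/`embed` parameters are placeholders, not any particular library's API):

```typescript
interface IndexedVector {
    vector: number[];
    sourceDocumentId: string;   // synthetic rewrites still point at the original
    synthetic: boolean;
}

// Embed the authentic snippet plus N model rewrites of it, all linked back to
// the same source document so a hit on any variant retrieves the original.
async function embedWithRewrites(
    snippet: string,
    documentId: string,
    complete: (prompt: string) => Promise<string>,
    embed: (text: string) => Promise<number[]>,
    rewriteCount = 10
): Promise<IndexedVector[]> {
    const out: IndexedVector[] = [
        { vector: await embed(snippet), sourceDocumentId: documentId, synthetic: false },
    ];
    for (let i = 0; i < rewriteCount; i++) {
        const rewrite = await complete(
            `Rewrite the following text with a different structure and style, keeping the meaning:\n\n${snippet}`
        );
        out.push({ vector: await embed(rewrite), sourceDocumentId: documentId, synthetic: true });
    }
    return out;
}
```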

But now… you’ve got me thinking on a slightly different track…

If you were to take all of your embeddings, cluster them, then generate synthetic documents based on the clusters and embed those, you’d basically be pre-processing the work of combining documents.

This would be hugely inefficient as you’d never use the vast majority of those and they would scale exponentially, but I can imagine some potentially narrow applications of it.

Indeed!

1 Like

So let’s talk improvements….

If you know the structure of the document (where the headings/sub-headings start) you can use the heat map to identify the top headings/sub-headings of the doc to present to the model.

This is how humans tackle questions. We look at the table of contents of a doc and focus on reading what we think are the most relevant sections of the document. This is what we want the model to do, but it needs to see those sections in their entirety. The model doesn't need to see the whole document, it just needs to see complete spans of it.
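A rough sketch of that heading-level heat map (the structures here are illustrative, not the renderer's actual types):

```typescript
interface Heading {
    title: string;
    startPos: number;   // offset where the heading's section begins
    endPos: number;     // offset where the next heading starts
}

interface ChunkHit {
    startPos: number;
    endPos: number;
    score: number;
}

// Score each heading's span by the chunk hits that overlap it and keep the
// top few headings to present to the model in their entirety.
function topHeadings(headings: Heading[], hits: ChunkHit[], count: number): Heading[] {
    return headings
        .map((h) => ({
            heading: h,
            heat: hits
                .filter((c) => c.startPos < h.endPos && c.endPos > h.startPos)
                .reduce((sum, c) => sum + c.score, 0),
        }))
        .sort((a, b) => b.heat - a.heat)
        .slice(0, count)
        .map((s) => s.heading);
}
```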

When it comes to reasoning over multiple documents, the model needs to first understand which documents are most likely worth reading. Unfortunately, this is where semantic search fails us.

Search is classically defined along two dimensions: Precision vs Recall. Recall is a measure of how good the search engine is at returning results that likely contain the user's answer, and Precision is a measure of how good it is at getting the order of the results correct.

Semantic search is, unfortunately, really good at recall but not so good at precision. What that means is that semantic search is likely to return the most relevant chunks for a query, but don't trust the order of the chunks. Even just looking at my screenshot of the chunk results for a query, you can see that the chunks on the whole are great but the order is less than ideal.

I think there is a fix for this… you need to do a secondary re-ranking pass where you use standard TF-IDF ranking (keyword search) to re-rank the results. I hope to explore adding that to Vectra as I feel like you could build the TF-IDF structures needed on the fly. You normally need a word breaker and a stemmer, but I think you can avoid both by just doing TF-IDF over the tokens of the results. Seems promising.
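Something like this, as a quick sketch (lowercased words stand in for the token-based approach, and this isn't Vectra code):

```typescript
interface SearchResult {
    text: string;
    score: number;
}

// Re-rank semantic search results with TF-IDF computed over the result set
// itself, so the structures can be built on the fly at query time.
function tfidfRerank(query: string, results: SearchResult[]): SearchResult[] {
    const tokenize = (s: string) => s.toLowerCase().match(/[a-z0-9]+/g) ?? [];
    const docs = results.map((r) => tokenize(r.text));
    const n = docs.length;

    // Document frequency of each token across the result set.
    const df = new Map<string, number>();
    for (const doc of docs) {
        for (const token of new Set(doc)) {
            df.set(token, (df.get(token) ?? 0) + 1);
        }
    }

    // Score each result against the query terms and sort by that score.
    const queryTokens = tokenize(query);
    return docs
        .map((doc, i) => {
            const tf = new Map<string, number>();
            for (const token of doc) tf.set(token, (tf.get(token) ?? 0) + 1);
            let score = 0;
            for (const token of queryTokens) {
                const termFreq = (tf.get(token) ?? 0) / (doc.length || 1);
                const idf = Math.log(1 + n / (1 + (df.get(token) ?? 0)));
                score += termFreq * idf;
            }
            return { result: results[i], score };
        })
        .sort((a, b) => b.score - a.score)
        .map((s) => s.result);
}
```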

2 Likes

That's not too different from what I'm suggesting: that you present the model with the most interesting span of text for the doc and ask it to draw some initial conclusions. You're basically asking the model to summarize the text it's seen in the context of the question it's being asked.

2 Likes

I think it really depends on the questions you’re asking the model. I care about accuracy first and foremost so to me, presenting the model accurate spans of text from the source documents is paramount.

1 Like

The reality is that you cannot present the entirety of a long document to the model, so it's really about what you show the model. As humans we rarely read whole documents, so my goal is to mimic the skimming that we do. We will read a chapter or the text under a heading and use that to build towards a final conclusion. How do we get the model to do the same?

Of course! And this could also accomplish that. If you were to get a hit on a synthetic document, you’d (presumably) have the ID of original source document linked in the vector database, this would just give you more opportunities to find that relevant document.

Regardless, there are so many avenues to improve retrievals I think we’ll see this landscape change rapidly over the next year or so.

1 Like

Great work on the algorithm @stevenic. This will provide a much more coherent context to models.

I really like @curt.kennedy's idea to augment the KB with a cache.

I noticed that embeddings tend to lose performance with larger chunks of text; however, reducing the chunk size would improve embedding performance but increase retrieval time because of the increased number of chunks. There has to be an optimal token length that can be used.

2 Likes

I see what you’re saying… sorry. I honestly wasn’t sure you were in support of my idea at first lol

I feel like this is potentially RAG 2.0 :slight_smile:

1 Like

Let’s use this thread to work out RAG 2.0 as a group. I would love it if a goal was to enable reasoning across not only long documents but multiple documents.

3 Likes

Vectorizing collections of documents somehow maybe? As in write a description to tie them together…

Turtles all the way down…

The tricky bit with multiple documents is twofold… 1) how do you know which docs to reason over? As I suggested, semantic search is good at recall but doesn't return documents in the ideal order. A secondary TF-IDF re-ranking pass could fix that. 2) how many docs and sections per doc do you reason over? This is largely an exercise in prompting and getting the model to conclude when it's done. A third issue is that the model often sucks at decisions like this.

Right, what I (poorly) suggested was a way to "weight" the documents… like metadata about each one… maybe a "priority" ranking…

I’ve admittedly not done a lot with embedding (yet), but it’s on my horizon for my next project…

ETA: Maybe the doc is “more important” if it has more than one vector hit inside?

Yeah, I was thinking something similar. @curt.kennedy has spent a lot of time with cosine similarity rankings and knows that the scores aren't very well distributed. There has to be a better ranking algorithm out there.

How is it that every score is 0.8xxx?

I have been developing my own hybrid retrieval system using both embeddings and keywords. The keyword algorithm is something I developed called "MIX", which stands for Maximum Information Cross-Correlation. I use basic information theory on the rarity of each word within your own corpus (your set of documents, or embedding chunks, for example) and combine this with the embedding vectors to fuse a single ranking using the reciprocal rank fusion algorithm.

I have the MIX part done, and hosted as a serverless API in AWS. If I get a few hours, I will finish the embedding leg (which I’ve already done before) and the hybridization ranking fusing both.

The good thing about keyword searches here is that I could define the centroid of the most information-bearing keywords in the document that correlate to the incoming user query, then form a unique chunk around this centroid and verify meaning with embeddings.

Note: each word has a different information "magnitude", and this is taken into consideration in the correlation, along with the local frequency of the word (on both the incoming user side and the document corpus side) … frequency increases the correlation, but only logarithmically, so that "unique" keywords dominate the correlation.

The offsets of the keywords are random and "float" around in the document, whereas embeddings are chunked in advance, largely arbitrarily and without consideration of meaning, since you find meaning after you chunk and embed.

So this “keyword led” search followed up with expanding embeddings “grown” from this centroid, or collection, might be the way to go.

The only problem with keywords is that sometimes they are not present in the user query. Which is why I hybridize with embeddings, since no matter what there is always a closest embedding, and hence a chunk to examine and feed the LLM.

PS. One thing I forgot to mention was the notion of "keyword seeding". A good example is when a consumer is asking questions about a topic they have no knowledge of, for example, insurance policies. In this case, where your corpus is largely technical/legal/etc., what you want to do is take the incoming user query and generate additional keywords related to that query to augment it. You can use embeddings or a classifier for this. Then the seeded keywords that are unfamiliar to the user are used in the query as well, which does two things: (1) it translates common questions into your technical space and (2) it increases the search relevance … hence improving the LLM response quality.
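For anyone unfamiliar, here's a minimal sketch of the reciprocal rank fusion step over a keyword ranking and an embedding ranking (the MIX scoring itself isn't shown here; k = 60 is just the constant commonly used with RRF):

```typescript
// Reciprocal rank fusion: each ranking contributes 1 / (k + rank) for every
// id it contains, and the combined scores determine the fused order.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
    const scores = new Map<string, number>();
    for (const ranking of rankings) {
        ranking.forEach((id, rank) => {
            scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
        });
    }
    return [...scores.entries()]
        .sort((a, b) => b[1] - a[1])
        .map(([id]) => id);
}

// Example: fuse a keyword-led ranking with an embedding ranking.
const fused = reciprocalRankFusion([
    ["doc3", "doc1", "doc7"],   // keyword ranking
    ["doc1", "doc7", "doc3"],   // embedding ranking
]);
console.log(fused);   // ids ordered by combined reciprocal-rank score
```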

3 Likes

Something I have wanted to dig into is taking a mixed approach using both lexical and semantic searches. Are you open sourcing your work? I'd love to look at it.

Your PS makes me think of phind, with them rewriting the query. This is an approach I had been taking, but I'm going to see how generating additional keywords compares.

And reciprocal rank fusion is, at the very least, something I need to deep dive into. Thanks for sharing your workflow, this is great stuff.

2 Likes