Optimal way to chunk a Word document for RAG (semantic chunking giving bad results)

I have a Word document that is basically a self-guide manual: each section has a heading, with the procedure to perform the operation below it.

Now the problem is that I've tried lots of chunking methods, even semantic chunking, but the heading gets attached to a different chunk and the retrieval system goes crazy. What's an optimal way to chunk so that the heading + context is retained?


CLAREDI

Context Length Aware Ranked Elided Document Injection


In this context-aware chunking scheme I propose for semantic database knowledge retrieval, we reassemble a document into a small version focused on just the relevant information (illustrative code sketches for these steps follow the list).

  1. Splitting: The document is chunked based on section identification, then sub-chunked on token count if further division is necessary to fit pieces into an AI context length.

  2. Enrichment: A summary and section-navigation metadata are included in each chunk's text. This yields similarity results that favor common documentation sources when searching across a rich variety of possible knowledge. Additional piecing-index information is added as out-of-band metadata.

  3. Embedding for semantics: this already provides quality, but the embeddings can be further infused with synthetic AI-written example questions that target the content. Because AI language generation is far more expensive than embedding, this is certainly optional, but it is essentially HyDE prepared in advance rather than slowing an on-demand search by rewriting user inputs.

  4. Top-k token-threshold search: we run an exhaustive search over the contextual information, given both a size target and a relevance cutoff. Without relevant matches, nothing is returned; when much documentation is relevant, you get up to the maximum tokens you set.

  5. Document reconstruction: This is the key component: we rewrite the chunks back into a summarized headline document, giving the AI the appearance of an article that clearly has sections removed (elided), built out of the retrieval chunks, with indexing included. If there is token budget remaining, we can also give the surrounding chunks a weighted boost to see if they should be included.

  6. Injection: the AI gets its “Document snippets relevant to the most recent input”. All of this happens with just one additional AI call: embedding the user input and its lead-up.

  7. Read more? If the AI is on the right track and just needs to read into prior or following chunks itself, a function can allow it to place more into the assembled elided document - we don’t give a tool return, we give a better document.
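Here is a minimal sketch of steps 1 and 2 in Python, assuming a .docx whose sections are marked with the built-in “Heading” paragraph styles (python-docx exposes these via `paragraph.style.name`); the token heuristic and field names are illustrative:

```python
# Steps 1-2: section-aware splitting + enrichment (sketch).
# Requires: pip install python-docx
from docx import Document

MAX_TOKENS = 400  # illustrative sub-chunk budget


def rough_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: ~1.3 tokens per word.
    return int(len(text.split()) * 1.3)


def split_docx(path: str) -> list[dict]:
    """Chunk by heading style, then sub-chunk long sections by token count."""
    doc = Document(path)
    sections, heading, body = [], "Untitled", []
    for p in doc.paragraphs:
        if p.style.name.startswith("Heading"):
            if body:
                sections.append((heading, " ".join(body)))
            heading, body = p.text.strip(), []
        elif p.text.strip():
            body.append(p.text.strip())
    if body:
        sections.append((heading, " ".join(body)))

    chunks = []
    for sec_idx, (title, text) in enumerate(sections):
        piece, pieces = [], []
        for word in text.split():
            piece.append(word)
            if rough_tokens(" ".join(piece)) >= MAX_TOKENS:
                pieces.append(" ".join(piece))
                piece = []
        if piece:
            pieces.append(" ".join(piece))
        for part_idx, part in enumerate(pieces):
            chunks.append({
                # Enrichment: the heading breadcrumb travels inside the text
                # that gets embedded, so heading + context stay together.
                "text": f"{title} (part {part_idx + 1}/{len(pieces)})\n{part}",
                # Piecing index kept out-of-band for later reconstruction.
                "meta": {"section": sec_idx, "part": part_idx,
                         "of": len(pieces), "title": title},
            })
    return chunks
```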
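For step 3, a hedged sketch of preparing the HyDE-style synthetic questions at indexing time; the model names are assumptions, and any inexpensive chat model plus any embedding model would do:

```python
# Step 3: infuse embeddings with synthetic questions (sketch).
# Requires: pip install openai
from openai import OpenAI

client = OpenAI()


def embed_with_questions(chunk_text: str) -> list[float]:
    # Ask a chat model for a few questions this chunk could answer.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any cheap chat model works
        messages=[{
            "role": "user",
            "content": "Write three short questions this passage answers:\n\n"
                       + chunk_text,
        }],
    )
    questions = resp.choices[0].message.content
    # Embed text + questions together so question-shaped user queries
    # land near the chunk in vector space.
    emb = client.embeddings.create(
        model="text-embedding-3-small",  # assumption
        input=chunk_text + "\n\nExample questions:\n" + questions,
    )
    return emb.data[0].embedding
```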
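Step 4 can be as simple as a brute-force cosine ranking with both a relevance floor and a token ceiling; the thresholds below are illustrative:

```python
# Step 4: top-k search bounded by a relevance cutoff and a token budget.
import numpy as np


def budgeted_top_k(query_vec, chunks, vectors,
                   min_score=0.30, token_budget=3000):
    q = np.asarray(query_vec, dtype=float)
    q /= np.linalg.norm(q)
    m = np.asarray(vectors, dtype=float)
    scores = (m / np.linalg.norm(m, axis=1, keepdims=True)) @ q
    selected, used = [], 0
    for i in np.argsort(scores)[::-1]:
        if scores[i] < min_score:
            break  # nothing irrelevant gets in; zero results is possible
        cost = int(len(chunks[i]["text"].split()) * 1.3)  # rough tokens
        if used + cost > token_budget:
            continue  # this hit is too big; smaller ones may still fit
        selected.append(chunks[i])
        used += cost
    return selected
```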
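Steps 5 and 6, reconstructing the hits into one elided document for injection; this assumes the chunk dictionaries produced by the splitting sketch above:

```python
# Steps 5-6: stitch retrieved chunks back into one "article with sections
# removed", in original document order, with visible elision markers.
def rebuild_elided_document(selected: list[dict]) -> str:
    ordered = sorted(selected, key=lambda c: (c["meta"]["section"],
                                              c["meta"]["part"]))
    lines = ["Document snippets relevant to the most recent input:"]
    last = None
    for c in ordered:
        key = (c["meta"]["section"], c["meta"]["part"])
        # Approximate continuity test: mark a gap whenever this chunk is
        # not the immediate successor of the previous one.
        if last is not None and key != (last[0], last[1] + 1):
            lines.append("[... sections elided ...]")
        lines.append(c["text"])  # breadcrumb heading is already inside
        last = key
    return "\n\n".join(lines)
```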
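And step 7 as a function the model can call; the tool name and schema are my illustration, written in OpenAI function-calling format, and the handler grows the same elided document rather than returning a raw tool result:

```python
# Step 7: a "read more" tool that pulls neighboring chunks into the
# assembled document (sketch).
READ_MORE_TOOL = {
    "type": "function",
    "function": {
        "name": "read_more",
        "description": "Expand the elided document with the chunk before "
                       "or after a cited (section, part) location.",
        "parameters": {
            "type": "object",
            "properties": {
                "section": {"type": "integer"},
                "part": {"type": "integer"},
                "direction": {"type": "string",
                              "enum": ["previous", "next"]},
            },
            "required": ["section", "part", "direction"],
        },
    },
}


def handle_read_more(args, all_chunks, selected):
    """Add the requested neighbor chunk, then rebuild the document."""
    step = -1 if args["direction"] == "previous" else 1
    want = (args["section"], args["part"] + step)
    for c in all_chunks:
        if (c["meta"]["section"], c["meta"]["part"]) == want:
            if c not in selected:
                selected.append(c)
    return rebuild_elided_document(selected)  # from the sketch above
```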

This structured RAG method meets the goal of retaining heading & context, providing high-quality documentation suitable for AI comprehension.


What do you mean by this? I was just working on semantically chunking a book based upon the TOC:

So, the semantic chunks will have titles like:

Rosenzweig | Star of Redemption | I. The Elements | Introduction: On the Possibility of the Cognition of the All

Rosenzweig | Star of Redemption | I. The Elements | Book 1: God and His Being or Metaphysics

Rosenzweig | Star of Redemption | I. The Elements | Book 2: The World and Its Meaning or Metalogic

Or, something like that. The summary metadata will contain the full book and section titles. Each sub-chunk will have the same title, so any document returned by the query carries the title of the book, the author, and the section and subsection from which it is taken.
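For concreteness, one sub-chunk record under this titling scheme might look like the following; the field names are my assumption, not a fixed schema:

```python
chunk = {
    "title": "Rosenzweig | Star of Redemption | I. The Elements | "
             "Book 1: God and His Being or Metaphysics",
    "summary_metadata": {
        "author": "Rosenzweig",
        "book": "Star of Redemption",
        "section": "I. The Elements",
        "subsection": "Book 1: God and His Being or Metaphysics",
    },
    "text": "...sub-chunk body...",
}
```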

I think this is an excellent approach: Optimal way to chunk word document for RAG(semantic chunking giving bad results) - #2 by _j. I especially like suggestions 1-3. As for 7, what we now do is provide a link to the page within the full PDF. This allows a user to read forward and backward from the context document section, e.g.:

https://s3.us-west-2.amazonaws.com/booksai.org/libraries/labor/SAG-AFTRA_2023_MOA_OCR.pdf#page=5
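Constructing such a link is just a matter of appending a `#page=N` fragment, which most browser PDF viewers honor; a minimal sketch:

```python
def pdf_page_link(base_url: str, page: int) -> str:
    # Deep-link into a PDF at a given page via the #page= fragment.
    return f"{base_url}#page={page}"
```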

But I am curious as to what issue you're facing.


Is there any code for CLAREDI that I can refer to? I was trying to replicate the method explained under it, and I am stuck on how to split the PDFs based on sections while also retrieving the main headings.
