Source document chunk identification and highlighting for RAG use case


Hi, @sahas, and welcome to the forum.

Ideally, you would pre-process your text and split the chunks using semantic chunking (where each chunk more or less represents a single idea, so that the embedding vector does not contain meaning “noise” and your RAG engine gains in precision).

That introduces another issue: some chunks will be short (but extremely precise), so on their own they won't give your models enough surrounding context to answer the question properly.

This issue can be easily solved by adding extra chunks to your vector database. Those chunks act as “containers”: outlines of your sections built from the smaller chunks. This operation needs to be recursive so that you have containers at every depth level of your document.

The good thing is that those extra chunks are effectively summaries of your document as a whole and of its inner sections. That approach allows your RAG engine to search both the deepest levels (individual paragraphs or sentences) and general items (high-level documents) at the same time.
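To make the container idea concrete, here is a rough sketch of the recursive build, assuming a hypothetical `summarize()` helper backed by whatever model you prefer and a simple section tree (a title plus a list of child sections or paragraph strings):

```python
# Sketch: walk the section tree bottom-up, emitting leaf chunks and
# "container" chunks (summaries/outlines) for every level.
def build_items(section, doc_id, path, items):
    child_summaries = []
    for i, child in enumerate(section["children"]):
        child_path = f"{path}/{i}"
        if isinstance(child, str):  # leaf paragraph
            items.append({"doc_id": doc_id, "path": child_path,
                          "type": "paragraph", "text": child})
            child_summaries.append(child)
        else:                       # nested section -> recurse first
            child_summaries.append(build_items(child, doc_id, child_path, items))
    outline = summarize("\n".join(child_summaries))  # hypothetical LLM call
    items.append({"doc_id": doc_id, "path": path, "type": "container",
                  "title": section.get("title", ""), "summary": outline})
    return outline
```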

Then, when the inner content of a section is needed, your app can pull the children of that section directly by doc ID and section path (so choose your path naming convention wisely to make that possible).
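For example, with paths like `doc-001/3/2` (document / section / paragraph), pulling the direct children of a section is a simple prefix check. A sketch over plain dicts; in Weaviate you would express the same thing as a filter on the `path` property:

```python
def get_children(items, doc_id, section_path):
    """Return the direct children of a section, relying on the doc/3/2 path convention."""
    prefix = section_path + "/"
    return [it for it in items
            if it["doc_id"] == doc_id
            and it["path"].startswith(prefix)
            and "/" not in it["path"][len(prefix):]]  # direct children only, no grandchildren
```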

Then you need to create the object schema to store your data in the RAG engine. This is simply a JSON schema for the data items, containing the fields that can be used to build your prompt or perform operations in the app logic (see the example item right after the field list below).

The basic fields would be:

  • id of the item
  • title of the item
  • summary/description of the item
  • text content of the item
  • item outline (vector of this outline will match the item’s central idea)
  • item document ID (the ID of the document the item belongs to)
  • item path in the document (a sort of an address inside the document)
  • item parent (adds some extra context to differentiate similar items inside the same document and gives parent section context)
  • item type (optional but useful for processing in-app logic)
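To make that concrete, one stored item could look like this (the values are invented; only the field set follows the list above):

```python
example_item = {
    "id": "c3f9a1",                                   # unique item ID
    "title": "Termination for convenience",
    "summary": "Either party may terminate with 30 days' written notice.",
    "text": "Either party may terminate this Agreement by giving 30 days' written notice...",
    "outline": "Termination rights and notice periods.",
    "doc_id": "contract-2024-001",
    "path": "contract-2024-001/3/2",                  # document / section / paragraph
    "parent": "Term and termination",
    "type": "paragraph",
}
```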

When embedding the items, not all fields should be included in the vector. Personally, I think you can easily skip the following (see the small sketch after this list):

  • id
  • doc id
  • type
  • path

As those are pretty useless semantically.
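One simple way to do that, independent of the vector DB you use, is to concatenate only the semantic fields into the text you embed. A sketch:

```python
EMBED_FIELDS = ["title", "summary", "text", "outline", "parent"]  # skip id, doc_id, type, path

def to_embedding_text(item):
    """Build the string that gets vectorized from the semantic fields only."""
    return "\n".join(str(item[f]) for f in EMBED_FIELDS if item.get(f))
```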

Also, do not include the item labels (field names) in the vector, if your database gives you that option (e.g. Weaviate does).
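In Weaviate terms, that roughly translates to property settings like these (a sketch assuming the v4 Python client; check the current client docs for the exact parameter names):

```python
import weaviate.classes.config as wc

# Semantic fields: vectorized, but without their property names ("labels").
# Non-semantic fields: stored and filterable, but skipped during vectorization.
properties = [
    wc.Property(name="title",   data_type=wc.DataType.TEXT, vectorize_property_name=False),
    wc.Property(name="summary", data_type=wc.DataType.TEXT, vectorize_property_name=False),
    wc.Property(name="text",    data_type=wc.DataType.TEXT, vectorize_property_name=False),
    wc.Property(name="outline", data_type=wc.DataType.TEXT, vectorize_property_name=False),
    wc.Property(name="parent",  data_type=wc.DataType.TEXT, vectorize_property_name=False),
    wc.Property(name="doc_id",  data_type=wc.DataType.TEXT, skip_vectorization=True),
    wc.Property(name="path",    data_type=wc.DataType.TEXT, skip_vectorization=True),
    wc.Property(name="type",    data_type=wc.DataType.TEXT, skip_vectorization=True),
]
```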

Querying:

Personally, I find the following approach produces the best results for me (my use case is legal doc analysis):

The query should be improved before running it. This may mean adding some additional sub-questions to it and/or examples of how the text in the target chunks might look.
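A sketch of that query-improvement step with the OpenAI Python client (the model name and prompt are placeholders; use whatever works for your domain):

```python
from openai import OpenAI

client = OpenAI()

def improve_query(question: str) -> str:
    """Expand the user question with sub-questions and an example answer passage
    so its embedding lands closer to the chunks we want to retrieve."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder for a cheap rewriting model
        messages=[
            {"role": "system",
             "content": "Rewrite the question for retrieval: keep the original question, "
                        "add 2-3 sub-questions, and add a short example of how the "
                        "answering passage might read."},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```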

The found chunks will come back more or less in clusters: a group of chunks will have similarity scores close to each other and sit a bit apart from the other clusters (look closely at the similarity scores and you'll spot it easily). So instead of writing crazy code to isolate clusters and decide how many chunks to keep, I would recommend using Weaviate (again), as it has a query parameter for how many clusters to return instead of how many chunks (you can choose either, but clusters are easier to account for).
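In the Weaviate v4 Python client that cluster cut-off is the autocut feature, exposed as the `auto_limit` parameter (a sketch; `collection` is an existing collection handle, `improved_query` comes from the rewriting step above, and parameter names may vary between client versions):

```python
from weaviate.classes.query import MetadataQuery

# Ask for the first 2 "clusters" of results (jumps in similarity)
# instead of a fixed number of chunks.
results = collection.query.near_text(
    query=improved_query,
    auto_limit=2,                                   # autocut: keep 2 result groups
    return_metadata=MetadataQuery(distance=True),
)
for obj in results.objects:
    print(obj.properties["title"], obj.metadata.distance)
```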

Then, a trick from Serge: if you want to save tokens on high-quality answering models and generally improve your results, pass the found chunks through a “validation filter”, a cheaper model whose only task is to answer yes/no (1/0) to one question: does this context contain the answer to the question directly, or add necessary details to it?

This simple trick will trim the context you need to put in front of the final answering model by roughly 50%, keeping only valuable chunks and improving the quality of the answers.
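A rough sketch of such a validation filter (cheap model, single-character yes/no answer):

```python
def is_relevant(client, question: str, chunk_text: str) -> bool:
    """Ask a cheap model whether a retrieved chunk actually helps answer the question."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder for a cheap "filter" model
        messages=[
            {"role": "system",
             "content": "Answer 1 if the context contains the answer to the question or adds "
                        "necessary details to it, otherwise 0. Output a single character."},
            {"role": "user", "content": f"Question: {question}\n\nContext: {chunk_text}"},
        ],
        max_tokens=1,
    )
    return resp.choices[0].message.content.strip() == "1"

kept_chunks = [c for c in retrieved_chunks if is_relevant(client, question, c["text"])]
```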

Sure, this takes time, so most of the operations should run as asynchronous/parallel requests so that the user does not burst into tears while waiting. If you can run more than one query (standard questions) at once, even better. Currently, in my doc analysis, I run about 70-90 queries (retrieval, chunk filtering, answering; the total API calls for that come close to 300) in about 5 seconds to get the questions answered, and the answers are then used to build the analysis report document. But this will depend on your application workflow.
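The parallel part could look roughly like this with the async OpenAI client; `retrieve()`, `is_relevant_async()`, and `answer_with_context()` are placeholders for your own retrieval call, the filter above rewritten as a coroutine, and the final answering call:

```python
import asyncio
from openai import AsyncOpenAI

aclient = AsyncOpenAI()

async def answer_one(question: str) -> str:
    """Retrieval -> parallel chunk filtering -> answering, for one standard question."""
    chunks = retrieve(question)                                    # your RAG query
    flags = await asyncio.gather(*(is_relevant_async(aclient, question, c["text"])
                                   for c in chunks))
    kept = [c for c, ok in zip(chunks, flags) if ok]
    return await answer_with_context(aclient, question, kept)      # final answering model

async def run_all(questions: list[str]) -> list[str]:
    """Run all standard questions concurrently."""
    return await asyncio.gather(*(answer_one(q) for q in questions))
```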

If the RAG engine part sounds intimidating, you can check this thread (below), where similar processing can be done via an API call to an external service, so that you get your data items ready to go into the RAG engine: Semantic Chunking of Documents for RAG - API Tool Launch

As for highlighting the selected chunk, I had to build my own UI for the app to use item IDs returned by RAG to highlight elements on the screen.

To get the item IDs, once the answering model has finished its job, I pass the answer along with the items used as context to a fine-tuned model that returns the IDs of the items used to form the answer (it sees them in the presented context; the fine-tuning was to teach the model to select the correct chunks and not go wild with the ID output). I do it after the answer to improve the quality of the app (legal work requires 0% hallucinations), and the selecting model does need to have the answer along with the context to do the task properly.
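Without a fine-tuned model you can approximate that step with a plain prompt that sees the answer plus the context items labelled with their IDs (a sketch; the fine-tuning mainly teaches the model not to invent IDs, which is why the final guard matters):

```python
def select_source_ids(client, answer: str, items: list[dict]) -> list[str]:
    """Return the IDs of the context items the answer is actually based on."""
    context = "\n\n".join(f"[{it['id']}] {it['text']}" for it in items)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; the author uses a fine-tuned model here
        messages=[
            {"role": "system",
             "content": "Given an answer and context items labelled with IDs in brackets, "
                        "output only the IDs of the items the answer is based on, "
                        "comma-separated. Use only IDs that appear in the context."},
            {"role": "user", "content": f"Answer:\n{answer}\n\nContext:\n{context}"},
        ],
    )
    proposed = [s.strip().strip("[]") for s in resp.choices[0].message.content.split(",")]
    valid = {it["id"] for it in items}
    return [i for i in proposed if i in valid]  # guard against hallucinated IDs
```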

Hope that helps.

BTW, if you want to see what the simantiks API does in real business apps, send me an example of the file, I’ll send you the resulting JSON back to see if that’s something you are looking for.
