Source document chunk identification and highlighting for RAG use case


Hi

I’m building an equity research assistant that answers user queries on public company filings using a RAG approach. I want to know about the best approaches for identifying and highlighting the chunks of the source documents from which the LLM derived its answer.

Here are the steps I currently follow for my app

  1. Data chunking - Process each PDF document, iterating over one page at a time and extracting the page content:
  • if the page content is less than 512 tokens, create an embedding for the whole page,
  • else split the content into equal chunks,
  • and store the text, embedding, page, and chunk number for each chunk (a minimal sketch of this step follows the list).
  2. Querying - When the user asks a question, perform a similarity search in the vector DB, get the top 10-20 chunks, and send them to the LLM.
  3. Displaying the response - I stream the LLM response to the front-end and, at the end of the stream, return the details of all source document chunks found in the querying stage. (Currently using the Vercel AI SDK for this.)
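For reference, a minimal sketch of step 1, assuming pypdf for extraction, tiktoken for token counting, and the OpenAI embeddings API; `store_chunk` is a placeholder for whatever vector DB insert is used:

```python
import tiktoken
from pypdf import PdfReader
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("cl100k_base")
MAX_TOKENS = 512

def chunk_pdf(path: str) -> None:
    reader = PdfReader(path)
    for page_no, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        if not text.strip():
            continue
        tokens = enc.encode(text)
        # Split the page into roughly equal slices of at most 512 tokens.
        n_chunks = -(-len(tokens) // MAX_TOKENS)      # ceiling division
        size = -(-len(tokens) // n_chunks)
        for chunk_no in range(n_chunks):
            piece = enc.decode(tokens[chunk_no * size:(chunk_no + 1) * size])
            emb = client.embeddings.create(
                model="text-embedding-3-small", input=piece
            ).data[0].embedding
            store_chunk(text=piece, embedding=emb,
                        page=page_no, chunk_number=chunk_no)  # placeholder
```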

A couple of improvements I want to make:

  1. Since the LLM answers the user query from a small subset of the chunks rather than from all of the top 20 chunks I’m passing, what is the best approach to get the details of this subset as part of the LLM response, so that I can show only the chunks relevant to the answer? Can this be achieved?

  2. I want to navigate to the page and highlight the exact chunk text when the user opens the source document. How can this be achieved? How do apps like ChatPDF or ChatDOC highlight these chunks on a PDF?

  3. My chunk size is currently 512 tokens, which means half the page gets highlighted. Is there an approach to highlight only the relevant text within the chunk?

I need your recommendations for the above improvements. Let me know if there’s an existing solution, too.

Hi, @sahas, and welcome to the forum.

Ideally, you would pre-process your text and split it into chunks using semantic chunking (where each chunk more or less represents a single idea, so that the embedding vector does not contain semantic “noise” and your RAG engine gains in precision).

That will introduce another issue: some of the chunks, being short (but extremely precise), will not give your models enough surrounding context to properly answer the question.

This issue can be easily solved by adding extra chunks to your vector database. Those chunks are “containers”: outlines of your sections, built from the smaller chunks. This operation needs to be recursive so that you have containers throughout all of your document’s depth levels.

The good thing is that those extra chunks are summaries of your documents as a whole and of their inner sections. That approach allows your RAG engine to search both the deepest levels (individual paragraphs or sentences) and general items (high-level documents) at the same time.

When the inner content of a section is needed, your app will have to pull the children of that section directly by doc ID and section path (so choose your path naming conventions wisely to be able to do that). A sketch of this recursive container indexing is below.
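A hedged sketch of the recursive “container” idea under assumed names: `summarize` stands in for an LLM summarization call, `index_item` for the vector DB insert, and the `Section` tree is assumed to come from your own parser.

```python
from dataclasses import dataclass, field

@dataclass
class Section:
    title: str
    text: str = ""                        # leaf text; empty for pure containers
    children: list["Section"] = field(default_factory=list)

def index_section(section: Section, doc_id: str, path: str) -> str:
    """Index the leaves first, then a container whose text is the section outline."""
    child_summaries = [
        index_section(child, doc_id, f"{path}.{i}")
        for i, child in enumerate(section.children)
    ]
    if section.text:                      # leaf content gets its own item
        index_item(doc_id=doc_id, path=f"{path}.text",
                   item_type="leaf", text=section.text)              # placeholder
    outline = summarize(section.title, section.text, child_summaries)  # placeholder
    index_item(doc_id=doc_id, path=path,
               item_type="container", text=outline)
    return outline
```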

Then you need to create the object schema to store your data in the RAG engine. This is merely a JSON schema for the data items, containing the fields that can be used to build your prompt or to perform operations in the app logic.

The basic fields would be:

  • id of the item
  • title of the item
  • summary/description of the item
  • text content of the item
  • item outline (the vector of this outline will match the item’s central idea)
  • item document ID (the ID of the document the item belongs to)
  • item path in the document (a sort of an address inside the document)
  • item parent (adds some extra context to differentiate similar items inside the same document and provides the parent section context)
  • item type (optional but useful for processing in-app logic)
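One way to express that schema as code (field names here are illustrative; adapt them to your own store):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RagItem:
    id: str                 # unique ID of the item
    title: str              # title of the item
    summary: str            # short summary/description
    text: str               # full text content
    outline: str            # item outline; its vector matches the item's central idea
    doc_id: str             # ID of the document the item belongs to
    path: str               # address inside the document, e.g. "2.1.3"
    parent: Optional[str]   # parent item ID, for extra context
    item_type: str          # e.g. "document", "section", "paragraph"
```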

When embedding the items, not all fields should be included in the vector. Personally, I think you can easily skip the following:

  • id
  • doc id
  • type
  • path

As those are pretty useless semantically.

Also, do not include the field labels themselves in the vector (if your database gives you that ability, e.g. Weaviate).
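As a small sketch, reusing the `RagItem` dataclass from above, the text that actually gets embedded could be built from the semantically useful fields only, with no field labels:

```python
def embeddable_text(item: RagItem) -> str:
    # Skip id, doc_id, path and type; concatenate the meaningful fields only.
    parts = [item.title, item.summary, item.outline, item.text]
    return "\n".join(p for p in parts if p)
```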

Querying:

Personally, I find the following approach produces the best results for me (my use case is legal doc analysis):

The query should be improved before running it. This may mean adding some additional questions to it and/or examples of what the text might look like in the items you expect to find.
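For instance, a minimal sketch of this pre-retrieval query enrichment; the model choice and prompt wording are placeholders:

```python
from openai import OpenAI

client = OpenAI()

def improve_query(user_query: str) -> str:
    prompt = (
        "Rewrite this question for document search. Add 2-3 clarifying "
        "sub-questions and one short example sentence of how the answer "
        "might be worded in the source document.\n\nQuestion: " + user_query
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # any cheap model will do here
        messages=[{"role": "user", "content": prompt}],
    )
    # Search with the original query plus the enrichment.
    return user_query + "\n" + resp.choices[0].message.content
```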

The found chunks will be more or less in clusters: a group of chunks will have close similarity scores to each other and stand a little apart from the other clusters (look closely at the similarity scores and you’ll spot it easily). So instead of writing crazy code to isolate the clusters and decide how many chunks to keep, I would recommend (again) Weaviate, as it has a query parameter for how many clusters to return instead of how many chunks (you can choose either, but clusters are easier to account for).
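A hedged sketch with the Weaviate Python client v4, where the `auto_limit` parameter (Weaviate’s “autocut”) keeps only the first N groups of results separated by jumps in the similarity score; it assumes a “Chunk” collection with a configured vectorizer and reuses the enriched query from the previous step:

```python
import weaviate

client = weaviate.connect_to_local()
chunks = client.collections.get("Chunk")

response = chunks.query.near_text(
    query=improved_query,   # enriched query from the previous step
    limit=20,               # hard cap on the number of chunks
    auto_limit=2,           # keep only the first 2 score "clusters"
)
for obj in response.objects:
    print(obj.properties["text"])

client.close()
```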

Then, a trick from Serge: if you want to save tokens on high-quality answering models and generally improve your results, pass the found chunks through a “validation filter” - a cheaper model whose only task is to answer yes/no (1/0) to one question: does this context contain the answer to the question directly, or add necessary details to the answer?

This simple trick will trim the context you need to put in front of the final answering model by like 50%, keeping only valuable chunks and improving the quality of the answers.
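A sketch of that yes/no filter; the model name is a placeholder for whatever cheap model you use:

```python
from openai import OpenAI

client = OpenAI()

def chunk_is_useful(question: str, chunk_text: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # cheap filter model
        max_tokens=1,
        messages=[{"role": "user", "content": (
            "Answer strictly 1 or 0. Does this context contain the answer to "
            "the question directly, or add necessary details to the answer?\n\n"
            f"Question: {question}\n\nContext: {chunk_text}"
        )}],
    )
    return resp.choices[0].message.content.strip() == "1"

# Keep only the chunks the filter marked as useful, e.g.:
# context = [c for c in retrieved_chunks if chunk_is_useful(user_query, c)]
```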

Sure, this takes time, so most of the operations should run as asynchronous/parallel requests so that the user does not burst into tears while waiting. If you can run more than one query (standard questions) at once, even better. Currently, in my doc analysis, I run about 70-90 queries (retrieval, chunk filtering, answering - total API calls get close to 300 to answer the questions) in about 5 seconds, and then the answers are used to build the analysis report document. But this will depend on your application workflow.
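A sketch of that fan-out with the async OpenAI client; the prompts list would hold the per-chunk filter calls or the standard questions, all sent at once:

```python
import asyncio
from openai import AsyncOpenAI

aclient = AsyncOpenAI()

async def ask(prompt: str) -> str:
    resp = await aclient.chat.completions.create(
        model="gpt-4o-mini",   # placeholder model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

async def run_all(prompts: list[str]) -> list[str]:
    # Every call is in flight at the same time, so the wall-clock time is
    # roughly that of the slowest single call.
    return await asyncio.gather(*(ask(p) for p in prompts))

# answers = asyncio.run(run_all(prompts))
```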

If the RAG engine part sounds intimidating, you can check this thread (below), where similar processing can be done via an API call to an external service, so that you get your data items ready to go into the RAG engine: Semantic Chunking of Documents for RAG - API Tool Launch

As for highlighting the selected chunk, I had to build my own UI for the app, which uses the item IDs returned by the RAG engine to highlight elements on the screen.
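As an aside (not the author’s setup): if you display the original PDF rather than your own text view, one common way to highlight a chunk is to search the page for its text and add highlight annotations, e.g. with PyMuPDF server-side (client-side equivalents use the PDF.js text layer):

```python
import fitz  # PyMuPDF

def highlight_chunk(pdf_path: str, out_path: str,
                    page_number: int, chunk_text: str) -> None:
    doc = fitz.open(pdf_path)
    page = doc[page_number - 1]          # PyMuPDF pages are 0-indexed
    # A 512-token chunk rarely matches as one string (line breaks, hyphenation),
    # so search sentence by sentence instead.
    for sentence in chunk_text.split(". "):
        sentence = sentence.strip()
        if len(sentence) < 10:
            continue
        for rect in page.search_for(sentence):
            page.add_highlight_annot(rect)
    doc.save(out_path)
    doc.close()
```

This also speaks to the third question above: with shorter (semantic) chunks and sentence-level search, only the matched sentences get highlighted instead of half the page.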

To get the item IDs, once the answering model has finished its job, I pass the answer along with the items used as context to a fine-tuned model that returns the IDs of the items used to form the answer (it sees them in the presented context; the fine-tuning was to teach the model to select the correct chunk and not go wild with its ID output). I do it after the answer is produced to improve the quality of the app (legal work assumes 0% hallucinations), and the selecting model does need the answer along with the context to do the task properly.
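A sketch of that post-answer selection step, using a prompted model in place of the fine-tuned one; the returned IDs are validated against the context so the model cannot invent them:

```python
from openai import OpenAI

client = OpenAI()

def select_source_ids(question: str, answer: str,
                      items: list[tuple[str, str]]) -> list[str]:
    """items is a list of (item_id, item_text) pairs that were used as context."""
    context = "\n\n".join(f"[{item_id}] {text}" for item_id, text in items)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # stand-in for the fine-tuned selector
        messages=[{"role": "user", "content": (
            "Given the question, the answer and the context items below, return "
            "a comma-separated list of the IDs of the items the answer is based "
            "on. Output IDs only.\n\n"
            f"Question: {question}\n\nAnswer: {answer}\n\nContext:\n{context}"
        )}],
    )
    valid = {item_id for item_id, _ in items}
    return [i.strip() for i in resp.choices[0].message.content.split(",")
            if i.strip() in valid]
```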

Hope that helps.

BTW, if you want to see what the simantiks API does in real business apps, send me an example file and I’ll send you the resulting JSON back so you can see if it’s what you are looking for.
