So let’s talk improvements….
If you know the structure of the document (where the headings and sub-headings start), you can use the heat map to identify the top headings/sub-headings of the doc and present those sections to the model.
This is how humans tackle questions. We look at the table of contents of a doc and focus on reading what we think are the most relevant sections. That's what we want the model to do, but it needs to see those sections in their entirety. The model doesn't need to see the whole document, just complete spans of it.
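To make that concrete, here's a minimal sketch of the idea. The heat map structure here is made up for illustration: just (section heading, chunk similarity) pairs. The point is that you aggregate chunk scores per section and then hand the model whole sections, not scattered chunks:

```python
from collections import defaultdict

def top_sections(chunk_scores, top_n=2):
    """Aggregate chunk similarity scores by the section each chunk
    belongs to, then return the highest-scoring sections whole."""
    totals = defaultdict(float)
    for section, score in chunk_scores:
        totals[section] += score
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [section for section, _ in ranked[:top_n]]

# Hypothetical heat map: (section heading, chunk similarity) pairs.
heat_map = [
    ("Introduction", 0.21),
    ("Setup", 0.35),
    ("Setup", 0.40),
    ("Troubleshooting", 0.62),
    ("Troubleshooting", 0.58),
]
print(top_sections(heat_map))  # ['Troubleshooting', 'Setup']
```

Summing the scores is just one choice; taking the max chunk score per section would work too and is less biased toward long sections.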
When it comes to reasoning over multiple documents, the model needs to first understand which documents are most likely worth reading. Unfortunately, this is where semantic search fails us.
Search is classically measured along two dimensions: Precision vs Recall. Recall is a measure of how many of the results that likely contain the user's answer the engine actually returns, and Precision is a measure of how many of the returned results are actually relevant, which in practice shows up as how trustworthy the ordering of the results is.
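A toy example makes the two measures concrete (the chunk IDs here are invented):

```python
def precision_recall(returned, relevant):
    """Precision: fraction of returned results that are relevant.
    Recall: fraction of relevant results that were returned."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    return len(hits) / len(returned), len(hits) / len(relevant)

# The engine returns 4 chunks, 3 of which are relevant, out of
# 6 relevant chunks in the corpus.
p, r = precision_recall(
    ["a", "b", "c", "d"],
    ["a", "b", "c", "e", "f", "g"],
)
print(p, r)  # 0.75 0.5
```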
Semantic search is, unfortunately, really good at recall but not so good at precision. What that means is that semantic search is likely to return the most relevant chunks for a query, but you can't trust the order of those chunks. Even just looking at my screenshot of the chunk results for a query, you can see that the chunks as a whole are great but the order is less than ideal.
I think there is a fix for this… you need to do a secondary re-ranking pass where you use standard TF-IDF ranking (keyword search) to re-rank the results. I hope to explore adding that to Vectra, as I feel like you could build the TF-IDF structures needed on the fly. You normally need a word breaker and a stemmer, but I think you can avoid both by just doing TF-IDF over the tokens of the results. Seems promising.
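Here's a rough sketch of what that on-the-fly re-ranking pass could look like. To be clear, this is my illustration, not anything Vectra ships today: the regex tokenizer stands in for a real word breaker, there's no stemmer, and the TF-IDF statistics are built over nothing but the chunks the semantic search already returned.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Crude stand-in for a word breaker: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_rerank(query, chunks):
    """Re-rank semantic-search results by TF-IDF score against the
    query, with document frequencies computed over just the returned
    chunks (each chunk is treated as one document)."""
    docs = [Counter(tokenize(chunk)) for chunk in chunks]
    n = len(docs)
    df = Counter()  # how many chunks each term appears in
    for doc in docs:
        df.update(doc.keys())
    q_tokens = tokenize(query)

    def score(doc):
        # Smoothed IDF; Counter returns 0 for missing terms.
        return sum(doc[t] * math.log((n + 1) / (df[t] + 1)) for t in q_tokens)

    ranked = sorted(zip(chunks, docs), key=lambda cd: score(cd[1]), reverse=True)
    return [chunk for chunk, _ in ranked]

chunks = [
    "general background on indexing",
    "vectra builds a local vector index on disk",
    "unrelated notes about testing",
]
print(tfidf_rerank("local vector index", chunks)[0])
# vectra builds a local vector index on disk
```

Because the corpus is only the handful of returned chunks, building these structures per query is cheap, which is what makes doing it on the fly plausible.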