So let’s talk improvements….
If you know the structure of the document (where the headings and sub-headings start), you can use the heat map to identify the top headings/sub-headings of the doc and present those sections to the model.
This is how humans tackle questions. We look at the table of contents of a doc and focus on reading what we think are the most relevant sections. That's what we want the model to do, but it needs to see those sections in their entirety. The model doesn't need to see the whole document, just complete spans of it.
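To make that concrete, here's a minimal sketch of the idea. The heat map structure here is made up for illustration: just (section heading, chunk similarity) pairs. The point is that you aggregate chunk scores per section and then hand the model whole sections, not scattered chunks:

```python
from collections import defaultdict

def top_sections(chunk_scores, top_n=2):
    """Aggregate chunk similarity scores by the section each chunk
    belongs to, then return the highest-scoring sections whole."""
    totals = defaultdict(float)
    for section, score in chunk_scores:
        totals[section] += score
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [section for section, _ in ranked[:top_n]]

# Hypothetical heat map: (section heading, chunk similarity) pairs.
heat_map = [
    ("Introduction", 0.21),
    ("Setup", 0.35),
    ("Setup", 0.40),
    ("Troubleshooting", 0.62),
    ("Troubleshooting", 0.58),
]
print(top_sections(heat_map))  # ['Troubleshooting', 'Setup']
```

Summing the scores is just one choice; taking the max chunk score per section would work too and is less biased toward long sections.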
When it comes to reasoning over multiple documents, the model needs to first understand which documents are most likely worth reading. Unfortunately, this is where semantic search fails us.
Search is classically measured along two dimensions: Precision vs Recall. Recall is a measure of how many of the results that likely contain the user's answer the engine actually returns, and Precision is a measure of how many of the returned results are actually relevant, which in practice shows up as how trustworthy the ordering of the results is.
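A toy example makes the two measures concrete (the chunk IDs here are invented):

```python
def precision_recall(returned, relevant):
    """Precision: fraction of returned results that are relevant.
    Recall: fraction of relevant results that were returned."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    return len(hits) / len(returned), len(hits) / len(relevant)

# The engine returns 4 chunks, 3 of which are relevant, out of
# 6 relevant chunks in the corpus.
p, r = precision_recall(
    ["a", "b", "c", "d"],
    ["a", "b", "c", "e", "f", "g"],
)
print(p, r)  # 0.75 0.5
```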
Semantic search is, unfortunately, really good at recall but not so good at precision. What that means is that semantic search is likely to return the most relevant chunks for a query, but you can't trust the order of those chunks. Even just looking at my screenshot of the chunk results for a query, you can see that the chunks as a whole are great but the order is less than ideal.
I think there is a fix for this… you need to do a secondary re-ranking pass where you use standard TF-IDF ranking (keyword search) to re-rank the results. I hope to explore adding that to Vectra, as I feel like you could build the TF-IDF structures needed on the fly. You normally need a word breaker and a stemmer, but I think you can avoid both by just doing TF-IDF over the tokens of the results. Seems promising.
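Here's a rough sketch of what that on-the-fly re-ranking pass could look like. To be clear, this is my illustration, not anything Vectra ships today: the regex tokenizer stands in for a real word breaker, there's no stemmer, and the TF-IDF statistics are built over nothing but the chunks the semantic search already returned.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Crude stand-in for a word breaker: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_rerank(query, chunks):
    """Re-rank semantic-search results by TF-IDF score against the
    query, with document frequencies computed over just the returned
    chunks (each chunk is treated as one document)."""
    docs = [Counter(tokenize(chunk)) for chunk in chunks]
    n = len(docs)
    df = Counter()  # how many chunks each term appears in
    for doc in docs:
        df.update(doc.keys())
    q_tokens = tokenize(query)

    def score(doc):
        # Smoothed IDF; Counter returns 0 for missing terms.
        return sum(doc[t] * math.log((n + 1) / (df[t] + 1)) for t in q_tokens)

    ranked = sorted(zip(chunks, docs), key=lambda cd: score(cd[1]), reverse=True)
    return [chunk for chunk, _ in ranked]

chunks = [
    "general background on indexing",
    "vectra builds a local vector index on disk",
    "unrelated notes about testing",
]
print(tfidf_rerank("local vector index", chunks)[0])
# vectra builds a local vector index on disk
```

Because the corpus is only the handful of returned chunks, building these structures per query is cheap, which is what makes doing it on the fly plausible.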