Document Sections: Better rendering of chunks for long documents

I have been developing my own hybrid retrieval system using both embeddings and keywords. The keyword algorithm is something I developed called “MIX”, which stands for Maximum Information Cross-Correlation: it applies basic information theory to the rarity of each word within your own corpus (your set of documents or embedding chunks, for example), and the resulting keyword scores are fused with the embedding scores into a single ranking using the reciprocal rank fusion algorithm.
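MIX itself isn’t spelled out here, but the fusion step at the end is standard reciprocal rank fusion. A minimal sketch, assuming each leg simply returns a ranked list of document ids (the ids and `k=60` constant below are illustrative, not part of the author’s system):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: merge several ranked lists of doc ids.

    Each ranking is a list of ids, best first. The constant k damps
    the influence of the very top ranks so no single leg dominates.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical output of the two legs:
keyword_leg = ["d3", "d1", "d7"]    # MIX ranking
embedding_leg = ["d1", "d3", "d9"]  # embedding ranking
fused = rrf_fuse([keyword_leg, embedding_leg])
```

Documents that appear high in both legs (here `d1` and `d3`) rise to the top of the fused list, which is the whole point of the hybridization.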

I have the MIX part done and hosted as a serverless API on AWS. If I get a few hours, I will finish the embedding leg (which I’ve built before) and the hybrid ranking that fuses the two.

The good thing about keyword search here is that I can locate the centroid of the most information-bearing keywords in the document, correlated with the incoming user query, then form a unique chunk around that centroid and verify its meaning with embeddings.
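The centroid idea can be sketched as follows. This is my own reading of the description, not the actual MIX code: query-term hits in the document pull a weighted average position toward themselves, with rarer terms (higher IDF) pulling harder, and the chunk is cut around that position:

```python
def keyword_centroid_chunk(doc_tokens, query_terms, idf, window=50):
    """Form a chunk around the information-weighted centroid of
    query-term hits in a tokenized document (illustrative sketch).

    idf maps term -> rarity weight; rarer terms pull the centroid
    harder, so the chunk lands near the most informative matches.
    """
    weighted_pos, total_weight = 0.0, 0.0
    for pos, tok in enumerate(doc_tokens):
        if tok in query_terms:
            w = idf.get(tok, 0.0)
            weighted_pos += w * pos
            total_weight += w
    if total_weight == 0.0:
        return None  # no keyword overlap: fall back to embeddings
    center = int(round(weighted_pos / total_weight))
    lo = max(0, center - window)
    return doc_tokens[lo:center + window]

# A rare domain term buried in common filler pulls the chunk to itself:
tokens = ["policy"] * 100 + ["deductible"] + ["policy"] * 100
chunk = keyword_centroid_chunk(
    tokens, {"deductible", "policy"},
    {"deductible": 5.0, "policy": 0.01})
```

Because “deductible” carries far more weight than the filler term, the centroid lands on it and the returned window contains it.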

Note: each word has a different information “magnitude”, and this is taken into account in the correlation, along with the local frequency of the word (on both the incoming-query side and the document-corpus side). Frequency increases the correlation, but only logarithmically, so that rare, “unique” keywords dominate the correlation.
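One plausible form of such a correlation (this is an assumed formula in the spirit of the description, not the actual MIX scoring) multiplies each shared term’s information weight by log-damped term frequencies on both sides:

```python
import math
from collections import Counter

def mix_style_correlation(query_tokens, doc_tokens, idf):
    """Hypothetical correlation: each term shared between query and
    document contributes its information weight (idf), scaled by
    log-damped term frequencies on both sides, so rare terms dominate
    even heavily repeated common ones."""
    q_tf, d_tf = Counter(query_tokens), Counter(doc_tokens)
    score = 0.0
    for term in q_tf.keys() & d_tf.keys():
        score += (idf.get(term, 0.0)
                  * math.log1p(q_tf[term])   # query-side frequency, damped
                  * math.log1p(d_tf[term]))  # document-side frequency, damped
    return score

idf = {"premium": 4.0, "the": 0.1}
query = ["what", "is", "my", "premium"]
score_a = mix_style_correlation(query, ["premium"] + ["the"] * 20, idf)
score_b = mix_style_correlation(query, ["the"] * 20, idf)
```

A document with one hit on the rare term “premium” outscores one with no informative overlap at all, regardless of how much filler either contains.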

The keyword window has no fixed offset; it “floats” around the document. Embedding chunks, by contrast, are cut in advance at largely arbitrary boundaries, without consideration of meaning, since you only find meaning after you chunk and embed.

So a “keyword-led” search, followed by embeddings “grown” outward from this centroid (or collection of centroids), might be the way to go.

The only problem with keywords is that sometimes they simply aren’t present in the user query. That is why I hybridize with embeddings: no matter what, there is always a closest embedding, and hence always a chunk to examine and feed the LLM.

PS. One thing I forgot to mention is the notion of “keyword seeding”. A good example is a consumer asking questions about a topic they have no knowledge of, say, insurance policies. When your corpus is largely technical/legal, the fix is to take the incoming user query and generate augmenting keywords related to it; you can use embeddings or a classifier for this. The seeded keywords, unfamiliar to the user, are then used in the query as well, which does two things: (1) it translates common questions into your technical space, and (2) it increases search relevance, improving the quality of the LLM response.
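The seeding step can be sketched with a plain lookup table standing in for the embedding model or classifier the author mentions (the table, its terms, and `seed_query` are all hypothetical illustrations, not a real mapping):

```python
# Stand-in for an embedding model or classifier that maps consumer
# phrasing to domain vocabulary (entirely made-up example entries).
SEED_TERMS = {
    "cost": ["premium", "deductible"],
    "coverage": ["policy limit", "exclusion"],
}

def seed_query(query_tokens, seed_table):
    """Augment the user's query with domain keywords so the keyword
    leg can reach technical/legal text the user doesn't know the
    vocabulary for. The original tokens are kept first."""
    seeded = list(query_tokens)
    for tok in query_tokens:
        seeded.extend(seed_table.get(tok, []))
    return seeded

seeded = seed_query(["what", "does", "coverage", "cost"], SEED_TERMS)
```

A consumer question about “cost” now also searches on “premium” and “deductible”, which is exactly the translation into the corpus’s technical space described above.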