Document Sections: Better rendering of chunks for long documents

The code is very specific to my serverless environment, so I'm not sure how much help it would be.

But the algorithms for MIX and RRF are really simple, so here they are:

For MIX, take the information-theoretic version of IDF from BM25 (with a slight tweak, as noted below):

In the last equation, I just multiply each term in the sum by log(1 + r). For the input information, r is the frequency of qi in the input; for the output information, r is the frequency of the word qi in the specific document. The total information is the sum of the input and output information over the common qi terms, computed as you loop over each document. The qi's are determined by simple set intersection, which you can do quickly in memory. Retrieve n(qi) from the database only for the intersected words, so set intersection acts as the filter before any database queries.
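Here is a minimal Python sketch of that scoring, not my actual serverless code. It assumes the per-term weight from the BM25 derivation is log(N / n(qi)), where N is the total number of documents and n(qi) is the number of documents containing qi. The names `mix_scores` and `information` are made up for illustration, the `doc_freq` dict stands in for the database lookup, and tokenization is left out.

```python
import math
from collections import Counter


def information(counts, common, doc_freq, n_docs):
    """Sum log(1 + r) * log(N / n(q)) over the common terms.

    The log(N / n(q)) weight is an assumption about the BM25 IDF form;
    doc_freq is a hypothetical stand-in for the database of n(q) values.
    """
    return sum(
        math.log(1 + counts[q]) * math.log(n_docs / doc_freq[q])
        for q in common
    )


def mix_scores(query_tokens, docs, doc_freq, n_docs):
    """Score each document by input information plus output information."""
    query_counts = Counter(query_tokens)
    scores = {}
    for doc_id, tokens in docs.items():
        doc_counts = Counter(tokens)
        # Set intersection is the filter: only shared words are looked up.
        common = query_counts.keys() & doc_counts.keys()
        if not common:
            continue
        input_info = information(query_counts, common, doc_freq, n_docs)
        output_info = information(doc_counts, common, doc_freq, n_docs)
        scores[doc_id] = input_info + output_info
    return scores
```

Documents that share no words with the input never touch `doc_freq`, which is the point of doing the intersection first.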

So in theory, MIX is a sparse vector with 50k-100k-ish dimensions (your vocabulary size), but set intersection is how you efficiently handle the sparse correlation.

As for RRF, here is all there is to it:

Just use k = 0 for one-based indexing, or k = 1 for zero-based indexing.
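As a rough sketch, assuming the usual reciprocal rank fusion form where each ranker contributes 1 / (k + rank) for every item it returns, the fusion step could look like this in Python. The function name `rrf` and the list-of-lists input format are my own choices for illustration.

```python
from collections import defaultdict


def rrf(rankings, k=0):
    """Fuse several ranked lists of ids (best first) into one ranking.

    Ranks here are one-based, so k = 0 gives 1 / rank per list.
    Returns (id, score) pairs sorted from highest to lowest score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

For example, `rrf([keyword_hits, embedding_hits])` would merge a keyword ranking and an embedding ranking, where both argument names are placeholders for your own result lists.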

I think the whole project took 200 lines of code for me to write.

When generating the in-memory data, I compute the ratio R/N for each word W, where R is the number of documents W appears in and N is the total number of documents. Then I threshold this ratio, which creates an automatic stop-word filter that adapts to your vocabulary.
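A small Python sketch of that filter follows. It assumes words whose R/N exceeds the cutoff are the ones dropped; the name `build_vocabulary` and the default cutoff of 0.5 are placeholders, not the values I actually use.

```python
from collections import Counter


def build_vocabulary(docs, max_doc_ratio=0.5):
    """Return {word: R} for words that pass the R/N stop-word cutoff.

    docs maps a document id to its list of tokens. max_doc_ratio is a
    hypothetical threshold; tune it for your own corpus.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        # Count each word once per document, so doc_freq[w] is R.
        doc_freq.update(set(tokens))
    return {
        word: r for word, r in doc_freq.items() if r / n_docs <= max_doc_ratio
    }
```

Because the cutoff is a ratio rather than a fixed word list, whatever is common in your corpus gets filtered, even if it would not be a stop word in general English.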

So the ideas of the algorithms are all out there in the open. The fun is implementation. I spent 10 hours on implementation and 10 minutes on the algorithm. :astonished: That’s how this stuff usually goes. :rofl:

I think the cool part will be trying to “grow” embeddings from keyword-dense sections or high-correlation sub-embedding sections of chunks of documents. This is one of the key requirements that @stevenic is addressing in this thread – which is continuous, coherent chunks, not “scatterbrained” non-coherent chunks. Also, you should probably spin up multiple RAG answers in parallel (from these coherent chunks) and then reconcile them into a single cohesive answer.

PS. For those not familiar with embeddings, you just take the dot-product (in-memory) and look up the highest-correlation content in your database by its hash. It's a very similar, but complementary, workflow to the sparse (keyword) search methods.
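To make the embedding side concrete, here is a bare-bones Python sketch. It assumes the in-memory index is a dict from content hash to embedding vector and that the vectors are already normalized to unit length, so the dot-product acts as cosine similarity. The names `content_hash`, `dot`, and `best_match`, and the choice of SHA-256, are illustrative assumptions.

```python
import hashlib


def content_hash(text):
    """Hash used as the database key for a chunk (SHA-256 is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def dot(a, b):
    """Plain dot-product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def best_match(query_vec, index):
    """Return the hash whose stored vector has the highest dot-product.

    index maps content hash -> embedding vector and lives in memory.
    """
    return max(index, key=lambda h: dot(query_vec, index[h]))
```

The returned hash is then the key you use to pull the actual text out of your database, the same way the keyword side only hits the database after the in-memory filtering is done.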
