Discussion thread for "Foundational must read GPT/LLM papers"

Smaller chunks also mean more vectors. And each vector takes time to process.

From an information capacity perspective, if I can fit 4,000 tokens into each vector, vs. 500, I have 8x more information per vector. So if I had 100,000 such embeddings (which isn’t a huge number), I have 400,000,000 tokens of information for the LLM to shape! This is equivalent to 3,000 different 400-page books!

This is in contrast to the 50,000,000 tokens you would have with 500-token chunks (375 books). So for the information content to be equal, you would need 800,000 vectors, which is starting to get up there. I’m also not sure of the speedup from correlating your shorter vectors, since they have 1/3rd the dimensions. If the cost is quadratic in dimension (worst case), you get a 9x speedup vs. my 8x more data. So it’s pretty much a wash performance-wise.
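A quick back-of-the-envelope check of the numbers above (the tokens-per-page and book-length figures are my assumptions, chosen to roughly match the counts in the post):

```python
# Back-of-the-envelope math for chunk size vs. vector count.
TOKENS_PER_PAGE = 333                    # assumption: ~333 tokens per page
TOKENS_PER_BOOK = 400 * TOKENS_PER_PAGE  # assumption: ~400-page book

big_chunk, small_chunk = 4_000, 500
n_vectors = 100_000

big_tokens = big_chunk * n_vectors       # 400,000,000 tokens
small_tokens = small_chunk * n_vectors   # 50,000,000 tokens

big_books = big_tokens // TOKENS_PER_BOOK      # ~3,000 books
small_books = small_tokens // TOKENS_PER_BOOK  # ~375 books
parity_vectors = big_tokens // small_chunk     # 800,000 small vectors needed
```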

The other consideration: since I would use RRF, with keywords as the dual stream, I need bigger chunks to reduce quantization in the keyword representations.

But another consideration is that I do NOT want mismatch between embedding chunks and keyword chunks. I want these chunks to be identical.

You could try smaller chunks for embeddings and larger chunks for keywords, but with this mismatch it gets weird when you compare results and try to reconcile which chunk you are going to retrieve. It creates such an algorithmic imbalance that you’d have to get really creative to put these disparate chunk sizes on the same playing field. Because if you don’t have big chunks, your keyword leg will be crippled, and you might as well drop it and go 100% embeddings.
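For reference, Reciprocal Rank Fusion over the two retrieval legs is simple to sketch. This assumes identical chunk IDs across both legs (the whole point above) and uses the commonly used k=60 smoothing constant; the chunk IDs are made up:

```python
from collections import defaultdict

def rrf(ranked_lists, k=60):
    """Reciprocal Rank Fusion: fuse several ranked lists of chunk IDs.
    Each chunk scores sum(1 / (k + rank)) over the lists it appears in."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Embedding leg and keyword (BM25-style) leg over the SAME chunk IDs:
embedding_leg = ["c3", "c1", "c7"]
keyword_leg = ["c1", "c9", "c3"]
fused = rrf([embedding_leg, keyword_leg])
# "c1" ranks first: it scores highly on both legs
```

Because both legs rank the same chunk IDs, fusion is a dictionary merge; with mismatched chunk sizes you would first have to map one leg’s chunks onto the other’s, which is exactly the awkwardness described above.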

Your queries from the user will all be small. But this is where HyDE comes in, especially “HyDRA-HyDE”, where you spin many projections off the initial query and can create some beefy chunks for the correlation engines to reconcile.
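A minimal sketch of the HyDE idea, with the LLM and embedder stubbed out (`generate_hypothetical_docs` and `embed` are hypothetical placeholders for your model of choice): generate several hypothetical answers to the short query, embed each, and average them into one beefier query vector.

```python
import math

def generate_hypothetical_docs(query, n=4):
    """Hypothetical stub: in practice, ask an LLM for n plausible
    answer passages to the query (the HyDE step)."""
    return [f"Hypothetical answer {i} to: {query}" for i in range(n)]

def embed(text):
    """Stub embedder; swap in a real embedding model."""
    return [float(ord(c)) for c in text[:8].ljust(8)]

def hyde_query_vector(query):
    # Embed each hypothetical doc, then average into one query vector.
    vecs = [embed(doc) for doc in generate_hypothetical_docs(query)]
    dim = len(vecs[0])
    mean = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0
    return [x / norm for x in mean]  # scale to a unit vector

qvec = hyde_query_vector("How do I reset my router?")
```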

One thought I had about the “Million little facts” situation you proposed is clustering the facts by semantics (embeddings), creating larger bundles from these correlated facts, and then creating one large embedding vector for each bundle and one keyword document from the same bundle. This produces fewer vectors overall, and the semantic similarity would (hopefully) keep the AI model following the retriever on-message and coherent.

Since I am thinking the LLM acts as a filter, and filters take big things with lots of information and bandwidth, and create smaller things with less bandwidth, I’m still heavily biased towards BIG data in and little data out approach, at least philosophically, based on past experiences and intuition.

PS. In the “Million little facts” situation where you bundle. If you concatenate the text for similar vectors in the bundle, say using the smaller dimensional embedding models, all you need to do is take the average of the vectors as the new vector representation (maybe scale it to unit vector too), since they are semantically related (close spatially), and BAM, now you have a fast vector leg (smaller dimensions) and a rich keyword document. It’s the best of both worlds! :scream_cat:
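A minimal sketch of the bundling trick, under stated assumptions: the greedy cosine-similarity clusterer and the 0.9 threshold are toy stand-ins (in practice you would use a real clustering algorithm and real embeddings). Each bundle keeps a concatenated keyword document plus the unit-normalized mean of its member vectors.

```python
import math

def cos(a, b):
    """Cosine similarity between two vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def bundle(facts, vectors, threshold=0.9):
    """Greedily bundle semantically close facts. Each bundle keeps:
    - one concatenated keyword document (rich keyword leg), and
    - the unit-normalized mean of its member vectors (fast vector leg)."""
    bundles = []
    for fact, vec in zip(facts, vectors):
        for b in bundles:
            if cos(vec, b["centroid"]) >= threshold:
                b["text"] += " " + fact
                b["members"].append(vec)
                # recompute centroid as the running mean of members
                n, dim = len(b["members"]), len(vec)
                b["centroid"] = [sum(v[i] for v in b["members"]) / n
                                 for i in range(dim)]
                break
        else:
            bundles.append({"text": fact, "members": [vec],
                            "centroid": list(vec)})
    # final pass: scale each centroid to a unit vector
    for b in bundles:
        norm = math.sqrt(sum(x * x for x in b["centroid"])) or 1.0
        b["centroid"] = [x / norm for x in b["centroid"]]
    return bundles

# Toy example with made-up 2-D "embeddings":
facts = ["cats purr", "cats meow", "planes fly"]
vectors = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
bundles = bundle(facts, vectors)
# The two cat facts merge into one bundle; "planes fly" stands alone.
```

Averaging works here precisely because the members are close spatially; averaging unrelated vectors would produce a centroid near nothing in particular.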



> Since I am thinking the LLM acts as a filter, and filters take big things with lots of information and bandwidth, and create smaller things with less bandwidth, I’m still heavily biased towards BIG data in and little data out approach, at least philosophically, based on past experiences and intuition.

I’ll have to think about this, it’s very intriguing. It’s the complete opposite of what I’ve seen: big seems to fail as attention drifts very painfully, especially when details really matter (e.g., code generation). Perhaps some type of blending of our approaches is the way to go?

Diverse ideas FTW. Will noodle.

Cool bit of prompt engineering, and in a way apropos.

I’ll post some more noisy papers that I run across in this thread. Anyone should feel free to move / repost them to the other one if they think it’s worth it. I may do so myself if discussion / feedback here warrants it.

In general, the guiding principle, IMHO, should be to make the other thread worthy of Watching for most folks.


So much cool on this Twitter thread. Once you start evaluating models layer by layer, all sorts of fascinating things reveal themselves.

Below is a graph from an ablation study of sorts on a 70B Platypus model. The map@3 metric is an eval metric on a particular task, and the graph shows performance when you only use layers 1-N of the model (the head is still attached, of course). The technique is called “early exit”.
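A toy illustration of the early-exit idea (this is not the actual eval setup, just a sketch of the mechanism): run only the first N layers of a layer stack, then apply the same head to whatever hidden representation you have so far, and sweep N to trace out the curve.

```python
def make_toy_model(n_layers=8):
    """Toy 'transformer': each layer nudges a scalar hidden state;
    the head maps the final hidden state to a score."""
    layers = [lambda h, i=i: h + 1.0 / (i + 1) for i in range(n_layers)]
    head = lambda h: h * 2.0
    return layers, head

def early_exit(layers, head, x, n):
    """Run only the first n layers, then attach the head."""
    h = x
    for layer in layers[:n]:
        h = layer(h)
    return head(h)

layers, head = make_toy_model()
# Score the same input using 1..8 layers: the "early exit" sweep.
curve = [early_exit(layers, head, 0.0, n) for n in range(1, 9)]
```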



The application I am thinking of is Customer Service / Hybrid-RAG (upper right corner).

So here it’s taking a wall of information, and shaping it to a probable answer or action. The fine-tune is to add tone/vocabulary in the output.

Here the attention and precision requirements aren’t as high as code generation. So yeah, maybe code generation is inverted, where you take specific lines or chunks and insert them into whatever active file.


No, I think you’re on to something and I suspect there is an opportunity for blending large and small context approaches if you have the budget for it.

For example, one idea might be to do some sort of review of a smaller precise answer. Sort of like - “Is the answer here consistent with the larger context?”
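One minimal way to sketch that review step (the prompt wording is a placeholder, and the context and answer strings are hypothetical): hand the larger context and the small precise answer to a second LLM pass and ask for a verdict.

```python
def build_review_prompt(large_context, small_answer):
    """Second-pass prompt: ask whether a precise answer produced from a
    small context is consistent with the larger context."""
    return (
        "You are reviewing an answer for consistency.\n\n"
        f"Larger context:\n{large_context}\n\n"
        f"Proposed answer:\n{small_answer}\n\n"
        "Is the answer consistent with the larger context? "
        "Reply CONSISTENT or INCONSISTENT, then explain briefly."
    )

prompt = build_review_prompt("...retrieved documents...", "...draft answer...")
# Send `prompt` to whichever LLM performs the review pass.
```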

1 Like

Edit: got some great papers I’d like to post, but I’m not sure it makes sense for me to just spam by myself here. I’m trying this on Discord again, this time on someone else’s channel. You can follow here: SciPhi. They are doing some very cool things around RAG @curt.kennedy

Will post some more here if some other folks do so, though.

1 Like

No worries mate!
You’re more than welcome to spam us with papers :laughing:

1 Like

Not sure there’s a lot of value there, diverse perspectives ftw. Will post though if others do.

Here I suggest Textbooks Are All You Need, which again focuses on the quality of the training datasets.

Here is another paper about Reinforcement Learning using AI feedback, showing the power of AI evaluation.

1 Like

Not a paper but very cool nonetheless.

This appears to be from some OpenAI presentation. Anyone have more information?

hat tip discord - SciPhi

1 Like

It is an intriguing line of thought. I had sort of expected more would be done along these lines, but Nvidia seems to have captured the market. I think the problem is you need a TSMC-type fab setup to do something more than a POC. Also, algorithm-wise there is so much still to be done and plumbed that you need a flexible substrate like GPUs to leverage, not to mention a research community with access to the hardware.

1 Like

I feel like nobody is talking about low-power neuromorphic solutions.

Look at the human brain … how much power does it take to run our brains?

Now look at LLMs … it takes massive amounts of power, with tons of heat generated, to run them.

So imagine small, interwoven nano-materials, operating at low power, running circles around the current most advanced LLMs.

So here, they are creating a dense physical set of nanowires, and training it!

Very novel, and probably where the future is headed.

Here is GPT-4-Turbo’s (gpt-4-1106-preview) answer:

Neuromorphic computing has the potential to be transformative for several key reasons:

  1. Energy Efficiency:
    Neuromorphic systems are designed to be highly energy efficient, emulating the low-power operation of the human brain. Traditional computing architectures consume a significant amount of power, especially for tasks such as image or signal processing, and machine learning. Neuromorphic chips can perform these tasks using much less energy, which is critical as we move toward more sustainable computing solutions and battery-powered devices.

  2. Speed and Real-Time Processing:
    Because neuromorphic computers process information in a manner similar to how neurons operate in the brain—using spikes of electricity—they can react to changes in data in real-time. This is particularly valuable for applications requiring immediate responses, such as autonomous vehicles, sensor networks, or robotic control systems.

  3. Parallel Processing Capabilities:
    The brain’s ability to perform massively parallel processing allows for the efficient handling of complex and noisy data. Neuromorphic computing seeks to replicate this, potentially providing much faster processing capabilities for tasks that involve pattern recognition, sensory data processing, and decision-making in unstructured environments.

  4. Adaptability and Learning:
    Neuromorphic chips are designed with plasticity in mind, meaning they can learn and adapt through mechanisms similar to synaptic plasticity in biological neurons. Such systems could continue to learn from their environment even after deployment, improving their performance over time and adapting to new situations without requiring explicit reprogramming.

  5. Edge Computing:
    With the proliferation of the Internet of Things (IoT) and edge devices, there is a growing need to process data at the source rather than sending it to the cloud. Neuromorphic systems could enable efficient on-device processing, leading to quicker responses and reduced data transmission costs while maintaining user privacy.

  6. Robustness to Noise and Faults:
    The brain is remarkably robust in the face of component failure or noisy signals. Neuromorphic computing systems could inherit similar robustness, gracefully dealing with hardware failures or data corruption, which would be highly beneficial for critical applications where reliability is paramount.

  7. Potential to Reshape AI and Machine Learning:
    The approach to artificial intelligence might see a paradigm shift with neuromorphic computing. The energy efficiency and continuous learning capabilities could allow for the deployment of AI in a wider array of environments, notably where power consumption or data privacy concerns currently limit the use of cloud-based AI services.

In summary, neuromorphic computing promises to create a new class of machines that can interact with the physical world in a more intelligent and efficient manner. Such technology has the potential to overcome some of the limitations of traditional computing architectures and support the continued growth of intelligent devices and systems, potentially leading to widespread technological innovation and changes in many sectors, including healthcare, transportation, industry, and consumer electronics.


About 12 to 20 W, depending on what you’re doing. That’s really efficient, but I don’t think it’s outside the realm of possibility of what may be achieved in our lifetimes :laughing:

Sure, but the one they built in the paper is actually not at the nanometer scale, it’s micrometers, so it’s fairly simple and easy to replicate :laughing:

1 Like

Whoo hoo! Yes!
Is it fair to say the bottom line is: ‘No, AGI isn’t here yet, but if you are willing to work with it, GPT-4 is capable of far more than naysayers realize’?
Haven’t read the whole thing yet (tl;dr), although it does look like a must-read. The key open problem here, IMHO, is how to get past the context size limitation.
Sure, LLMs are a great aid for solving little short sub-problems, but how do you support a long-term research project? For example: ‘Research lithium battery chemistry, find an open problem, and write a credible PhD thesis proposal on it.’ (Note: proposal, not thesis.)
Key issues are how to structure long-term activity and the large amounts of highly structured, relationship-rich external data uncovered along the way. I don’t think embeddings and/or RAG get you very far. But maybe, again ‘if you are willing to work with it’, we can get further than many suspect.

Finally found what I was looking for: EleutherAI on Discord. They have an active community commenting on arXiv papers. Not very well organized, but it has a lot of chatter. Check it out if you like geeking out over these things.

1 Like

Here is the blog entry for extracting training data and PII from ChatGPT:


So basically the LLM remembers some of the exact training data. This poses unique security challenges if any of the training data contains sensitive data. I don’t know much about “The Pile”, but I gather it was scraped from the internet, and has PII and other sensitive data.

One could say, “No big deal, I get it, the internet is full of this stuff.” But it’s something to think about when releasing a model to the public … the model holds small exact copies of your training data, and they can be extracted. The training data isn’t all compressed, lossy, and scrambled.

This could be a side-effect of overfitting (or local overfitting). But maybe the model just “burns in” a portion of its training data exactly. It’s not known, which makes it interesting.

A few ways to read this.

  1. Hmm, they really are just stochastic parrots!
  2. Wow - huge opportunities for the future of smaller models like Mistral and ORCA-2 (and o1-AI c34?), larger models are still massively overfitting data!
  3. Advances in training will enable larger models like GPT-n to ‘really’ learn deep physics from the 100x volume of video data, we don’t need to scale an order of magnitude beyond gpt4 in model size to get to the next level in model capability.

1 Like

I think this is true … but …

I remember as a math grad student, talking to my professor and head of graduate studies, who admitted to me that he basically memorized and parroted various proofs and concepts as the basic foundation of his research.

I can see this. You build a foundation by mimicking the past. But you “stand on the shoulder of giants”, tweak things, and now you have NSF grant money flooding in for your new breakthroughs.

At what point are humans not parroting things, especially the past and other foundational knowledge?

Well sure. I personally think the stochastic parrot claim is a problem primarily when people prefix it with ‘just a’.
However, I do think there is an argument that the presence of large chunks of memorized training data does (perhaps?) limit recombination, unless that same data also exists in other forms.
On the other hand, I have memorized SO MANY things over my years (‘I before E except after C or when sounded like A as in neighbor and weigh.’, 'Shall I compare thee to a summer’s day?, … etc etc). What does that say about me?


1 Like

I agree! :rofl: :rofl:

I feel like a parrot too sometimes … so saying “just a” is a bit disingenuous.

While also being parrots, I think what sets us humans apart is the massive amount of connections our brains have to other neurons, or clusters of neurons. We seem to be more networked, and not just limited to 16 hidden layers, or whatever.

This networking may explain our further emergence and advantage over current machine learning constructs.

For example, in graduate-level point-set topology, there was a standard problem: give an example where the subspace topology isn’t homeomorphic to the sub-order topology. Because I was preconditioned to lexicographic orderings in my undergrad years, I came up with a clean and clear example, which blew my professors away! They wanted the result published! :rofl:

But I got there through parroting. I just extrapolated a bit. :man_shrugging:

I feel like we are all parrots :parrot: :parrot: :parrot:

1 Like