I think the first thing to nail down is your chunking strategy. You want your chunks to contain entire thoughts, not thought fragments. Remember, these chunks are retrieved and then presented to the LLM in the prompt, so they need to make sense on their own.
If your chunks are too small, they likely won't contain entire thoughts, but the embedding of a small chunk is very precise. If your chunks are too big, they contain lots of thoughts, and the embedding is less precise because of the varying amount of information packed into the big chunk.
And so this is where all the fun begins. How do you solve this problem? Or do you just pick a chunk size and percent overlap and call it good?
The latter is what most people do, because that is what a lot of these automated chunking tools do.
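To make that concrete, here's a minimal sketch of the fixed-size-plus-overlap approach. The word-based splitting and the specific numbers are just illustrative; real chunking tools usually split on tokens, sentences, or document structure instead of raw words.

```python
def chunk_fixed(text: str, chunk_size: int = 200, overlap_pct: float = 0.15) -> list[str]:
    """Split text into ~chunk_size-word chunks where consecutive chunks overlap by overlap_pct."""
    words = text.split()
    step = max(1, int(chunk_size * (1 - overlap_pct)))  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):  # last window already reached the end of the text
            break
    return chunks
```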
But the rabbit hole just gets deeper. Why not chunk at multiple levels? With some awareness of how the chunks relate to each other, you can formulate a more optimal result: find a cluster of high-scoring smaller chunks that all sit inside a single large chunk, then retrieve that larger chunk. You get the best of both worlds, precise embeddings and more cohesive thoughts per chunk. You can see where this is headed … it's an optimization problem.
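Here's a rough sketch of what that small-to-big retrieval could look like, reusing the chunk_fixed helper from the previous sketch. The embed function is a stand-in for whatever embedding model you use, and scoring parents by how many of their children matched the query is just one reasonable choice among several.

```python
from collections import Counter

import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def build_small_to_big_index(large_chunks: list[str], embed) -> list[dict]:
    """Split each large (parent) chunk into small child chunks and embed only the children."""
    index = []
    for parent_id, parent in enumerate(large_chunks):
        for child in chunk_fixed(parent, chunk_size=50, overlap_pct=0.1):
            index.append({"parent_id": parent_id, "text": child, "vec": embed(child)})
    return index


def retrieve_parents(query: str, index: list[dict], large_chunks: list[str], embed,
                     top_children: int = 20, top_parents: int = 3) -> list[str]:
    """Rank the small chunks against the query, then return the parents whose children matched most."""
    q = embed(query)
    ranked = sorted(index, key=lambda c: cosine(q, c["vec"]), reverse=True)[:top_children]
    parent_hits = Counter(c["parent_id"] for c in ranked)
    return [large_chunks[pid] for pid, _ in parent_hits.most_common(top_parents)]
```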
You could run sentiment analysis and entity extraction too; you just need to figure out how to use them from a search perspective.
One thing related to this is keyword search. A simple one that I have implemented is log-normalized TF-IDF, which works best on your larger chunks. Break each chunk down into a bag of lowercased words, count how many times each word occurs in that chunk, and also count how many chunks each word appears in at all. From that you can find the keywords that are specific to your data: the most important ones are the rare ones, with low counts overall and low presence across the chunks.
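As a sketch, one common log-normalized formulation looks like this: 1 + log of the in-chunk count for TF, and a smoothed IDF that rewards words appearing in few chunks. The tokenizer here is deliberately crude.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())  # bag of lowercased words


def build_tfidf(chunks: list[str]):
    """Per-chunk word counts, plus how many chunks each word appears in."""
    term_counts = [Counter(tokenize(c)) for c in chunks]
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())  # count each word at most once per chunk
    return term_counts, doc_freq


def tfidf_scores(query: str, term_counts: list[Counter], doc_freq: Counter) -> list[float]:
    """Score every chunk against the query with log-normalized TF and smoothed IDF."""
    n_chunks = len(term_counts)
    scores = []
    for counts in term_counts:
        score = 0.0
        for word in tokenize(query):
            if counts[word]:
                tf = 1.0 + math.log(counts[word])                       # dampen repeats within a chunk
                idf = math.log((1 + n_chunks) / (1 + doc_freq[word]))   # rarer across chunks => higher
                score += tf * idf
        scores.append(score)
    return scores
```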
This keyword search leg can be combined with your embeddings: rank each stream separately, then merge the rankings with a hybrid search fusion algorithm such as RRF or RSF (ref).
RRF is easier to implement, but RSF may be more precise, especially when there are multiple ambiguous correlations.
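For illustration, here's a sketch of both fusions operating on per-stream {doc_id: score} dicts (one from the embedding leg, one from the keyword leg). k=60 is the constant commonly used with RRF, and the min-max normalization shown for RSF is one common way to interpret relative score fusion.

```python
def rrf(result_streams: list[dict[str, float]], k: int = 60) -> dict[str, float]:
    """Reciprocal Rank Fusion: each stream contributes 1 / (k + rank) per document."""
    fused: dict[str, float] = {}
    for results in result_streams:
        ranked = sorted(results, key=results.get, reverse=True)  # doc ids, best score first
        for rank, doc_id in enumerate(ranked, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return fused


def rsf(result_streams: list[dict[str, float]]) -> dict[str, float]:
    """Relative score fusion: min-max normalize each stream to [0, 1], then sum."""
    fused: dict[str, float] = {}
    for results in result_streams:
        if not results:
            continue
        lo, hi = min(results.values()), max(results.values())
        spread = (hi - lo) or 1.0  # avoid divide-by-zero when all scores are equal
        for doc_id, score in results.items():
            fused[doc_id] = fused.get(doc_id, 0.0) + (score - lo) / spread
    return fused
```

Either way you'd call it as something like fused = rrf([embedding_results, keyword_results]) and then sort the fused scores to get the final ranking.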
So, that’s pretty much it. There are a lot of variations to this theme. It’s certainly a fun and challenging problem to optimize. But, as you alluded to, it’s definitely more on the retrieval side IMO.