Agreed. I am using Weaviate, where my metadata is also embedded, and I’ve built in a keyword capability. There’s no getting around the noise – the contracts are what they are. The other hard part I forgot to mention is training users to phrase their questions well and to use the tools available.
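For anyone curious what the keyword-plus-vector setup looks like in practice, here is a minimal sketch using the Weaviate v4 Python client. The collection name "Contracts", the "guild" and other properties, and the alpha weighting are illustrative assumptions, not my actual schema:

```python
# Minimal sketch of keyword + vector (hybrid) retrieval in Weaviate.
# Assumptions: a local instance, a collection named "Contracts", and
# "guild"/"title"/"agreement_year" metadata properties -- adjust to your schema.
import weaviate
from weaviate.classes.query import Filter

client = weaviate.connect_to_local()
try:
    contracts = client.collections.get("Contracts")

    # alpha blends BM25 keyword scoring (0.0) with vector similarity (1.0).
    response = contracts.query.hybrid(
        query="meal penalty provisions for overtime on distant location",
        alpha=0.5,
        limit=20,
        filters=Filter.by_property("guild").equal("IATSE"),
    )

    for obj in response.objects:
        print(obj.properties.get("title"), obj.properties.get("agreement_year"))
finally:
    client.close()
```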
I’ve been a database developer for over 40 years, and I was a pioneer in the area of Electronic Publishing some 30 years ago. I knew a little something about data structure, and in particular text data structure, before I got into this AI game. I generally use Semantic Chunking https://www.youtube.com/watch?v=B5B4fF95J9s&ab_channel=SwingingInTheHood and various embedding strategies depending on the type of text (legal, policy, sermons, regulatory code, religious texts, scientific, etc.) I’m working with. And I use metadata fairly extensively.
These are very detailed and extensive contracts, chunked at the top level by their hierarchical/semantic structure, then at the second level by size. And there are multiple agreements over multiple years, so depending on the scope of the search, similarity results could easily be found in 50+ document chunks. And don’t get me started if the user expands the search out to multiple guilds (IATSE, SAG-AFTRA, DGA, WGA, AFM, etc.).
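To make the two-level idea concrete, here is a rough sketch: split first on the contract’s own article/section headings, then re-split anything still over a size budget. The heading regex and the character limit are placeholders; the real splitter follows the actual structure of each agreement.

```python
import re
from typing import List

# Placeholder pattern for top-level structural breaks (articles/sections);
# the real pattern depends on how each agreement is formatted.
HEADING = re.compile(r"^(ARTICLE|SECTION)\s+[\w.]+", re.MULTILINE)

def chunk_by_structure(text: str) -> List[str]:
    """First pass: split on the contract's own hierarchical headings."""
    starts = [m.start() for m in HEADING.finditer(text)]
    if not starts or starts[0] != 0:
        starts = [0] + starts          # keep any preamble before the first heading
    bounds = starts + [len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:]) if text[a:b].strip()]

def chunk_by_size(section: str, max_chars: int = 4000) -> List[str]:
    """Second pass: re-split any section that is still too large,
    breaking on paragraph boundaries where possible."""
    if len(section) <= max_chars:
        return [section]
    pieces, current = [], ""
    for para in section.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            pieces.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        pieces.append(current.strip())
    return pieces

def two_level_chunks(text: str) -> List[str]:
    return [piece for section in chunk_by_structure(text)
            for piece in chunk_by_size(section)]
```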
But to your point, it could be that the document chunk size I’ve chosen is too small – this was done to try to reduce the average cost per query. We weren’t getting better answers at a higher chunk size, so I figured it was worth a try. Again, trying to find that fine line.
However, I’ve got a plan. The cosine similarity search is almost certainly going to bring back the most relevant documents. So all I have to do is configure my system to analyze x documents at a time, retrieve the relevant ones used in the response, and then return either a summary or a concatenation of the individual responses. Sort of like either the map-reduce or refine summarization strategies, but at the RAG level.
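A rough sketch of that map-reduce-style plan is below. Here ask_llm stands in for whatever model call the pipeline actually uses, and the batch size of 5 and the prompt wording are placeholders, not the real configuration:

```python
from typing import List

def ask_llm(prompt: str) -> str:
    """Placeholder for the actual model call (OpenAI, Anthropic, local, etc.)."""
    raise NotImplementedError

def batched(items: List[str], size: int) -> List[List[str]]:
    return [items[i:i + size] for i in range(0, len(items), size)]

def map_reduce_answer(question: str, chunks: List[str], batch_size: int = 5) -> str:
    # Map step: answer the question against a few retrieved chunks at a time,
    # keeping only the batches that actually contained something relevant.
    partial_answers = []
    for batch in batched(chunks, batch_size):
        context = "\n\n---\n\n".join(batch)
        answer = ask_llm(
            f"Using only the contract excerpts below, answer the question.\n"
            f"If nothing here is relevant, reply exactly: NOT RELEVANT.\n\n"
            f"Question: {question}\n\nExcerpts:\n{context}"
        )
        if "NOT RELEVANT" not in answer:
            partial_answers.append(answer)

    # Reduce step: merge the partial answers into one complete response
    # (a straight concatenation would work here as well).
    combined = "\n\n".join(partial_answers)
    return ask_llm(
        f"Combine the partial answers below into one complete, non-redundant "
        f"answer to the question: {question}\n\n{combined}"
    )
```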
What the end user wants is a complete answer, and I think this is how we get it to him.
Actually, if I could be guaranteed that the model could read a 200K-token context with 100% accuracy, that would solve this particular problem. But unfortunately, we don’t live in that world – yet.