By having fewer properties, yes. Whether that improves your RAG performance overall, though, will depend on whether it actually elevates the intended chunks, or instead produces many very similar smaller chunks that then confound performance because they have lost their distinctive properties.
Curt’s treatment is really the most informative, but basically you can improve cosine similarity either by making the question look more like the stored vectors (more enriched questions) or by making the vectors look more like your query (smaller chunks).
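To make that concrete, here is a rough sketch of how you can eyeball the effect (not code from this thread; the OpenAI Python client and the embedding model name are just examples): embed a modern-English question, then compare it against the same passage in archaic and in modernized form.

```python
# Minimal sketch: cosine similarity between a modern-English query and two
# versions of the same passage (archaic vs. modernized). The embedding model
# name is only an example.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = "Who fell asleep during a long sermon and fell out of a window?"

kjv_chunk = ("And there sat in a window a certain young man named Eutychus, "
             "being fallen into a deep sleep: and as Paul was long preaching, "
             "he sunk down with sleep, and fell down from the third loft, and "
             "was taken up dead.")  # Acts 20:9, KJV
modern_chunk = ("Eutychus fell asleep while Paul preached late into the night "
                "and fell out of the third-story window.")

q = embed(query)
print("query vs. KJV chunk:    ", cosine(q, embed(kjv_chunk)))
print("query vs. modern chunk: ", cosine(q, embed(modern_chunk)))
```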
Alright, Gentlemen, I thank you for your input. I’ve come up with a plan.
I already have built into my architecture the ability to add a summary of the main document to each chunk. That summary is part of the vectorized object. So, I should be able to add code that summarizes the chunk itself. In essence, I could reduce the size of each chunk of a Bible chapter and then add a modern English summary to that chunk.
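Roughly what I have in mind is something like the sketch below (the field names and the prompt are placeholders, not my actual schema):

```python
# Rough sketch: summarize each smaller chunk in modern English with a chat
# model, then store the summary alongside the chunk in the vectorized object.
from openai import OpenAI

client = OpenAI()

def summarize_chunk(chunk_text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in for whatever chat model is used
        messages=[
            {"role": "system",
             "content": ("Summarize the following Bible passage in plain, modern "
                         "English. Keep the names and events; do not interpret.")},
            {"role": "user", "content": chunk_text},
        ],
    )
    return resp.choices[0].message.content.strip()

def build_vector_object(doc_summary: str, chunk_text: str) -> dict:
    chunk_summary = summarize_chunk(chunk_text)
    return {
        "doc_summary": doc_summary,      # summary of the whole chapter/book
        "chunk_text": chunk_text,        # the (smaller) original passage
        "chunk_summary": chunk_summary,  # modern-English summary of the chunk
        # this combined string is what actually gets embedded
        "embedding_input": f"{doc_summary}\n\n{chunk_summary}\n\n{chunk_text}",
    }
```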
In addition, I can gear the System and concept prompts to give a clearer indication to the AI that we are dealing with Biblical texts.
My plan isn’t exactly the above, and I do wonder what effect duplicating the text will have, but it at least is pushing me in the right direction while still utilizing the infrastructure I currently have in place.
When you summarize, you could be losing some information.
But I can see the value in summarization, because you get smaller chunks, and maybe you can “modernize” the text as well. Just beware of information loss: things dropped during summarization.
So try it and see. But I think you know the path is really governed by one simple statement, which is that the same text embedded twice gives the exact same (within roundoff) embedding vector.
The closer you can shape the query and targets into similarity, the higher this correlation.
But you don’t want to add too much fake information, or drop critical information, in this process, because both of these will fight against your true correlation scores.
So keeping these ideas in mind, transform away! There is no one-size-fits-all answer because each domain can have its unique quirks.
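If you want to see that principle in numbers, here is a tiny self-contained check (the embedding model name is just an example): identical text lands at a cosine similarity of about 1.0, and padding the text with made-up information pulls the score back down.

```python
# Sanity check of the "same text, same vector" principle; any embedding
# endpoint behaves this way, the model name is only an example.
import numpy as np
from openai import OpenAI

client = OpenAI()

def emb(text: str) -> np.ndarray:
    data = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(data.data[0].embedding)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

text = "Paul preached until midnight and Eutychus fell asleep in the window."
padded = text + " The weather in Troas was unseasonably warm that year."  # invented "fake" detail

print(cos(emb(text), emb(text)))    # ~1.0, identical text (within roundoff)
print(cos(emb(text), emb(padded)))  # lower: the padding dilutes the match
```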
So, I’ve begun re-working my code to accommodate this “Bible translation” scheme. I’ve not yet been able to reduce the chunk size or adjust the prompts – I actually need to move this dataset to its own site for that. But I was able to jury-rig the ability to create translated (into Modern English) versions of each Bible text chunk. This is the result:
Yes, the model still says it doesn’t understand, but this is the first time I’ve gotten the relevant passage, Acts 20, to appear in the top 5 results for this query since I’ve been trying. Normally, it appears after 20 or so citations.
That means translating the Bible into more modern English, which is the vernacular in which most people today (from English-speaking countries, that is) would ask questions, is a solid plan. And I believe reducing the chunk size will reduce the noise and ensure that the model can actually understand the documents being returned.
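For anyone curious, the translation pass is roughly the sketch below (the prompt wording and model name are stand-ins, not my actual jury-rigged code):

```python
# Simplified sketch of the translation pass: rewrite each KJV chunk in modern
# English before embedding, keeping the original text alongside it.
from openai import OpenAI

client = OpenAI()

def translate_chunk(kjv_text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in model name
        messages=[
            {"role": "system",
             "content": ("Rewrite this Bible passage in clear, modern English. "
                         "Preserve verse numbers, names, and meaning. Do not "
                         "summarize or add interpretation.")},
            {"role": "user", "content": kjv_text},
        ],
    )
    return resp.choices[0].message.content.strip()

def translate_corpus(chunks: list[str]) -> list[dict]:
    # both versions get stored so nothing from the original text is lost
    return [{"original": c, "modern": translate_chunk(c)} for c in chunks]
```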
Wow. It’s going to take a minute to translate this thing, but once done, it looks like we’re going to have a pretty good Bible Q & A system. Thank you!
I almost forgot, this was YOUR idea! And, my first reaction was like, “Are you kidding me? Translate the entire Bible?” But, as fate would have it, that’s what is working so far. Thank you!