By having fewer properties, yes. Whether that improves your RAG performance overall, though, will depend on whether it actually elevates the intended chunks, or instead produces many very similar smaller chunks that then confound performance because they have lost their distinctive properties.
Curt’s treatment is really the most informative, but basically you can improve cosine similarity either by making the question look more like the stored vectors (more enriched questions) or by making the vectors look more like your query (smaller chunks).
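To make that concrete, here is a rough sketch of how you can eyeball the effect (not code from this thread; the OpenAI Python client and the embedding model name are just examples): embed a modern-English question, then compare it against the same passage in archaic and in modernized form.

```python
# Minimal sketch: cosine similarity between a modern-English query and two
# versions of the same passage (archaic vs. modernized). The embedding model
# name is only an example.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = "Who fell asleep during a long sermon and fell out of a window?"

kjv_chunk = ("And there sat in a window a certain young man named Eutychus, "
             "being fallen into a deep sleep: and as Paul was long preaching, "
             "he sunk down with sleep, and fell down from the third loft, and "
             "was taken up dead.")  # Acts 20:9, KJV
modern_chunk = ("Eutychus fell asleep while Paul preached late into the night "
                "and fell out of the third-story window.")

q = embed(query)
print("query vs. KJV chunk:    ", cosine(q, embed(kjv_chunk)))
print("query vs. modern chunk: ", cosine(q, embed(modern_chunk)))
```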
Alright, Gentlemen, I thank you for your input. I’ve come up with a plan.
I already have built into my architecture the ability to add a summary of the main document to each chunk. That summary is part of the vectorized object. So, I should be able to add code that summarizes the chunk itself. In essence, I could reduce the size of each chunk of a Bible chapter and then add a modern English summary to that chunk.
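Roughly what I have in mind is something like the sketch below (the field names and the prompt are placeholders, not my actual schema):

```python
# Rough sketch: summarize each smaller chunk in modern English with a chat
# model, then store the summary alongside the chunk in the vectorized object.
from openai import OpenAI

client = OpenAI()

def summarize_chunk(chunk_text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in for whatever chat model is used
        messages=[
            {"role": "system",
             "content": ("Summarize the following Bible passage in plain, modern "
                         "English. Keep the names and events; do not interpret.")},
            {"role": "user", "content": chunk_text},
        ],
    )
    return resp.choices[0].message.content.strip()

def build_vector_object(doc_summary: str, chunk_text: str) -> dict:
    chunk_summary = summarize_chunk(chunk_text)
    return {
        "doc_summary": doc_summary,      # summary of the whole chapter/book
        "chunk_text": chunk_text,        # the (smaller) original passage
        "chunk_summary": chunk_summary,  # modern-English summary of the chunk
        # this combined string is what actually gets embedded
        "embedding_input": f"{doc_summary}\n\n{chunk_summary}\n\n{chunk_text}",
    }
```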
In addition, I can gear the System and concept prompts to give a clearer indication to the AI that we are dealing with Biblical texts.
My plan isn’t exactly the above, and I do wonder what effect duplicating the text will have, but it at least is pushing me in the right direction while still utilizing the infrastructure I currently have in place.
When you summarize, you could be losing some information.
But I can see the value in summarization, because you get smaller chunks, and maybe you can “modernize” the text as well. Just beware of information loss: things dropped during summarization.
So try it and see. But I think you know the path is really governed by one simple statement, which is that the same text embedded twice gives the exact same (within roundoff) embedding vector.
The closer you can shape the query and targets into similarity, the higher this correlation.
But you don’t want to add too much fake information, or drop critical information, in this process, because both of these will fight against your true correlation scores.
So keeping these ideas in mind, transform away! There is no one-size-fits-all answer because each domain can have its unique quirks.
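If you want to see that principle in numbers, here is a tiny self-contained check (the embedding model name is just an example): identical text lands at a cosine similarity of about 1.0, and padding the text with made-up information pulls the score back down.

```python
# Sanity check of the "same text, same vector" principle; any embedding
# endpoint behaves this way, the model name is only an example.
import numpy as np
from openai import OpenAI

client = OpenAI()

def emb(text: str) -> np.ndarray:
    data = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(data.data[0].embedding)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

text = "Paul preached until midnight and Eutychus fell asleep in the window."
padded = text + " The weather in Troas was unseasonably warm that year."  # invented "fake" detail

print(cos(emb(text), emb(text)))    # ~1.0, identical text (within roundoff)
print(cos(emb(text), emb(padded)))  # lower: the padding dilutes the match
```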
So, I’ve begun re-working my code to accommodate this “Bible translation” scheme. I’ve not yet been able to reduce the chunk size or adjust the prompts – I actually need to move this dataset to its own site for that. But I was able to jury-rig the ability to create translated (into Modern English) versions of each Bible text chunk. This is the result:
Yes, the model still says it doesn’t understand, but this is the first time I’ve gotten the relevant passage, Acts 20, to appear in the top 5 results for this query since I’ve been trying. Normally, it appears after 20 or so citations.
That means translating the Bible into more modern English, which is the vernacular in which most people today (from English-speaking countries, that is) would ask questions, is a solid plan. And I believe reducing the chunk size will reduce the noise and ensure that the model can actually understand the documents being returned.
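For anyone curious, the translation pass is roughly the sketch below (the prompt wording and model name are stand-ins, not my actual jury-rigged code):

```python
# Simplified sketch of the translation pass: rewrite each KJV chunk in modern
# English before embedding, keeping the original text alongside it.
from openai import OpenAI

client = OpenAI()

def translate_chunk(kjv_text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in model name
        messages=[
            {"role": "system",
             "content": ("Rewrite this Bible passage in clear, modern English. "
                         "Preserve verse numbers, names, and meaning. Do not "
                         "summarize or add interpretation.")},
            {"role": "user", "content": kjv_text},
        ],
    )
    return resp.choices[0].message.content.strip()

def translate_corpus(chunks: list[str]) -> list[dict]:
    # both versions get stored so nothing from the original text is lost
    return [{"original": c, "modern": translate_chunk(c)} for c in chunks]
```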
Wow. It’s going to take a minute to translate this thing, but once done, it looks like we’re going to have a pretty good Bible Q & A system. Thank you!
I almost forgot, this was YOUR idea! And, my first reaction was like, “Are you kidding me? Translate the entire Bible?” But, as fate would have it, that’s what is working so far. Thank you!