Relationship between cosine similarity and chunk size?

This is more of an embedding question than a model question.

I am embedding my content using Weaviate’s Text2Vec-OpenAI transformer. This uses the text-embedding-ada-002 model:

Module: text2vec-openai

"moduleConfig": {
  "text2vec-openai": {
    "model": "ada",
    "modelVersion": "002",
    "type": "text",
    "vectorizeClassName": true
  }
}

Model: text-embedding-ada-002
Tokenizer: cl100k_base
Dimensions: 1536

My question is: Is there a relationship between the size of my vector objects and the efficiency/effectiveness of the cosine similarity search? In other words, do I get better results with smaller chunks than with larger ones, or is there really no difference until you get to the really large or really small extremes? Right now, I’m working with a chunk size of 2,500 characters which, up until now, has been perfect.

Recently, I started working with the Bible, which has all these verses, and I’m wondering if I should reduce the chunk size or if it really makes a difference. Unlike with other documents (regulatory texts), my searches do not seem, overall, to be as effective. Anybody have any thoughts on this?

It’s not really about the “size of the vector object”; I would think more about the quality of semantic search you get by changing the size of the chunks.

A larger chunk can carry more context, so if you ask whether Luke met John, the chance that the two appear in the same passage is higher. But the tradeoff is a loss of specificity: you’re going to get a lot of returns for “rules about food” whose scores aren’t distinguished enough to surface what you actually want. A lot of things that are semantically just “bible-y”.

With larger chunks, you have fewer of them, and fewer total database vectors to search and to score against your comparison input. Chunking at 100 words instead of 1,000 words means roughly 10x as many vectors, and 10x as much computation to compare similarity against the whole corpus.
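
A rough illustration of that scaling (hypothetical numbers, and a brute-force comparison rather than whatever ANN index the database actually uses under the hood):

corpus_words = 800_000   # assumption: roughly the size of the Bible

for chunk_words in (1_000, 100):
    n_chunks = corpus_words // chunk_words
    # brute force: one 1536-dimension dot product per stored vector, per query
    print(f"{chunk_words}-word chunks -> {n_chunks} vectors -> {n_chunks} similarity computations per query")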

You pay by the token, so the total price is driven by the total size of the documents you submit to OpenAI, not by the chunking method, unless you are using overlap schemes that grow the total input, or carrying different amounts of metadata overhead.

Then your ultimate question is usefulness. Is 1,000 words, a length where you can only pass one chunk to the AI, better than a jumble of ten 100-word passages?

So I don’t have answers, just considerations you might ponder.

If doing RAG, then chunk on sizes that contain one or two coherent thoughts.

For global general search, you can go bigger, but think about how big that chunk is compared to the incoming query.

Perfection is when the query identically matches something in your database (cosine similarity = 1.0). That usually isn’t realistic, because it means the query is literally already in your database.

Away from this perfect case, you are leaning on the AI’s training and its understanding of semantics. So it’s really your call on performance given this mismatch, since you are now evaluating the embedding model itself.

I had some ideas about dynamically expanding and contracting the text surrounding the top-scoring embedding(s) to iteratively maximize the similarity, and returning that maximized chunk, extracted on the fly from your documents. I called it “heatmap” or something like that in this forum.
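
Very roughly, the idea looks something like the sketch below, assuming a caller-supplied embed() function and a pre-split list of sentences; it is only an illustration of the general approach, not the implementation from that thread.

import numpy as np

def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def maximize_window(sentences, query_vec, embed, start, end):
    """Greedily grow or shrink the [start, end) sentence window around a hit
    until the cosine similarity to the query stops improving."""
    best = cosine(embed(" ".join(sentences[start:end])), query_vec)
    improved = True
    while improved:
        improved = False
        # try expanding left/right, then contracting left/right
        for s, e in ((start - 1, end), (start, end + 1), (start + 1, end), (start, end - 1)):
            if 0 <= s < e <= len(sentences):
                score = cosine(embed(" ".join(sentences[s:e])), query_vec)
                if score > best:
                    best, start, end, improved = score, s, e, True
    return " ".join(sentences[start:end]), best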

I was trying to troubleshoot a particular problem. In all of the Bible:

It seems a young man named Eutychus fell from the third loft, according to Acts 20:9 which states “And there sat in a window a certain young man named Eutychus, being fallen into a deep sleep: and as Paul was long preaching, he sunk down with sleep, and fell down from the third loft, and was taken up dead.”

So, a similarity search using the concept “who fell from the loft?” should bring up Acts 20 at a relatively short distance.

The passage in 1 Kings 17 refers to Elijah taking the widow’s son up to a loft, but does not mention anyone falling.

And yet, my cosine similarity searches for “who fell from the loft”, “who fell from the loft while Paul was preaching”, “who fell out the window”, etc. do not bring up Acts 20 until 20 or so hits down the list. In other words, they bring up a variety of other passages, none of which mention “loft” or “falling”, before reaching this passage.

In fact, using this query:

      limit: 20
      nearText: {
        concepts: ["who fell from the loft while Paul preached?"],
      }
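
(For context, the equivalent call with the Weaviate Python client, assuming the v3 client syntax, with the class name and returned fields inferred from the results below, would look roughly like this:)

import weaviate

client = weaviate.Client("http://localhost:8080")  # assumption: local instance

result = (
    client.query
    .get("SolrCopy", ["title"])
    .with_near_text({"concepts": ["who fell from the loft while Paul preached?"]})
    .with_additional(["distance"])
    .with_limit(20)
    .do()
)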

Here are the results:

{
  "data": {
    "Get": {
      "SolrCopy": [
        {
          "_additional": {
            "distance": 0.17151582
          },
          "title": "Acts-28.pdf"
        },
        {
          "_additional": {
            "distance": 0.17632425
          },
          "title": "Acts-21.pdf"
        },
        {
          "_additional": {
            "distance": 0.1833719
          },
          "title": "Acts-17.pdf"
        },
        {
          "_additional": {
            "distance": 0.18469441
          },
          "title": "Acts-14.pdf"
        },
        {
          "_additional": {
            "distance": 0.18544805
          },
          "title": "Acts-16.pdf"
        },
        {
          "_additional": {
            "distance": 0.18624592
          },
          "title": "Acts-9.pdf"
        },
        {
          "_additional": {
            "distance": 0.18657953
          },
          "title": "Acts-9.pdf"
        },
        {
          "_additional": {
            "distance": 0.18708062
          },
          "title": "Acts-28.pdf"
        },
        {
          "_additional": {
            "distance": 0.1901241
          },
          "title": "Acts-18.pdf"
        },
        {
          "_additional": {
            "distance": 0.19148755
          },
          "title": "Acts-16.pdf"
        },
        {
          "_additional": {
            "distance": 0.1919235
          },
          "title": "Acts-24.pdf"
        },
        {
          "_additional": {
            "distance": 0.19290006
          },
          "title": "Acts-18.pdf"
        },
        {
          "_additional": {
            "distance": 0.19304073
          },
          "title": "Acts-19.pdf"
        },
        {
          "_additional": {
            "distance": 0.19388211
          },
          "title": "Acts-22.pdf"
        },
        {
          "_additional": {
            "distance": 0.19429177
          },
          "title": "Galatians-2.pdf"
        },
        {
          "_additional": {
            "distance": 0.19430208
          },
          "title": "Acts-19.pdf"
        },
        {
          "_additional": {
            "distance": 0.19438648
          },
          "title": "Acts-20.pdf"
        },
        {
          "_additional": {
            "distance": 0.19662309
          },
          "title": "Acts-25.pdf"
        },
        {
          "_additional": {
            "distance": 0.19671476
          },
          "title": "Acts-24.pdf"
        },
        {
          "_additional": {
            "distance": 0.19726074
          },
          "title": "I_Samuel-28.pdf"
        }
      ]
    }
  }
}

Whereas, in my thinking, Acts 20 should be near the top of the list. So I’m wondering: am I getting this type of result because of my chunk size (2,500 characters)?

Would you say the keywords “loft”, “fell”, “Paul”, “preached” would be a better fit for this exact passage?

If so, you might want to add keyword search alongside your embedding search: search on keywords and on embeddings, then combine them into an overall ranking to push this passage to the top of your list. The theory is that only this passage contains all four of those keywords.
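
A minimal sketch of one way to do that combination, using reciprocal rank fusion over the two ranked result lists (the constant k and the input lists are placeholders; Weaviate also has a built-in hybrid search that does something along these lines):

def rrf_fuse(keyword_ranking, vector_ranking, k=60):
    """Combine two best-first lists of chunk ids with reciprocal rank fusion."""
    scores = {}
    for ranking in (keyword_ranking, vector_ranking):
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fused = rrf_fuse(bm25_hits, near_text_hits)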

If that’s not true, then it means you either need to reduce your chunk size or increase your query length.

You can also try HyDE, especially with the Bible. I’m pretty sure GPT was trained on the Bible, so its fake answer will correlate better with your data. Give it a shot!
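
The gist, as a sketch using the 0.x-style openai client that was current at the time (the model choice and prompt are just examples): embed a hallucinated answer instead of the bare question, and search with that vector.

import openai  # 0.x-style client

def hyde_vector(question):
    # 1. Let the chat model hallucinate a plausible answer to the question.
    fake_answer = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Answer this question as it relates to the Bible."},
            {"role": "user", "content": question},
        ],
    )["choices"][0]["message"]["content"]
    # 2. Embed the hallucinated answer rather than the short question.
    return openai.Embedding.create(
        model="text-embedding-ada-002", input=fake_answer
    )["data"][0]["embedding"]

# The returned vector is what goes into the nearVector / similarity search.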

I have done that.

But the question still remains: Why does the cosine similarity search not work as expected?

I use a form of HyDE, which creates the “concept” that is used to construct the NearText search. Same results.

For the record, GPT-4 addresses the question simply because it’s in its training, but does not answer the question because it is instructed NOT to answer questions it can’t verify in the returned context documents.

How would you increase the length of this query without actually answering the question?

Your PDFs are a collection of all the chapters of Acts, right? You want 20:9, but are embedding all of Acts 20?

Based on your distance metrics of 0.19, you are WAY OFF in your correlations. You should be less than 0.05 or so to even be in the ballpark.

So shrink the embeddings down to each paragraph or statement in the Bible. Keep the big embeddings too; you can mix and match the embedding sizes. Embedding much more data than the query just adds noise and will give you those way-off correlations.
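
A rough sketch of that mix-and-match indexing, assuming the chapter text can be split into verses (the split_into_verses helper is hypothetical and left to the caller):

def build_chunks(chapter_title, chapter_text, split_into_verses):
    """Index the same chapter at two granularities: whole chapter and per verse."""
    chunks = [{"title": chapter_title, "granularity": "chapter", "text": chapter_text}]
    for i, verse in enumerate(split_into_verses(chapter_text), start=1):
        chunks.append({"title": chapter_title, "granularity": "verse",
                       "verse_number": i, "text": verse})
    # embed each chunk["text"] and store both sizes side by side in the same index
    return chunks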

One PDF per chapter, correct. So, Acts 20 is one PDF, Acts 21 is one PDF, etc.

Sounds like your recommendation is smaller chunk sizes for the Bible. That is my suspicion too, but I’ve not found any documentation that corroborates it. I’m going by memory because it’s been a while since I last researched this, but from what I recall, given my vector dimensions, my chunk size was fine.

This sounds like exactly what is happening. I’m still waiting to hear back from Weaviate, but I think I’m going to try smaller chunk sizes to see if that makes a difference.

Thanks for the suggestions!

What would happen if a modernized version of the Bible was used for the embeddings?
The idea is that the language used in the Bible is vastly different from the language we use today, and that the text the models have been trained on is pretty much all modern.
If this idea has some merit, it might also be worth trying to reformulate the question in Bible language, but I suppose the first approach could be more effective.

That is an interesting idea. But now, my cosine similarity searches would have to depend upon essentially summarizations of the passages instead of the passages themselves. That would make me a little nervous.

That could, or will, cause issues, I expect.
But what about a rewrite instead of a summary?
Maybe “rewrite Acts in ELI10” and then chunk that, instead of summarizing before chunking?

It could raise the similarity values when a user asks a question in modern English and it gets compared against the Bible’s English.

Or would it? Maybe someone with more knowledge can provide some insights.

It’s worth considering. It certainly makes sense logically. As with all these AI systems, the proof is in the execution. I would definitely not try this with gpt-3.5-turbo-16k. I’m sure gpt-4 could do it, but it could be costly.

Anyway, thanks for the idea. I’ll definitely think about it.

Excuse my ignorance, but what is this?

Extremely low IQ: 10?

No, it’s “Explain it like I’m 10”; the popular Reddit version is ELI5.

How about training your HyDE step to write Bible verses from user input? :man_genie:

Hmmmm… Now that’s a thought. Shouldn’t be difficult.

I’ll run this by the Weaviate folks and see what they say. I wonder if the OpenAI transformer used in their cosine similarity searches is simply failing to bridge modern and Biblical English. It only fails sometimes, but when it does, it’s a whopper. Like “Who fell out of the loft?”

You can just say “semantic similarity” or “embeddings search”. The cosine method or dot product (which are equivalent for ada vectors, since they come back normalized to unit length) can pretty much be assumed, and spelling it out will tweak anyone who hears it.

If you really want to optimize lookups, you can discard the most uniform dimensions from the whole set after doing some statistical analysis, accentuating the differences. There are probably lots of aspects of the language model’s internal state, like “has to do with refrigeration pumps” or “streets of Paris”, that are just wasted on a uniform ancient text.
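
A minimal sketch of that kind of pruning, keeping only the highest-variance dimensions measured across your stored vectors (the number to keep and the re-normalization step are my own assumptions):

import numpy as np

def prune_dimensions(vectors, keep=768):
    """Keep the `keep` dimensions with the highest variance across the corpus.
    `vectors` is an (n_chunks, 1536) array of stored embeddings."""
    vectors = np.asarray(vectors)
    keep_idx = np.argsort(vectors.var(axis=0))[-keep:]       # most informative dimensions
    pruned = vectors[:, keep_idx]
    pruned = pruned / np.linalg.norm(pruned, axis=1, keepdims=True)  # re-normalize for cosine
    return pruned, keep_idx  # apply the same keep_idx to each query vector before comparing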

FWIW, I tried embedding and correlating the two things (small chunks), and it wasn’t great:

Msg0 = "And there sat in a window a certain young man named Eutychus, being fallen into a deep sleep: and as Paul was long preaching, he sunk down with sleep, and fell down from the third loft, and was taken up dead."
Msg1 = "who fell from the loft?"

This resulted in poor cosine similarity:
Dot Product: 0.7983558998154735
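
For reference, those numbers are just two ada-002 embeddings dotted together; a sketch of the comparison, using the 0.x-style openai client that was current at the time:

import numpy as np
import openai  # 0.x-style client

def ada_similarity(text_a, text_b):
    """Embed two strings with text-embedding-ada-002 and return their dot product
    (equal to cosine similarity, since ada vectors come back unit-length)."""
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=[text_a, text_b])
    data = sorted(resp["data"], key=lambda d: d["index"])
    v0, v1 = (np.array(d["embedding"]) for d in data)
    return float(v0 @ v1)

# e.g. ada_similarity(Msg0, Msg1)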

So this brings up the earlier comment about using a modernized version of the Bible.

So I had GPT-4 translate the Bible verse to a more modern version, with this system message: “Modernize this ancient biblical verse to use modern language and grammar.”

I then get this pair:

Msg0 = "A young man named Eutychus was sitting in a window, having fallen into a deep sleep. As Paul's sermon stretched on, he was overcome with sleep and fell from the third floor window. Those below presumed him dead when they picked him up."
Msg1 = "who fell from the loft?"

With only slightly better cosine similarity:
Dot Product: 0.809386048240508

So it looks like the real problem is that your query is really thin on keywords, so I added more to the query:

Msg0 = "A young man named Eutychus was sitting in a window, having fallen into a deep sleep. As Paul's sermon stretched on, he was overcome with sleep and fell from the third floor window. Those below presumed him dead when they picked him up."
Msg1 = "What was the name of the man, who sat at a window, fell to his death after falling asleep, as told in Paul's sermon."

And now get a big improvement in cosine similarity:
Dot Product: 0.9323429289764913

So because your query is so sparse, it’s information starved. Looks like you need to beef this up.

So how do you do this automatically? I just used HyDE.

So I gave GPT-4 the system prompt “Answer this question as it relates to the Bible.” and the user question “Who fell from the loft?”. It gave the answer “A young man named Eutychus fell from the loft as related in the Bible, specifically in Acts 20:9”, which I then correlated:

Msg0 = "A young man named Eutychus was sitting in a window, having fallen into a deep sleep. As Paul's sermon stretched on, he was overcome with sleep and fell from the third floor window. Those below presumed him dead when they picked him up."
Msg1 = "A young man named Eutychus fell from the loft as related in the Bible, specifically in Acts 20:9."

And I get a cosine similarity similar to, in fact slightly higher than, the intentionally beefed-up version of the query.
Dot Product: 0.9356396103687558

So, bottom line, it looks like you need to use HyDE AND also embed smaller chunks AND modernize the biblical text before you embed. :sweat_smile:

Granted, you could just ask GPT-4, since it was trained on the Bible. But what fun is that? Especially when the embeddings can pull the exact biblical verse back if needed. :face_with_monocle:

But hopefully you see what I am doing, which is transforming both the query and targets so that they align more and more. Including chopping the targets into smaller chunks to increase the correlation.

These transformations are key to building RAG systems that perform.

Wow. Thank you. Yes, I see what you’re doing. And I’m impressed at how you arrived at the conclusions. Right now, this dataset is running on a site with multiple knowledgebases. I need to move it to its own so that I can implement these changes.

I don’t know if this is the ultimate solution, but it sounds like it. Thank you for your time on this.

Conceptually, vector embeddings are combinations of all of the semantic features of the passage, each with some value, represented in 1536-dimensional vector space. The fact that your query matches part of the text exactly doesn’t mean that the query vector is close to the chunk vector, particularly not if that chunk vector has other “peaky” semantic properties.

So, the takeaway is that small chunk size will reduce the other “peaky” properties?