I have done that.

But the question still remains: Why does the cosine similarity search not work as expected?

I use a form of HyDE, which creates the “concept” that is used to construct the NearText search. Same results.

For the record, GPT-4 can address the question simply because it’s in its training data, but it does not answer it here because it is instructed NOT to answer questions it can’t verify in the returned context documents.

How would you increase the length of this query without actually answering the question?

1 Like

Your PDF is a collection of all the Acts chapters, right? You are wanting 20:9, but are embedding all of Acts 20?

Based on your distance metric of 0.19, you are WAY OFF in your correlations. You should be less than 0.05 or so to even be in the ballpark.

So shrink the embeddings down to just each paragraph or statement in the Bible. Also keep the big embeddings too. You can mix and match the embedding sizes, but embedding more data than the query just adds noise and will give you correlations that are way off.
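For illustration only, here is a minimal sketch of verse-level chunking. The verse-number format and the helper name are assumptions, not part of anyone’s actual pipeline:

```python
import re

def chunk_chapter(chapter_text: str, book: str, chapter: int) -> list[dict]:
    """Split one chapter into verse-sized chunks, keeping a reference on each.

    Assumes each verse starts with its number, e.g. "9 And there sat in a window...".
    """
    parts = re.split(r"\n?(\d+)\s+", chapter_text)
    chunks = []
    # re.split keeps the captured verse numbers at the odd indices
    for i in range(1, len(parts) - 1, 2):
        verse_no, verse_text = parts[i], parts[i + 1].strip()
        if verse_text:
            chunks.append({
                "reference": f"{book} {chapter}:{verse_no}",
                "text": verse_text,
            })
    return chunks
```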

1 Like

One PDF per chapter, correct. So, Acts 20 is one PDF, Acts 21 is one PDF, etc…

Sounds like your recommendation is smaller chunk sizes for the Bible. Which is my suspicion, but I’ve not found any documentation that corroborates it. I’m going by memory because it’s been a while since I last researched this, but from what I recall, given my vector dimensions, my chunk size was fine.

This sounds like exactly what is happening. I’m still waiting to hear back from Weaviate, but I think I’m going to try smaller chunk sizes to see if that makes a difference.

Thanks for the suggestions!

What would happen if a modernized version of the bible was used for the embeddings?
The idea is that the language used in the bible is vastly different from the language we use today and that the amount of text the models have been trained on is pretty much all modern.
If this idea has some merit, it might also be worth trying to reformulate the question in Bible language, but I suppose the first approach could be more effective.

1 Like

That is an interesting idea. But then my cosine similarity searches would have to depend on what are essentially summarizations of the passages instead of the passages themselves. That would make me a little nervous.

1 Like

That could cause issues, I expect.
But what about rewriting instead of summarizing?
Maybe “rewrite Acts in ELI10” and then chunk it, instead of summarizing before chunking?

It could raise the similarity values when a user asks a question in modern English and it is compared against the Bible’s English.

Or would it? Maybe someone with more knowledge can provide some insights.

1 Like

It’s worth considering. It certainly makes sense logically. As with all these AI systems, the proof is in the execution. I would definitely not try this with gpt-3.5-turbo-16k. I’m sure gpt-4 could do it, but it could be costly.

Anyway, thanks for the idea. I’ll definitely think about it.

1 Like

Excuse my ignorance, but what is this?

Extremely low IQ: 10?

No, it’s “Explain it like I’m 10”; the popular Reddit version is ELI5.

How about training your HyDE on writing Bible verses from user input? :man_genie:

1 Like

Hmmmm… Now that’s a thought. Shouldn’t be difficult.

I’ll run this by the Weaviate folks and see what they say. I wonder if the OpenAI transformer used in their cosine similarity searches is simply failing to bridge modern and Biblical English. It only does it sometimes, but when it does, it’s a whopper. Like, “Who fell out of the loft?”

You can just say “semantic similarity” or “embeddings search”. The cosine method or dot product (which are equivalent for ada vectors) can pretty much be assumed, and spelling it out will just rub some listeners the wrong way.

If you really want to optimize lookups, you can do some statistical analysis and discard the most uniform dimensions from the whole set, accentuating the differences. There are probably lots of aspects of the language model’s internal state, like “has to do with refrigeration pumps” or “streets of Paris”, that are just wasted on a uniform ancient text.
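As a rough sketch of that idea (assuming a NumPy array of ada-002 vectors; the function name and the choice to keep 1024 dimensions are arbitrary):

```python
import numpy as np

def prune_uniform_dims(vectors: np.ndarray, keep: int = 1024):
    """Keep only the `keep` highest-variance dimensions across the corpus.

    vectors: (n_docs, 1536) array of unit-length embeddings.
    Returns the pruned, re-normalized vectors plus the kept dimension indices,
    which must also be applied to every query vector before comparing.
    """
    variances = vectors.var(axis=0)
    kept = np.argsort(variances)[-keep:]   # indices of the most "peaky" dimensions
    pruned = vectors[:, kept]
    pruned /= np.linalg.norm(pruned, axis=1, keepdims=True)  # restore unit length for dot-product search
    return pruned, kept
```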

1 Like

FWIW, I tried embedding and correlating the two things (small chunks), and it wasn’t great:

Msg0 = "And there sat in a window a certain young man named Eutychus, being fallen into a deep sleep: and as Paul was long preaching, he sunk down with sleep, and fell down from the third loft, and was taken up dead."
Msg1 = "who fell from the loft?"

This resulted in poor cosine similarity:
Dot Product: 0.7983558998154735
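For anyone wanting to reproduce numbers like these, it’s essentially just embedding both strings and taking the dot product. A minimal sketch with the OpenAI Python client (the helper name is mine):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def similarity(msg0: str, msg1: str) -> float:
    """Dot product of two ada-002 embeddings (same as cosine, since they are unit length)."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=[msg0, msg1])
    v0 = np.array(resp.data[0].embedding)
    v1 = np.array(resp.data[1].embedding)
    return float(np.dot(v0, v1))
```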

So this brings up this comment:

So I had GPT-4 translate the Bible verse to a more modern version, with this system message: “Modernize this ancient biblical verse to use modern language and grammar.”

Which then gives me this pair:

Msg0 = "A young man named Eutychus was sitting in a window, having fallen into a deep sleep. As Paul's sermon stretched on, he was overcome with sleep and fell from the third floor window. Those below presumed him dead when they picked him up."
Msg1 = "who fell from the loft?"

With only slightly better cosine similarity:
Dot Product: 0.809386048240508
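The modernization step itself is just a single chat completion; something along these lines, re-using the `client` from the sketch above (this is only an illustration, not the exact code used here):

```python
def modernize(verse: str) -> str:
    """Rewrite a KJV-style verse in modern language before embedding it."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Modernize this ancient biblical verse to use modern language and grammar."},
            {"role": "user", "content": verse},
        ],
    )
    return resp.choices[0].message.content
```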

So it looks like the real problem is that your query is really thin on keywords, so I added more to the query:

Msg0 = "A young man named Eutychus was sitting in a window, having fallen into a deep sleep. As Paul's sermon stretched on, he was overcome with sleep and fell from the third floor window. Those below presumed him dead when they picked him up."
Msg1 = "What was the name of the man, who sat at a window, fell to his death after falling asleep, as told in Paul's sermon."

And now I get a big improvement in cosine similarity:
Dot Product: 0.9323429289764913

So because your query is so sparse, it’s information-starved. It looks like you need to beef it up.

So how do you do this automatically? I just used HyDE.

So I had GPT-4, with the system prompt “Answer this question as it relates to the Bible.” and the user question “Who fell from the loft?”. It gave the answer “A young man named Eutychus fell from the loft as related in the Bible, specifically in Acts 20:9”, which I then correlated:

Msg0 = "A young man named Eutychus was sitting in a window, having fallen into a deep sleep. As Paul's sermon stretched on, he was overcome with sleep and fell from the third floor window. Those below presumed him dead when they picked him up."
Msg1 = "A young man named Eutychus fell from the loft as related in the Bible, specifically in Acts 20:9."

And got a cosine similarity similar to (in fact slightly higher than) the intentionally beefed-up version of the query:
Dot Product: 0.9356396103687558
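Wiring the HyDE step into the search follows the same pattern: generate a hypothetical answer, then embed that instead of the raw question. Again just a sketch, re-using the `client` above:

```python
def hyde_query_vector(question: str) -> list[float]:
    """HyDE: answer the question first, then embed the answer for the vector search."""
    answer = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Answer this question as it relates to the Bible."},
            {"role": "user", "content": question},
        ],
    ).choices[0].message.content
    return client.embeddings.create(
        model="text-embedding-ada-002", input=answer
    ).data[0].embedding
```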

So, bottom line, it looks like you need to use HyDE AND also embed smaller chunks AND modernize the biblical text before you embed. :sweat_smile:

Granted, you could just ask GPT-4, since it was trained on the Bible. But what fun is that? Especially when the embeddings can pull the exact biblical verse back if needed. :face_with_monocle:

But hopefully you see what I am doing, which is transforming both the query and the targets so that they align more and more, including chopping the targets into smaller chunks to increase the correlation.

These transformations are key to building RAG systems that perform.

3 Likes

Wow. Thank you. Yes, I see what you’re doing. And I’m impressed at how you arrived at the conclusions. Right now, this dataset is running on a site with multiple knowledgebases. I need to move it to its own site so that I can implement these changes.

I don’t know if this is the ultimate solution, but it sounds like it. Thank you for your time on this.

1 Like

Conceptually, vector embeddings are combinations of all of the semantic features of the passage, each with some value, represented in 1536-dimensional vector space. The fact that your query matches part of the text exactly doesn’t mean that the query vector is close to the chunk vector, particularly not if that chunk vector has other “peaky” semantic properties.

So, the takeaway is that small chunk size will reduce the other “peaky” properties?

By having fewer properties, yes. Although whether or not that improves your RAG performance overall depends on whether it elevates the intended chunks as intended, or instead produces many very similar smaller chunks which then confound performance by virtue of having lost their distinctive properties.

Curt’s treatment is really the most informative, but basically you can improve cosine similarity either by making the question look more like the stored vectors (more enriched questions), or by making the vectors look more like your query (smaller chunks).

Alright, Gentlemen, I thank you for your input. I’ve come up with a plan.

I already have built into my architecture the ability to add a summary of the main document to each chunk. That summary is part of the vectorized object. So, I should be able to add code that summarizes the chunk itself. In essence, I could reduce the size of each chunk of a Bible chapter, and then add a modern English summary to that chunk.

In addition, I can gear the System and concept prompts to give a clearer indication to the AI that we are dealing with Biblical texts.

My plan isn’t exactly the above, and I do wonder what effect duplicating the text will have, but it is at least pushing me in the right direction while still utilizing the infrastructure I currently have in place.
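Purely as a sketch of the shape of the per-chunk objects I have in mind (the property names are hypothetical, and `chunk_chapter()` / `modernize()` are the helpers sketched earlier in the thread; how these actually get pushed into Weaviate would depend on my schema):

```python
def build_chunk_objects(chunks: list[dict]) -> list[dict]:
    """Attach a modern-English rendering to each small chunk before it is vectorized."""
    objects = []
    for chunk in chunks:
        objects.append({
            "reference": chunk["reference"],          # e.g. "Acts 20:9"
            "original_text": chunk["text"],           # KJV wording, kept for display/citation
            "modern_text": modernize(chunk["text"]),  # what actually gets vectorized
        })
    return objects
```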

What do you all think?

When you summarize, you could be losing some information.

But, I can see the value in summarization, because you get smaller chunks, and maybe you can “modernize” it as well. Just beware of information loss, things dropped after summarizing.

So try it and see. But I think you know the path is really governed by one simple fact, which is that the same text embedded twice gives the exact same (within roundoff) embedding vector.

The closer you can shape the query and targets into similarity, the higher this correlation.

But you don’t want to add too much fake information, or drop critical information in this process, because both of these will fight against your true correlation scores.

So keeping these ideas in mind, transform away! There is no one-size-fits-all answer because each domain can have its unique quirks.

1 Like

@curt.kennedy OMG! You are The Man!

So, I’ve begun re-working my code to accommodate this “Bible translation” scheme. I’ve not yet been able to reduce the chunk size or change the prompts; I actually need to move this dataset to its own site for that. But I was able to jury-rig the ability to create translated (into modern English) versions of each Bible text chunk. This is the result:

Yes, the model still says it doesn’t understand, but this is the first time I’ve gotten the relevant passage, Acts 20, to appear in the top 5 results for this query since I’ve been trying. Normally, it appears after 20 or so citations.

That means translating the Bible into more modern English, which is the vernacular most people today (from English-speaking countries, that is) would use to ask questions, is a solid plan. And I believe reducing the chunk size will reduce the noise and help ensure that the model can actually understand the documents being returned.

Wow. It’s going to take a minute to translate this thing, but once done, it looks like we’re going to have a pretty good Bible Q & A system. Thank you!

2 Likes

Oh yeah, and your other suggestion of elongating the prompt also works!

1 Like