FWIW, I tried embedding and correlating the two texts (as small chunks), and the result wasn't great:
Msg0 = "And there sat in a window a certain young man named Eutychus, being fallen into a deep sleep: and as Paul was long preaching, he sunk down with sleep, and fell down from the third loft, and was taken up dead."
Msg1 = "who fell from the loft?"
This resulted in poor cosine similarity (OpenAI embeddings come back unit length, so the dot product below is the cosine similarity):
Dot Product: 0.7983558998154735
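For reference, here is roughly how these numbers get computed. A minimal sketch, assuming the openai Python client; the embedding model name is an assumption:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

msg0 = ("And there sat in a window a certain young man named Eutychus, "
        "being fallen into a deep sleep: and as Paul was long preaching, "
        "he sunk down with sleep, and fell down from the third loft, "
        "and was taken up dead.")
msg1 = "who fell from the loft?"

# Embed both texts in one call (model name is an assumption).
resp = client.embeddings.create(model="text-embedding-ada-002",
                                input=[msg0, msg1])
e0, e1 = resp.data[0].embedding, resp.data[1].embedding

# The embeddings are unit length, so the raw dot product
# is the same thing as the cosine similarity.
print("Dot Product:", sum(a * b for a, b in zip(e0, e1)))
```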
So this brings up the next idea: I had GPT-4 translate the Bible verse into a more modern version, with this system message: "Modernize this ancient biblical verse to use modern language and grammar."
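That translation step is a single chat call. A sketch, with the model name as an assumption:

```python
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "Modernize this ancient biblical verse to use modern "
                    "language and grammar."},
        {"role": "user", "content": msg0},  # msg0 is the KJV verse from the first sketch
    ],
)
modern_verse = resp.choices[0].message.content
```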
That gives this pair:
Msg0 = "A young man named Eutychus was sitting in a window, having fallen into a deep sleep. As Paul's sermon stretched on, he was overcome with sleep and fell from the third floor window. Those below presumed him dead when they picked him up."
Msg1 = "who fell from the loft?"
With only slightly better cosine similarity:
Dot Product: 0.809386048240508
So it looks like the real problem is that your query is really thin on keywords, so I added more to it:
Msg0 = "A young man named Eutychus was sitting in a window, having fallen into a deep sleep. As Paul's sermon stretched on, he was overcome with sleep and fell from the third floor window. Those below presumed him dead when they picked him up."
Msg1 = "What was the name of the man, who sat at a window, fell to his death after falling asleep, as told in Paul's sermon."
And now I get a big improvement in cosine similarity:
Dot Product: 0.9323429289764913
So because your query is so sparse, it's information-starved. Looks like you need to beef it up.
So how do you do this automatically? I just used HyDE (Hypothetical Document Embeddings).
So I gave GPT-4 the system prompt "Answer this question as it relates to the Bible." and the user question "Who fell from the loft?". It answered "A young man named Eutychus fell from the loft as related in the Bible, specifically in Acts 20:9", which I then correlated:
Msg0 = "A young man named Eutychus was sitting in a window, having fallen into a deep sleep. As Paul's sermon stretched on, he was overcome with sleep and fell from the third floor window. Those below presumed him dead when they picked him up."
Msg1 = "A young man named Eutychus fell from the loft as related in the Bible, specifically in Acts 20:9."
And get a cosine similarity on par with (in fact slightly above) the intentionally beefed-up version of the query:
Dot Product: 0.9356396103687558
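For completeness, here is the whole HyDE step in code. A rough sketch under the same assumptions as above (openai client, gpt-4, and an assumed embedding model):

```python
from openai import OpenAI

client = OpenAI()

def embed(texts):
    """Embed a list of texts; returns one vector per text (model is an assumption)."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return [d.embedding for d in resp.data]

def hyde(question):
    """HyDE: generate a hypothetical answer, then embed THAT instead of the raw question."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Answer this question as it relates to the Bible."},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

chunk = ("A young man named Eutychus was sitting in a window, having fallen "
         "into a deep sleep. As Paul's sermon stretched on, he was overcome "
         "with sleep and fell from the third floor window. Those below "
         "presumed him dead when they picked him up.")

hypothetical = hyde("Who fell from the loft?")
e_chunk, e_hyp = embed([chunk, hypothetical])
print("Dot Product:", sum(a * b for a, b in zip(e_chunk, e_hyp)))
```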
So, bottom line, it looks like you need to use HyDE AND also embed smaller chunks AND modernize the biblical text before you embed. :sweat_smile:
Granted, you could just ask GPT-4, since it was trained on the Bible. But what fun is that? Especially when the embeddings can pull the exact biblical verse back if needed. :face_with_monocle:
But hopefully you see what I am doing: transforming both the query and the targets so that they align more and more, including chopping the targets into smaller chunks to increase the correlation (see the sketch below).
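For the chunking piece, even a naive sentence-based splitter gets you most of the way. A sketch (real splitters are smarter about verse and paragraph boundaries, and `max_chars` is an arbitrary choice):

```python
import re

def chunk_text(text, max_chars=300):
    """Naively split text into small chunks on sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for s in sentences:
        # Flush the current chunk before it grows past the size limit.
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = ""
        current = (current + " " + s).strip()
    if current:
        chunks.append(current)
    return chunks
```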
These transformations are key to building RAG systems that perform.