How does ChatGPT recognize contextual similarity between sentences

ChatGPT is able to understand if two sentences have the same context or are related.
For example, ‘How’s the weather today?’ and ‘It is pretty good out there.’

However, if I ask ChatGPT to provide the code to achieve this via the API, it gives me code using the GPT API or SentenceTransformer that embeds the sentences into vectors and computes a similarity score; in other words, it treats this as a semantic similarity task. Computing the similarity score using embeddings does not yield the same results as ChatGPT itself. For instance, the example sentences above only get a 0.2 similarity score, but it rises to 0.4 if I add ‘today’ to the response sentence.

Therefore, I wonder how I should do it. Should I use a Q&A model instead?

Objective: Use a model to determine if sentences are related or have the same context despite having totally different meanings.

Welcome to the community!

This is a fun and challenging question! You might not realize it, but this is at the heart of semantics and linguistics.

Sentences can be related in a wide variety of ways; I would not use that as a measurable metric without understanding what relation you’re asking for. So, I would stick to context interpretation.

What I need to understand though is what you mean by “same context despite having totally different meanings.” If two speakers mean different things, but the context is the same, that seems to break one of Grice’s Maxims to my understanding. How can two speakers be talking about two different things but maintain context while doing so? There may be misinterpretation, but again, that would mean the context would be different.

I need more details of what you’re trying to look for and what you’re trying to do with the AI.

Thank you for your response.

Here is my objective: I am trying to build a model that can determine if a student is distracted during class through his/her conversation with a teacher.

Therefore, here are some examples.
Case 1: a Q&A-type scenario.
teacher: ‘What is your favourite food?’
student: ‘Basketball’
model: irrelevant / low similarity score

Case 2:
teacher: ‘How is the weather today?’
student: ‘It is pretty good.’ or ‘It is pretty good today.’
model: relevant / high similarity score

The similarity scores can be cosine similarity or Pearson correlation.
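For clarity, here is what I mean by those two metrics, sketched in plain Python on made-up toy vectors (just for illustration, not my actual pipeline):

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pearson_correlation(u, v):
    # Pearson correlation is cosine similarity of the mean-centered vectors
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine_similarity([a - mean_u for a in u],
                             [b - mean_v for b in v])

# Made-up toy vectors standing in for sentence embeddings
u = [0.2, 0.8, 0.1]
v = [0.3, 0.7, 0.2]
print(cosine_similarity(u, v))
print(pearson_correlation(u, v))
```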

My attempt: Load a pre-trained LLM and calculate the similarity score of the sentence embeddings.

If I directly ask ChatGPT about the relationship between two sentences, it is able to output an excellent answer. How can I achieve this via the API and obtain a binary classification result or a confidence level from a model?


Here is the code I tried, using sentence embeddings and calculating the Pearson correlation coefficient.

from scipy import stats
from sentence_transformers import SentenceTransformer, util

def calculate_similarity(sentence1, sentence2, model_name='paraphrase-MiniLM-L6-v2'):
    # Load pre-trained model
    model = SentenceTransformer(model_name)
    # Encode sentences
    embeddings1 = model.encode(sentence1, convert_to_tensor=True)
    embeddings2 = model.encode(sentence2, convert_to_tensor=True)
    # Pearson correlation between the two embedding vectors
    # (cosine similarity is also available via util.pytorch_cos_sim)
    return stats.pearsonr(embeddings1.cpu().numpy(), embeddings2.cpu().numpy()).statistic

# Example usage
sentence1 = "How are you "
sentence2 = "I am bad"

similarity_score = calculate_similarity(sentence1, sentence2)

# Threshold for determining relevance (adjust as needed)
threshold = 0.2
if similarity_score < threshold:
    print("The sentences are contextually irrelevant.")
else:
    print("The sentences may be contextually relevant.")

It shows a very low score:

The sentences are contextually irrelevant.

Then I tried using the ChatGPT API as a workaround to achieve my objective.

import openai

# Define the input sentences
sentence1 = 'How are you'
sentence2 = 'I am bad'
input_text = f"Are the sentences related? Only answer 1 for yes or 0 for no. '{sentence1}' and '{sentence2}'"

# Call the OpenAI GPT-3 API (legacy Completion endpoint)
response = openai.Completion.create(
    model="text-davinci-003",
    prompt=input_text,
    temperature=0.1,
    max_tokens=1,
)

# Print the generated response
print(response.choices[0].text.strip())

The ChatGPT API yields a very good result, but it is not quite fast enough, considering I am having it generate a full response instead of querying a binary classifier.



It’s kind of funny, actually: from my understanding, these functions are already, in a way, baked into how it produces natural language responses, which is why I think I struggled to figure out what to do about this initially. Also, I was afk for a bit.

Anyways, I think I finally figured out the problem here: contextual similarity (your goal) and semantic similarity (what cosine and Pearson metrics measure) are not the same thing.

This is fun to talk about, because I don’t get to go this in-depth often. Essentially, contextual interpretation is part of ChatGPT’s underlying proprietary architecture, and it’s been a difficult problem for NLP researchers to solve for some time now. In fact, contextual understanding is exactly why these LLMs are trained on such vast amounts of data; so far, we only seem to see accurate contextual interpretation at significantly large parameter counts.

The differences between semantic similarity and contextual similarity are difficult to see on the surface, partly because most people haven’t taken advanced courses in linguistic pragmatics, and partly because it’s genuinely hard.

Contextual understanding requires more data than one-shot Q&As to interpret, in both linguistics and NLP. It was bothering me why your setup looks like it should work and seems like the correct approach, yet in practice cannot be so. It’s because examples like your sentence pairs are typically used to explain concepts in classrooms, but don’t suffice for in-the-wild interpretation. You can use Grice’s maxims to kind of get there (Grice’s Maxims of Conversation: The Principles of Effective Communication – Effectiviology) with single Q&A examples, but it still typically requires more information than that.

Your current approach (Pearson correlation and cosine similarity) essentially analyzes syntactic and content relationships. Content is not the same as context: content can match without being relevant. Relevancy is hard to measure because it is dynamic, differs for each conversation, can flip completely on a single utterance, and cannot be assessed effectively from single sentences.

An example:

Sentence 1: John went to the bank.

Let us imagine what the next sentence would be. Think in your head what is most likely to come after this sentence. What would John do at the bank?
Keeping that idea in your head, let’s say this is the next sentence:

Sentence 2: He fished for hours.

Were you expecting a sentence like that? What would you guess the similarity scores would be? Do you think this would pass your contextual similarity scores? What if I told you I could make a guess here that it would probably pass your cosine similarity / pearson correlation test?

Now, I can’t say for certain, but I bet you were expecting something along the lines of:

He talked to the Bank teller and withdrew money.

Here’s the trick: a “bank” can be a money bank OR a river bank, semantically speaking. Cosine similarity will rate both continuations as passing, because it measures semantic similarity, and both senses exist in natural language use. They are semantically relevant if you look at just the two sentences.
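Just to make that concrete, here’s a toy sketch (plain Python bag-of-words, not your embedding pipeline) showing how a pure content-overlap score latches onto the shared word “bank” without knowing anything about the surrounding context:

```python
import math
from collections import Counter

def bow_cosine(s1, s2):
    # Bag-of-words cosine: scores shared tokens, ignores all context
    c1, c2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    dot = sum(c1[w] * c2[w] for w in set(c1) & set(c2))
    norm1 = math.sqrt(sum(n * n for n in c1.values()))
    norm2 = math.sqrt(sum(n * n for n in c2.values()))
    return dot / (norm1 * norm2)

first = "John went to the bank"
river = "He fished for hours at the bank"
money = "He talked to the bank teller and withdrew money"

# Both continuations overlap with the first sentence on content words,
# so both score well above zero, regardless of which one fits the context.
print(bow_cosine(first, river))
print(bow_cosine(first, money))
```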


The sentence preceding “John went to the bank” can determine contextual understanding fairly quickly.

If the preceding sentence would be:

John ran out of money.

Then “He fished for hours.” is contextually irrelevant. That would change, however, if the preceding sentence became

John notices how pretty the river is outside.

Then “He fished for hours.” is contextually relevant.

Like I said, this isn’t kiddy stuff; this is not the “easy” part of linguistics.

Taking all of this into account, I think I can now safely say that in order to create and define a measurement system for contextual similarity, you need more data than 2 sentences. That is a must. It’s possible to figure out some tricks once you have that data, but I can tell you right now using cosine similarity stuff on what you have (Q/A pairs) will mislead you.

Don’t get me wrong, impressive approach, truly. But it’s not measuring what you think it is.

Hopefully this helps you more than confuses you, and I apologize if I did!

@assa8945 Now, I’m sure the first question you’re gonna ask is “How can I solve my problem, then?”

As I think about it, I think the best way to handle your problem would be to leverage the black box algorithms within GPT to provide a contextual scenario using the chat setup. Ask it for contextual relevancy, and force it to respond only with a yes/no answer, with no extra details. Then, transmute the yes/no answer into a 1 or 0 that you can then use as a binary classification. Each context query should be treated as its own chat conversation to avoid intermingling the contextual data.

As it stands now, this is the best way to gain some sort of contextual relevancy score. If you want a mathematical or probabilistic formula to determine a concrete value, then unfortunately, that does not appear to be currently possible.
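To sketch what that could look like in code: the prompt wording, the helper names, and the model choice below are all my own assumptions (I haven’t run this against your data), but the shape is a single-turn chat per sentence pair, forced down to a 1/0 answer:

```python
def build_prompt(teacher_line, student_line):
    # Force a bare 1/0 answer so the reply can be used directly as a label
    return (
        "Given a teacher's question and a student's reply, answer 1 if the "
        "reply is contextually relevant to the question, or 0 if it is not. "
        "Answer with only the digit.\n"
        f"Teacher: '{teacher_line}'\nStudent: '{student_line}'"
    )

def parse_binary(answer_text):
    # Map the model's raw reply ('1', '0', maybe with whitespace) to an int
    return 1 if answer_text.strip().startswith("1") else 0

def is_relevant(teacher_line, student_line):
    import openai  # deferred so the helpers above work without the package
    # Fresh single-turn conversation per pair, so no context intermingles
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{"role": "user",
                   "content": build_prompt(teacher_line, student_line)}],
    )
    return parse_binary(response.choices[0].message["content"])

# Example usage (needs an API key configured):
# is_relevant("What is your favourite food?", "Basketball")
```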

Thank you so much. It is amazing that you have written such a long paragraph to explain the problem to me.
I am new to NLP so I was trying to figure out if there is a contextual similarity downstream task.

Hence, based on your response, I have summarized a few highlights; please correct me if I am wrong.

  1. Contextual relevance is still a task that has not been solved by the researchers.
  2. Therefore, the way that ChatGPT answers such a contextual relevance question is purely based on its black box LLM.
  3. The best way to achieve my objective is to leverage the ChatGPT API in the following way:
import openai

# Define the input sentences
sentence1 = 'How are you'
sentence2 = 'I am bad'
input_text = f"Are the sentences related? Only answer 1 for yes or 0 for no. '{sentence1}' and '{sentence2}'"

# Call the OpenAI GPT-3 API (legacy Completion endpoint)
response = openai.Completion.create(
    model="text-davinci-003",
    prompt=input_text,
    temperature=0.1,
    max_tokens=1,
)

# Print the generated response
print(response.choices[0].text.strip())

Bingo! Yup, that’s a perfect way to sum it all up!

I’m not as good with the API as others, I’ll admit, but just be prepared for potential…adjustments to your prompting to ensure it doesn’t deviate from the 1/0 response. I think they recently integrated a respond-in-JSON-format option, iirc, which could help it maintain the desired response format.

The prompting thing though is a much easier problem to solve, fix, and help with. So, this is definitely the direction I would recommend going in.

Glad you found it useful!

If you think of this forum like documentation, a detailed response like this could hopefully be extremely helpful for others down the road too. Your problem is a perfect demonstration of just how complex a seemingly simple problem can be with these models, and expert-level discussions like this also help the advanced folks looking for answers to complex problems.

Keep looking into NLP - the more these models kick off, the more useful it’s going to be for sure.