How does ChatGPT recognize contextual similarity between sentences

ChatGPT is able to understand if two sentences have the same context or are related.
For example, ‘How’s the weather today?’ and ‘It is pretty good out there.’

However, if I ask ChatGPT to provide the code to achieve this via the API, it gives me code using the GPT API or SentenceTransformer that embeds the sentences into vectors and computes a similarity score; in other words, it treats this as a semantic similarity task. Computing the similarity score using embeddings does not yield the same results as ChatGPT itself. For instance, the example sentences above only get a 0.2 similarity score, but it rises to 0.4 if I add ‘today’ to the response sentence.

Therefore, I wonder how I should do it. Should I use a Q&A model instead?

Objective: Use a model to determine if sentences are related or have the same context despite having totally different meanings.

Welcome to the community!

This is a fun and challenging question! You might not realize it, but this is at the heart of semantics and linguistics.

Sentences can be related in a wide variety of ways; I would not use that as a measurable metric without understanding what relation you’re asking for. So, I would stick to context interpretation.

What I need to understand though is what you mean by “same context despite having totally different meanings.” If two speakers mean different things, but the context is the same, that seems to break one of Grice’s Maxims to my understanding. How can two speakers be talking about two different things but maintain context while doing so? There may be misinterpretation, but again, that would mean the context would be different.

I need more details of what you’re trying to look for and what you’re trying to do with the AI.

Thank you for your response.

Here is my objective: I am trying to build a model that can determine if a student is distracted during class through his/her conversation with a teacher.

Therefore, here are some examples.
Case 1: a Q&A-type scenario.
teacher: ‘What is your favourite food?’
student: ‘Basketball’
model: irrelevant / low similarity score

Case 2:
teacher: ‘How is the weather today?’
student: ‘It is pretty good.’ or ‘It is pretty good today.’
model: relevant / high similarity score

The similarity scores can be cosine similarity or Pearson correlation.
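For clarity, here is what I mean by those two metrics, sketched in plain Python on made-up toy vectors (just for illustration, not my actual pipeline):

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pearson_correlation(u, v):
    # Pearson correlation is cosine similarity of the mean-centered vectors
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine_similarity([a - mean_u for a in u],
                             [b - mean_v for b in v])

# Made-up toy vectors standing in for sentence embeddings
u = [0.2, 0.8, 0.1]
v = [0.3, 0.7, 0.2]
print(cosine_similarity(u, v))
print(pearson_correlation(u, v))
```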

My attempt: Load a pre-trained LLM and calculate the similarity score of the sentence embeddings.

If I directly ask ChatGPT about the relationship between two sentences, it is able to output an excellent answer. How can I achieve this via the API and obtain a binary classification result or a confidence level from a model?


Here is the code I tried, using sentence embeddings and calculating the Pearson correlation coefficient.

from scipy import stats
from sentence_transformers import SentenceTransformer, util

def calculate_similarity(sentence1, sentence2, model_name='paraphrase-MiniLM-L6-v2'):
    # Load pre-trained model
    model = SentenceTransformer(model_name)
    # Encode sentences
    embeddings1 = model.encode(sentence1, convert_to_tensor=True)
    embeddings2 = model.encode(sentence2, convert_to_tensor=True)
    # Pearson correlation between the two embedding vectors
    # (cosine similarity is also available via util.pytorch_cos_sim)
    return stats.pearsonr(embeddings1.cpu().numpy(), embeddings2.cpu().numpy()).statistic

# Example usage
sentence1 = "How are you "
sentence2 = "I am bad"

similarity_score = calculate_similarity(sentence1, sentence2)

# Threshold for determining relevance (adjust as needed)
threshold = 0.2
if similarity_score < threshold:
    print("The sentences are contextually irrelevant.")
else:
    print("The sentences may be contextually relevant.")

It shows a very low score:

The sentences are contextually irrelevant.

Then I tried using the ChatGPT API as a workaround to achieve my objective.

import openai

# Define the input sentences
sentence1 = 'How are you'
sentence2 = 'I am bad'
input_text = f"Are the sentences related? Only answer 1 for yes or 0 for no. '{sentence1}' and '{sentence2}'"

# Call the OpenAI GPT-3 API (legacy Completion endpoint)
response = openai.Completion.create(
    model="text-davinci-003",
    prompt=input_text,
    temperature=0.1,
    max_tokens=1,
)

# Print the generated response
print(response.choices[0].text.strip())

The ChatGPT API yields a very good result, but it is not quite fast enough, considering I am having it generate a full response instead of querying a binary classifier.



It’s kind of funny, actually: from my understanding, these functions are already, in a way, baked into how it produces natural language responses, which is why I think I struggled to figure out what to do about this initially. Also, I was afk for a bit.

Anyways, I think I finally figured out the problem here: contextual similarity (your goal) and semantic similarity (what cosine and Pearson metrics measure) are not the same thing.

This is fun to talk about, because I don’t get to go this in-depth often. Essentially, contextual interpretation is part of ChatGPT’s underlying proprietary architecture, and it’s been a difficult problem for NLP researchers to solve for some time now. In fact, contextual understanding is exactly why these LLMs are trained on such vast amounts of data; so far, we only seem to see accurate contextual interpretation at significantly large parameter counts.

The differences between semantic similarity and contextual similarity are difficult to see on the surface, partly because most people haven’t taken advanced courses in linguistic pragmatics, and partly because it’s genuinely hard.

Contextual understanding requires more data than one-shot Q&As to interpret, in both linguistics and NLP. It was bothering me why your setup looks like it should work and seems like the correct approach, yet in practice cannot be so. It’s because examples like your sentence pairs are typically used to explain concepts in classrooms, but don’t suffice for in-the-wild interpretation. You can use Grice’s maxims to kind of get there (Grice’s Maxims of Conversation: The Principles of Effective Communication – Effectiviology) with single Q&A examples, but it still typically requires more information than that.

Your current approach (Pearson correlation and cosine similarity) essentially analyzes syntactic and content relationships. Content is not the same as context: content can match without being relevant. Relevancy is hard to measure because it is dynamic, differs for each conversation, can flip completely on a single utterance, and cannot be assessed effectively from single sentences.

An example:

Sentence 1: John went to the bank.

Let us imagine what the next sentence would be. Think in your head what is most likely to come after this sentence. What would John do at the bank?
Keeping that idea in your head, let’s say this is the next sentence:

Sentence 2: He fished for hours.

Were you expecting a sentence like that? What would you guess the similarity scores would be? Do you think this would pass your contextual similarity scores? What if I told you I could make a guess here that it would probably pass your cosine similarity / pearson correlation test?

Now, I can’t say for certain, but I bet you were expecting something along the lines of:

He talked to the Bank teller and withdrew money.

Here’s the trick: a “bank” can be a money bank OR a river bank, semantically speaking. Cosine similarity will rate both continuations as passing, because it measures semantic similarity, and both senses exist in natural language use. They are semantically relevant if you look at just the two sentences.
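Just to make that concrete, here’s a toy sketch (plain Python bag-of-words, not your embedding pipeline) showing how a pure content-overlap score latches onto the shared word “bank” without knowing anything about the surrounding context:

```python
import math
from collections import Counter

def bow_cosine(s1, s2):
    # Bag-of-words cosine: scores shared tokens, ignores all context
    c1, c2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    dot = sum(c1[w] * c2[w] for w in set(c1) & set(c2))
    norm1 = math.sqrt(sum(n * n for n in c1.values()))
    norm2 = math.sqrt(sum(n * n for n in c2.values()))
    return dot / (norm1 * norm2)

first = "John went to the bank"
river = "He fished for hours at the bank"
money = "He talked to the bank teller and withdrew money"

# Both continuations overlap with the first sentence on content words,
# so both score well above zero, regardless of which one fits the context.
print(bow_cosine(first, river))
print(bow_cosine(first, money))
```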


The sentence preceding “John went to the bank” can determine contextual understanding fairly quickly.

If the preceding sentence would be:

John ran out of money.

Then “He fished for hours.” is contextually irrelevant. That would change, however, if the preceding sentence became

John notices how pretty the river is outside.

Then “He fished for hours.” is contextually relevant.

Like I said, this isn’t kiddy stuff; this is not the “easy” part of linguistics.

Taking all of this into account, I think I can now safely say that in order to create and define a measurement system for contextual similarity, you need more data than 2 sentences. That is a must. It’s possible to figure out some tricks once you have that data, but I can tell you right now using cosine similarity stuff on what you have (Q/A pairs) will mislead you.

Don’t get me wrong, impressive approach, truly. But it’s not measuring what you think it is.

Hopefully this helps you more than confuses you, and I apologize if I did!

@assa8945 Now, I’m sure the first question you’re gonna ask is “How can I solve my problem, then?”

As I think about it, I think the best way to handle your problem would be to leverage the black box algorithms within GPT to provide a contextual scenario using the chat setup. Ask it for contextual relevancy, and force it to respond only with a yes/no answer, with no extra details. Then, transmute the yes/no answer into a 1 or 0 that you can then use as a binary classification. Each context query should be treated as its own chat conversation to avoid intermingling the contextual data.

As it stands now, this is the best way to gain some sort of contextual relevancy score. If you want a mathematical or probabilistic formula to determine a concrete value, then unfortunately, that does not appear to be currently possible.
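To sketch what that could look like in code: the prompt wording, the helper names, and the model choice below are all my own assumptions (I haven’t run this against your data), but the shape is a single-turn chat per sentence pair, forced down to a 1/0 answer:

```python
def build_prompt(teacher_line, student_line):
    # Force a bare 1/0 answer so the reply can be used directly as a label
    return (
        "Given a teacher's question and a student's reply, answer 1 if the "
        "reply is contextually relevant to the question, or 0 if it is not. "
        "Answer with only the digit.\n"
        f"Teacher: '{teacher_line}'\nStudent: '{student_line}'"
    )

def parse_binary(answer_text):
    # Map the model's raw reply ('1', '0', maybe with whitespace) to an int
    return 1 if answer_text.strip().startswith("1") else 0

def is_relevant(teacher_line, student_line):
    import openai  # deferred so the helpers above work without the package
    # Fresh single-turn conversation per pair, so no context intermingles
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{"role": "user",
                   "content": build_prompt(teacher_line, student_line)}],
    )
    return parse_binary(response.choices[0].message["content"])

# Example usage (needs an API key configured):
# is_relevant("What is your favourite food?", "Basketball")
```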

Thank you so much. It is amazing that you have written such a long paragraph to explain the problem to me.
I am new to NLP so I was trying to figure out if there is a contextual similarity downstream task.

Hence, based on your response, I have summarized a few highlights; please correct me if I am wrong.

  1. Contextual relevance is still a task that has not been solved by the researchers.
  2. Therefore, the way that ChatGPT answers such a contextual relevance question is purely based on its black box LLM.
  3. The best way to achieve my objective is to leverage the ChatGPT API in the following way:
import openai

# Define the input sentences
sentence1 = 'How are you'
sentence2 = 'I am bad'
input_text = f"Are the sentences related? Only answer 1 for yes or 0 for no. '{sentence1}' and '{sentence2}'"

# Call the OpenAI GPT-3 API (legacy Completion endpoint)
response = openai.Completion.create(
    model="text-davinci-003",
    prompt=input_text,
    temperature=0.1,
    max_tokens=1,
)

# Print the generated response
print(response.choices[0].text.strip())

Bingo! Yup, that’s a perfect way to sum it all up!

I’m not as good with the API as others, I’ll admit, but just be prepared for potential…adjustments to your prompting to ensure it doesn’t deviate from the 1/0 response. I think they recently integrated a respond-in-JSON-format option, iirc, which could help it maintain the desired response format.

The prompting thing though is a much easier problem to solve, fix, and help with. So, this is definitely the direction I would recommend going in.

Glad you found it useful!

If you think of this forum like documentation, a detailed response like this could hopefully be extremely helpful for others down the road too. Your problem is a perfect demonstration of just how complex a seemingly simple problem can be with these models, and expert-level discussions like this also help the advanced folks looking for answers to complex problems.

Keep looking into NLP - the more these models kick off, the more useful it’s going to be for sure.