Use embeddings to measure how well an answer fits the question

Hi,
Based on this old TensorFlow repo, github[.com]/tensorflow/tfjs-models/tree/master/universal-sentence-encoder, I thought I would try using embeddings computed with OpenAI to measure how well an answer fits a given question. I've run some tests, but the results seem inconsistent. Here are some examples:
First example:

| model | question | answer | dot product |
|---|---|---|---|
| text-embedding-3-small | what's your name? | the pen is on the table | 0.2100624394110477 |
| text-embedding-3-large | what's your name? | the pen is on the table | 0.15766028555933212 |
| text-embedding-ada-002 | what's your name? | the pen is on the table | 0.7728555090641087 |

Since the answer is unrelated to the question, a lower score is better here: text-embedding-3-large behaves much better than text-embedding-ada-002.

Second example:

| model | question | answer | dot product |
|---|---|---|---|
| text-embedding-3-small | what's your name? | my name is marco | 0.4585776699017621 |
| text-embedding-3-large | what's your name? | my name is marco | 0.43940777514817453 |
| text-embedding-ada-002 | what's your name? | my name is marco | 0.8278902967827175 |

Here the answer does fit, so a higher score is better: text-embedding-ada-002 behaves much better than text-embedding-3-small, which behaves slightly better than text-embedding-3-large.

Third example:

| model | question | answer | dot product |
|---|---|---|---|
| text-embedding-3-small | I've broken my laptop, what can I do? | come to our store to have some assistance | 0.1888791306336703 |
| text-embedding-3-large | I've broken my laptop, what can I do? | come to our store to have some assistance | 0.1511331633002332 |
| text-embedding-ada-002 | I've broken my laptop, what can I do? | come to our store to have some assistance | 0.7774396662317862 |

Again, text-embedding-ada-002 behaves much better than text-embedding-3-small, which behaves slightly better than text-embedding-3-large.

What are your thoughts? My code is pretty simple:

from openai import OpenAI
import numpy as np

client = OpenAI(api_key=api_key)
models = ["text-embedding-3-small", "text-embedding-3-large", "text-embedding-ada-002"]
texts = ["I've broken my laptop, what can I do?", "come to our store to have some assistance"]
for m in models:
    resp = client.embeddings.create(input=texts, model=m)
    embedding_a = resp.data[0].embedding
    embedding_b = resp.data[1].embedding
    # embeddings are unit-length, so the dot product is the cosine similarity
    similarity_score = np.dot(embedding_a, embedding_b)
    print(m, "|", texts[0], "|", texts[1], "|", similarity_score)

Do you think the dot product is the right measure here? Based on platform.openai[.com]/docs/guides/embeddings/which-distance-function-should-i-use I thought so, but the results don't look good to me.
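
For what it's worth, the docs say OpenAI embeddings are normalized to length 1, so the dot product and the cosine similarity should give the same number. A quick sanity check sketch (it assumes api_key is defined as in the snippet above):

from openai import OpenAI
import numpy as np

client = OpenAI(api_key=api_key)  # assumes api_key is already defined
resp = client.embeddings.create(
    input=["what's your name?", "my name is marco"],
    model="text-embedding-3-small",
)
a = np.array(resp.data[0].embedding)
b = np.array(resp.data[1].embedding)

print(np.linalg.norm(a), np.linalg.norm(b))  # both should be ~1.0
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.dot(a, b), cosine)  # should be (almost) identical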

The older ada-002 model typically only produces similarities in a narrow 0.7–0.9 band, I believe, while the newer models spread their scores across a much more typical 0–1 range.
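
So rather than reading raw numbers across models, one option is to compare each model against its own baseline of unrelated pairs. A rough sketch, where the unrelated sentences are just made-up examples:

from openai import OpenAI
import numpy as np

client = OpenAI(api_key=api_key)  # assumes api_key is already defined

question = "what's your name?"
answer = "my name is marco"
# made-up unrelated sentences used as a per-model baseline
unrelated = ["the pen is on the table", "it will rain tomorrow", "the oven needs cleaning"]

for m in ["text-embedding-3-small", "text-embedding-3-large", "text-embedding-ada-002"]:
    resp = client.embeddings.create(input=[question, answer] + unrelated, model=m)
    vecs = [np.array(d.embedding) for d in resp.data]
    q, a, noise = vecs[0], vecs[1], vecs[2:]
    score = np.dot(q, a)
    baseline = np.mean([np.dot(q, n) for n in noise])
    # the margin above the model's own baseline is more comparable than the raw score
    print(m, "| score:", round(score, 3), "| baseline:", round(baseline, 3), "| margin:", round(score - baseline, 3))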

OK, but I would have expected the new -3 models to produce somewhat higher scores. In the second example they didn't even reach a similarity of 0.5, which isn't great, and even in the third example the scores are pretty low.

Instead of taking the values at face value, try comparing them against more entries. Embeddings capture the essence of semantics. Are there connections between a question and an answer that fits it? Definitely. But there are also a lot of other connections being considered at the same time.
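
For example, here is a sketch that scores several candidate answers against one question and looks at the ranking rather than the absolute values (the candidate answers are made up for illustration):

from openai import OpenAI
import numpy as np

client = OpenAI(api_key=api_key)  # assumes api_key is already defined

question = "I've broken my laptop, what can I do?"
# made-up candidate answers; only some of them really fit the question
candidates = [
    "come to our store to have some assistance",
    "try restarting it and check whether it is still under warranty",
    "our pizzas are baked in a wood-fired oven",
    "the pen is on the table",
]

resp = client.embeddings.create(input=[question] + candidates, model="text-embedding-3-small")
q = np.array(resp.data[0].embedding)
scores = [(c, float(np.dot(q, np.array(d.embedding)))) for c, d in zip(candidates, resp.data[1:])]

# what matters is which candidate ranks highest, not the absolute similarity value
for c, s in sorted(scores, key=lambda x: x[1], reverse=True):
    print(round(s, 3), "|", c)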

Trying to find an answer for a question is the basis of RAG: using unstructured semantics through embeddings.

Do you have an example in mind?

I guess I should first ask: what exactly are you trying to capture?

  1. Whether the answer fits in a grammatical sense, or
  2. Whether it is correct because it's drawn from a knowledge database.

For 1, I would gather two groups of >100 datapoints (easy to do with GPT): one group of question/answer combinations that make sense, labelled "fits", and another group of incoherent combinations, labelled "doesn't fit". Calculate a centroid for each group (which would ideally amplify the important dimensions you want to focus on and dampen the wildly varying ones), and then classify a new pair by comparing it against the two centroids. Pretty straightforward; see the sketch below.
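
A minimal sketch of that centroid approach, assuming the labelled pairs are embedded as concatenated question + answer strings (the handful of examples below stand in for the >100 GPT-generated datapoints):

from openai import OpenAI
import numpy as np

client = OpenAI(api_key=api_key)  # assumes api_key is already defined
MODEL = "text-embedding-3-small"

def embed(texts):
    resp = client.embeddings.create(input=texts, model=MODEL)
    return np.array([d.embedding for d in resp.data])

# stand-ins for the >100 GPT-generated datapoints per label
fits = [
    "what's your name? my name is marco",
    "I've broken my laptop, what can I do? come to our store to have some assistance",
]
does_not_fit = [
    "what's your name? the pen is on the table",
    "I've broken my laptop, what can I do? our pizzas are baked in a wood-fired oven",
]

# centroid of each labelled group, renormalized so dot products behave like cosine similarity
centroid_fit = embed(fits).mean(axis=0)
centroid_fit /= np.linalg.norm(centroid_fit)
centroid_no_fit = embed(does_not_fit).mean(axis=0)
centroid_no_fit /= np.linalg.norm(centroid_no_fit)

def answer_fits(question, answer):
    v = embed([question + " " + answer])[0]
    # "fits" if the pair is closer to the fits-centroid than to the doesn't-fit one
    return np.dot(v, centroid_fit) > np.dot(v, centroid_no_fit)

print(answer_fits("what's your name?", "my name is marco"))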

If you’re looking for QA this can be found anywhere