I created two embeddings using ADA for two words in Portuguese: cimento (cement) and sorvete (ice cream). The cosine similarity between them was 0.8, which is clearly wrong. Any thoughts?
For single-word embeddings in a language other than English, I think you picked the wrong model. Ada is best for comparing larger texts, and user inputs to texts.
The best reported performance there for semantic word embeddings is GloVe 600. You can see how to use the Gensim model and obtain single-word vectors here:
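A minimal sketch of the idea with toy vectors (the Gensim calls are shown in comments; the model name and word vectors below are illustrative, not measured values):

```python
import numpy as np

# With Gensim you would load pretrained vectors, e.g.:
#   import gensim.downloader as api
#   wv = api.load("glove-wiki-gigaword-300")  # English GloVe; for Portuguese
#   sim = wv.similarity("cement", "ice")      # you'd load Portuguese vectors
#
# Under the hood, the score is just the cosine of the two word vectors:

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors: unrelated words should score near 0, not 0.8.
v_cement = [1.0, 0.1, 0.0]
v_icecream = [0.0, 0.1, 1.0]
print(cosine_similarity(v_cement, v_icecream))  # small value near 0
```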
Just tested for larger texts and the results are still bad.
Using stjiris/bert-large-portuguese-cased-legal-tsdae-gpl-nli-sts-MetaKD-v0, which is a Sentence Transformer for Portuguese, the results are consistent for both single words and sentences.
I would expect the OpenAI embeddings to be far better than anything else.
Even in English, you will get a cosine similarity of 0.8 for non-related things.
I have talked about this problem to death. So there are two solutions.
- Easy: Tighten your bounds for similarity.
- Hard: Post-process the embedding vectors, removing the correlations and biases in the model that cause this.
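A minimal sketch of the "easy" option, assuming ada-002 scores for your data mostly land in roughly the 0.7–1.0 band (the cutoffs below are illustrative assumptions; calibrate them on your own corpus):

```python
def rescale_similarity(sim, lo=0.70, hi=1.00):
    """Map ada-002's compressed cosine range onto [0, 1].
    lo/hi are assumed bounds for the observed score band, not
    measured constants -- tune them on your own data."""
    return max(0.0, min(1.0, (sim - lo) / (hi - lo)))

print(rescale_similarity(0.8))  # the suspicious 0.8 becomes a weak ~0.33
```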
This is not a Portuguese thing! It’s an Ada-002 thing!
Yes, that’s exactly what I got. 0.8 for any kind of comparison.
Thanks for clarifying!
I see a lot of tutorials teaching how to do RAG (with OpenAI), and the basic principle is to search a vector database to filter relevant text associated with the question. This search is done via semantic similarity. How can this work reasonably well given this cosine-similarity behavior?
It works for most people because folks aren't paying attention to the similarity value itself. They are only grabbing the top-K highest-scoring things, so the non-isotropic behavior of the model is swept under the rug.
They aren't looking for de-correlated or orthogonal things, or measuring how uncorrelated the "top K" results really are. Another reason is that, with RAG, they feed the top-K answers into the LLM and let the LLM decide whether each one is related or not.
So because the LLM can also sort out the non-correlated results, it also sweeps the issue under the rug.
There are many theories on why this happens, such as the model overcompensating in its hidden states. It seems to happen in most models, but it is really bad in ada-002.
Here is the paper discussing this:
Also, here is code where I implemented "ABTT" (All-But-The-Top), which essentially de-biases and de-correlates your embeddings.
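For reference, the core of All-But-The-Top (Mu & Viswanath, 2018) fits in a few lines. This is a generic sketch of the technique, not Curt's exact code: center the embeddings, then strip the top-d principal components that carry the shared bias.

```python
import numpy as np

def abtt(embeddings, d=2):
    """All-But-The-Top: subtract the mean embedding, then remove the
    top-d principal components, which carry the common directions that
    inflate cosine similarity between unrelated texts."""
    X = embeddings - embeddings.mean(axis=0)
    # Principal directions via SVD of the centered matrix.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    top = vt[:d]                     # shape (d, dim)
    return X - X @ top.T @ top       # drop projections onto top-d directions

# Demo: vectors sharing a big common offset look spuriously similar.
rng = np.random.default_rng(0)
raw = rng.normal(size=(100, 64)) + 10.0   # shared bias of +10 everywhere
a, b = raw[0], raw[1]
before = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

proc = abtt(raw, d=2)
a, b = proc[0], proc[1]
after = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(before, after)  # ~0.99 before; near 0 after processing
```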
Thanks a lot Curt.
I will definitely take a look at those papers.