I am quite new to text embeddings and text comparison in general, but I want to use text-embedding-ada-002 to compare a job description with various resumes. The idea is to sort the resumes based on their similarity with the job description.
However, I have a little difficulty in understanding the results I am getting.
The process I am following is as follows:
1) Extract the text from the job description PDF
2) Clean the text by removing non-ASCII characters, punctuation, and new lines, and converting to lowercase
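The cleaning step can be sketched like this (a minimal sketch with a hypothetical `clean_text` helper; the exact rules you use may differ):

```python
import re

def clean_text(raw: str) -> str:
    """Strip non-ASCII characters, punctuation, and newlines, then lowercase."""
    text = raw.encode("ascii", errors="ignore").decode("ascii")  # drop non-ASCII
    text = re.sub(r"[^\w\s]", " ", text)                         # drop punctuation
    text = re.sub(r"\s+", " ", text)                             # collapse newlines/whitespace
    return text.strip().lower()

print(clean_text("Senior Java™ Developer —\n5+ years' experience!"))
# -> senior java developer 5 years experience
```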
3) Tokenize the text with the below code
4) Get the embedding with the below code (note that `data` in the response is a list, one entry per input, so the embedding is at index 0): job_description_embedding = openai.Embedding.create(encoding_format="float", model="text-embedding-ada-002", input=[tokenized_jd])['data'][0]['embedding']
5) For each resume, repeat the same process to get its embedding, then compare it with the job description embedding using cosine_similarity(job_description_embedding, candidate_resume_embedding)
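For reference, the comparison step can be sketched with plain NumPy (a hypothetical `cosine_similarity` helper; ada-002 embeddings come back unit-normalized, so in practice the cosine reduces to a dot product):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 2-D vectors standing in for the 1536-dimensional ada-002 embeddings.
print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))  # ~0.707
```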
What is puzzling to me is that all the comparisons show very high similarity, even when the resume has nothing to do with the job. For example, the lowest similarity score I have gotten is 0.72, for a nutritionist's resume against a senior Java developer job description, where I was expecting a much lower score.
I wonder if anyone can shed some light on this? Is this normal? Am I using ada-002 wrongly, or could there be another reason for these results?
This has been discussed extensively before. Basically, the embedding vectors from ada-002 are squished together into a narrow cone of the embedding space, leading to high cosine similarities no matter what you compare. You just need to recalibrate your similarity expectations, or you can batch-process the vectors with PCA to make them more isotropic (spread out) before comparing.
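A rough sketch of the PCA idea, using synthetic vectors to mimic the narrow-cone behavior (the data here is simulated, not real ada-002 embeddings):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Simulate anisotropic embeddings: every vector shares one large common
# component, the way ada-002 vectors bunch together in a narrow cone.
common = rng.normal(size=256)
emb = common + 0.1 * rng.normal(size=(100, 256))

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos(emb[0], emb[1]))  # very high, even though the per-vector noise differs

# PCA with whitening centers the data (removing the shared direction)
# and rescales each principal component, spreading the vectors out.
white = PCA(n_components=50, whiten=True).fit_transform(emb)
print(cos(white[0], white[1]))  # much closer to 0
```

This only works on a batch of vectors at once (PCA needs the whole collection to find the dominant directions), which is why it is a post-processing step rather than something you can apply to a single embedding.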