Questions on the use of the text-embedding-ada-002 model

Hello everyone,

I am quite new to text embeddings and text comparison in general, but I want to use text-embedding-ada-002 to compare a job description with various resumes. The idea is to sort the resumes based on their similarity with the job description.

However, I am having some difficulty understanding the results I am getting.

The process I am following is:

  1. Extract the text from the job description PDF.
  2. Clean the text by removing non-ASCII characters, punctuation, and new lines, and converting it to lowercase.
  3. Tokenize the text with the code below:
    import tiktoken

    enc = tiktoken.encoding_for_model("text-embedding-ada-002")
    tokenized_jd = enc.encode(jd)
  4. Get the embedding with the code below:
    import openai

    job_description_embedding = openai.Embedding.create(encoding_format="float", model="text-embedding-ada-002", input=[tokenized_jd])['data'][0]['embedding']

  5. For each resume, repeat the same process to get its embedding, then compare it with the job description embedding using:
    cosine_similarity(job_description_embedding, candidate_resume_embedding)
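
The post does not show where `cosine_similarity` comes from, so for reference here is a minimal sketch of what such a helper typically computes. It assumes both embeddings are plain lists of floats of equal length; the implementation is an illustration, not necessarily the one used above.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy 2-d vectors, only to show the range of the score:
same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

The score depends only on the angle between the two vectors, not on their lengths, which is why two embeddings that sit in the same general region of the space score high regardless of topic.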

What is puzzling to me is that all the comparisons show very high similarity, even if the resume has nothing to do with the job. For example, the lowest similarity score I have gotten is 0.72, for a resume from a nutritionist, even though the job description was for a senior Java developer. I was expecting a much lower score.

Can anyone shed some light on this? Is this normal? Am I using ada-002 wrongly, or could there be another reason for these results?

Thank you in advance

This has been discussed extensively before. Basically, the embedding vectors from ada-002 get squished together, leading to high similarity scores no matter what. You can either adjust your expectations of what counts as a high or low score, or, going forward, post-process a batch of embeddings with PCA to make them more isotropic (spread out).

See …
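
One way to read the PCA suggestion above, sketched under assumptions: subtract the mean of a batch of embeddings, project out the top principal direction(s), and renormalize. The function names and the `n_remove` parameter are hypothetical, NumPy is assumed to be available, and this is not necessarily the exact procedure from the linked discussion.

```python
import numpy as np

def fit_isotropy_transform(embeddings, n_remove=1):
    """Estimate the batch mean and top principal directions (hypothetical helper)."""
    X = np.asarray(embeddings, dtype=float)
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions of the centered batch.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_remove]

def apply_isotropy_transform(vec, mean, top_components):
    """Center one embedding, remove the dominant directions, renormalize."""
    v = np.asarray(vec, dtype=float) - mean
    v = v - top_components.T @ (top_components @ v)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

# Synthetic stand-in for real embeddings: a large shared offset plus noise,
# which mimics vectors that all point in roughly the same direction.
rng = np.random.default_rng(0)
batch = 5.0 + rng.normal(size=(50, 8))
mean, comps = fit_isotropy_transform(batch, n_remove=1)
adjusted = np.array([apply_isotropy_transform(v, mean, comps) for v in batch])
```

The mean and components need to be fitted on a reasonably large batch, and the same fitted values applied to both the job description and every resume, otherwise the adjusted vectors are not comparable.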

The similarity scores are more useful relative to each other than as absolute values. If you pick another resume from a developer, do you get a higher similarity score?
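
A small sketch of reading the scores relative to each other: sort them and rescale within the batch so the spread is easier to see. `rank_relative` and the candidate names are hypothetical, and the numbers are illustrative values in the range reported in this thread, not real results.

```python
def rank_relative(scores):
    """Sort by raw score and rescale each score to [0, 1] within the batch."""
    if not scores:
        return []
    lo, hi = min(scores.values()), max(scores.values())
    spread = hi - lo
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [
        (name, raw, (raw - lo) / spread if spread > 0 else 0.0)
        for name, raw in ranked
    ]

# Illustrative values only (hypothetical candidates).
scores = {"nutritionist": 0.72, "java_dev_a": 0.80, "java_dev_b": 0.90}
for name, raw, rel in rank_relative(scores):
    print(f"{name:14s} raw={raw:.2f} relative={rel:.2f}")
```

The rescaled value only says where a resume sits within this particular batch, so it changes whenever resumes are added or removed.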

Thank you very much for sharing! I will definitely try it out!

Yes, most of the time the similarity scores in that case are noticeably higher, mostly around 0.8 to almost 0.9.