Questions on the use of text-embedding-ada-002 model

tzortzinak95 · May 25, 2023, 9:32pm

Hello everyone,

I am quite new to text embeddings and text comparison in general, but I want to use text-embedding-ada-002 to compare a job description with various resumes. The idea is to sort the resumes based on their similarity with the job description.

However, I am having some difficulty understanding the results I am getting.

The process I am following is as follows:

1) Extract the text from the job description PDF.
2) Clean the text by removing non-ASCII characters, punctuation, and new lines, and converting to lowercase.
3) Tokenize the text with the code below:

enc = tiktoken.encoding_for_model("text-embedding-ada-002")
tokenized_jd = enc.encode(jd)

4) Get the embedding with the code below:

job_description_embedding = openai.Embedding.create(encoding_format="float", model="text-embedding-ada-002", input=[tokenized_jd])['data'][0]['embedding']

5) For each resume, repeat the same process, get the embedding, and then compare the job description embedding with the current resume embedding using:

cosine_similarity(job_description_embedding, candidate_resume_embedding)
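
The post does not say where cosine_similarity comes from, so for reference here is a minimal sketch of what such a function usually computes, written with NumPy. The function body is an assumption for illustration, not necessarily the implementation used above.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 2-d vectors, chosen only to show the range of the score.
same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

The score depends only on the angle between the vectors, not on their lengths, and in general ranges from -1 to 1.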

What is puzzling to me is that all the comparisons show very high similarity, even if the resume has nothing to do with the job. For example, the lowest similarity score I have gotten is 0.72, for a resume from a nutritionist, even though the job description was for a senior Java developer, so I was expecting a lower similarity score.

I wonder if anyone can shed some light on this. Is this normal? Am I using ada-002 wrongly, or could there be another reason for these results?

Thank you in advance

curt.kennedy · May 25, 2023, 9:43pm

This has been discussed extensively before. Basically, the embedding vectors from ada-002 get squished together, leading to high correlations no matter what. You can either adjust your expectations of what counts as a high or low correlation, or post-process the vectors in a batch using PCA to make them more isotropic (spread out) going forward.
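
The reply does not spell out the PCA step, so the following is only one possible reading of it: subtract the mean embedding, project out the top principal direction(s), and re-normalize. The function names, the n_remove parameter, and the synthetic data (random vectors sharing a large common offset, standing in for real embeddings) are all assumptions made for illustration.

```python
import numpy as np

def fit_isotropy_transform(embeddings, n_remove=1):
    """Learn the mean and the top principal directions from a batch of vectors."""
    X = np.asarray(embeddings, dtype=float)
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_remove]

def apply_isotropy_transform(vectors, mean, top):
    """Center, project out the top directions, and rescale to unit length."""
    V = np.atleast_2d(np.asarray(vectors, dtype=float)) - mean
    V = V - (V @ top.T) @ top
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    return V / np.where(norms == 0, 1.0, norms)

def mean_pairwise_cosine(X):
    """Average cosine similarity over all distinct pairs of rows."""
    U = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = U @ U.T
    n = len(U)
    return float((S.sum() - np.trace(S)) / (n * (n - 1)))

# Synthetic stand-in for embeddings: a shared offset plus small random noise,
# which makes every pair look similar, much like the scores in the question.
rng = np.random.default_rng(0)
raw = np.full(16, 5.0) + rng.normal(size=(200, 16))

mean, top = fit_isotropy_transform(raw, n_remove=1)
processed = apply_isotropy_transform(raw, mean, top)

raw_mean = mean_pairwise_cosine(raw)
processed_mean = mean_pairwise_cosine(processed)
print(raw_mean, processed_mean)
```

In this sketch the transform has to be fitted once on a representative batch and then applied unchanged to every new vector, job description and resumes alike, otherwise the resulting scores are not comparable.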

See …

CreatiCode · May 25, 2023, 9:45pm

The similarity scores are more useful when compared with each other than as absolute values. If you pick another resume, one from a developer, do you get a higher similarity score?
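
To illustrate the point about relative scores, here is a small sketch that sorts candidates by their similarity to the job description instead of judging any single score on its own. The three-dimensional vectors and the candidate names are made up for the example; real ada-002 embeddings would take their place.

```python
import numpy as np

def rank_by_similarity(query_vec, candidates):
    """Return (name, cosine similarity) pairs, most similar first."""
    q = np.asarray(query_vec, dtype=float)
    q = q / np.linalg.norm(q)
    scored = []
    for name, vec in candidates.items():
        v = np.asarray(vec, dtype=float)
        scored.append((name, float(q @ (v / np.linalg.norm(v)))))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical toy vectors standing in for real embeddings.
jd_vec = [1.0, 0.2, 0.0]
resume_vecs = {
    "nutritionist": [0.3, 0.1, 0.9],
    "java_developer": [0.9, 0.3, 0.1],
    "python_developer": [0.8, 0.1, 0.4],
}

ranking = rank_by_similarity(jd_vec, resume_vecs)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Only the order of the results matters for sorting resumes, so a compressed score range still gives a usable ranking as long as the gaps between candidates are consistent.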

tzortzinak95 · May 26, 2023, 9:12pm

Thank you very much for sharing! I will definitely try it out!

tzortzinak95 · May 26, 2023, 9:16pm

Yes, most of the time the similarity scores I get in that case are considerably higher, mostly around 0.8 to almost 0.9.
