Comparing Texts Using ADA Embeddings

_hep · February 20, 2024, 1:38am

My Intention: I want to take chunks of text and compare them to understand if they match. Similar to Candidates Resumes and Job Descriptions to understand who would be a good fit.

Current Process: I’m using the ADA embeddings model to compare the texts and make a similarity score based on cosine similarity. Unfortunately, the model seems to be lacking the nuance in the text.

Can anyone help with improving accuracy in using these models to make predictions like this, in fields that have similar requirements but different offerings?

_j · February 20, 2024, 1:44am

Step 1:

compare text using text-embedding-3-large embeddings.

Step 2:

move over to language AI entity extraction, classification, search functions, for understanding beyond simply how similar the semantics and form of writing are.

anon10827405 · February 20, 2024, 1:46am

In what nuances are you finding it fails to capture?

jorgeintegrait · February 20, 2024, 2:09am

@_hep , you can also try with more advanced retrieval techniques. For capturing the value of nuances, embedding addaptors may be the most appropriate (akin to retraining an embedding model).

For this, you will want to start with a batch that are either human-validated or run through an LLM for synthetic validation. Checkout the deeplearning.ai course from Chroma on advanced retrieval to get started if you’ve never heard of this (free).

Still, for a use case like this I agree with @_j that extracting entities and mathing those is likely to render a better final result than embedding space distance.

_hep · February 20, 2024, 2:03pm

Yea, let’s stay with the job description idea. If I’m a company offering several engineering positions, like electrical, mechanical, etc. I may have several words each description that would cause a high similarity score. However, the education needs and the work experience type are different. The embeddings fail to recognize these smaller difference and give high similarity based simply on the similar wordings.

I either need to be able to weight different words more heavily, or be able to train it further, or as @_j mentions I’ll need to add additional types of methods.

_j · February 20, 2024, 2:06pm

You are correct. You cannot prompt embeddings to align its goals with yours.

You can do that with a language model.

anon10827405 · February 20, 2024, 2:49pm

You can pre-process the document for the model to interpret easier.

What it seems to me that you need is metadata for filtering/additional computing, and also a predefined structure which can expand certain segments (like education).

Finally, I would use a re-ranker to target specific fields. So you can target a broad idea and then narrow down by focusing on the nuanced differences & using a more powerful model.

The idea here is that the metadata can help capture names unknown to the model (such as universities), then you can use a re-ranker to “focus” on the nuances you are looking for.

curt.kennedy · February 20, 2024, 3:49pm

Because resumes and job descriptions are heavily keyword centric, I would lead the search with a keyword based algorithm.

I implemented a fancy log-normalized TF-IDF algorithm last year.

Here is an excerpt from my journal on this, but it’s information theoretic centric and efficient computationally since it uses set intersections in memory for the search. Sorry if it lacks some context, it’s just ripped straight from my journal But feel free to ask questions.

Curt’s Journal (Aug 4, 2023):
The idea is you find joint keyword information of a set of keywords in each of your documents (or chunks of text) just by set intersections. But also leverage “full information” and reach a combined joint Maximum Information Cross-Correlation (MIX) to reach the desired correlation.

The desired correlation has the most keyword matches, and weighted by the frequency of each keyword. The information content is measured by the rarity of the word (rare = more information) and the frequency of the word (more frequent the word is in the input/output correlation, the higher the MIX). This is mathematically expressed, for a single common word in the intersection, as log(1 + r)*log(N/R) see below for more details.

Then the words are ranked by their information content and their frequency in a document (MIX).

The information/frequency of a word W in document D is then log(1+r)*log(N/R), where r is the frequency of W in D, and R is the total number of documents word W is in, and N is the total number of documents. Note: “log” here is the natural log, or log base e = 2.718281828459… See the reference screenshot above and compare with the IDF formula for the BM25 Wikipedia entry. In this same sprit, we use the same formula, and add a word frequency term log(1 + r). This was chosen intuitively initially as the sum of 1 + 1/2 +1/3 + 1/4 +… up to the frequency of the word. But this is approximated by the logarithm, and this is compatible with the other logarithm containing the rarity of the word. So this algorithm emphasizes joint common information in the frequency of the word, weighted by the rarity of the word. Both of these are added up in the input words and target correlation documents to find the perfect keyword search match. The diminishing harmonic sum penalizes words mentioned “too much” in a document but also takes into account the rarity of the word with the log(N/R) term. This log(1 + r) term is believed to be “more fair” than a straight linear term ‘r’. The 1 is added in log(1 + r) to represent 0 information when 0 occupancies of something happens … so even though this doesn’t happen in practice (due to filtering in the in-memory data construction) is still makes mathematical sense in the general case.

So you would implement something like this. Then combine it non-coherently via RRF or RSF with the embedding leg. You would have a slider that weights between this and the embeddings, that you would have to tune, which would properly weight the joint ranking between the two.

My guess is you would favor the rankings somewhat higher for the rankings from the keyword based algorithm, over the embeddings, simply because resumes and job descriptions are keyword minefields. It’s part of the game, essentially SEO-ing your resume to satisfy the matching algorithms deployed by recruiters these days.

Diet · February 20, 2024, 4:55pm

I had a product in 2022 that was relatively robust

this was part of one of the earlier davinci model prompts:

\n\n\n
## Given the information above, fill the following JSON object. It's ok to have empty arrays. For key concepts, if there is a mention of 'something similar', include a word that describes all concepts that are meant. Don't include whitespaces in JSON output.
{
  "Job Title": string,
  "Company": string,
  "Job Post Date": string,
  "Pay Range": string,
  "Location": string ,
  "Remote Job": "yes" | "no" | "unknown",
  "Job Responsibilities": {"description": string, "key concepts": string[]}[],
  "Required Experience": {"description": string, "key concepts": string[], "years"?: number}[],
  "Preferred Experience": {"description": string, "key concepts": string[], "years"?: number}[],
  "Required Education": {"description": string, "key concepts": string[]}[],
  "Preferred Education": {"description": string, "key concepts": string[]}[],
  "Required Certifications": {"description": string, "key concepts": string[]}[],
  "Preferred Certifications":  {"description": string, "key concepts": string[]}[]
}

I ran a similar prompt on the subjects’ original CV sections, used embeddings to correlate the key concepts and description. That was then visualized in a matrix showing which past experience matched with which prior project. You could then easily compute a percent match, but I used it to rewrite the CV in the language the job description used to optimize ATS ranking.

So, in summary:

extract elements with LLM (maybe multiple passes)
element-wise correlation with embeddings
binary match score

anon10827405 · February 20, 2024, 5:02pm

I do love me some hybrid search but I’d argue against this being a solution.

Resumes are structured. This provides a plethora of advantages to be taken advantage of.

We can benefit by embedding both the segments AND the whole resume by using a database like Weaviate. So we can perform an initial holistic similarity match, and then start to focus on segments to capture the nuanced semantic differences.

We can also perform typical SQL queries on this data as it can be transformed into a schema object. So keyword search may not provide the benefits in comparison to using SQL-like queries.

Going back to the issue: The embedding model fails to capture the semantic nuances (I believe). I don’t think this is an issue of saying “X University”, or “Y University”.

Yes. This can be easily described as: Use Large Language Models with UNSTRUCTURED semantics. Separate & categorize as much as possible.

Areas like Work Experience require an SQL-like query to be combined with semantics as it not only involves the work title but date & time spent.

Diet · February 20, 2024, 5:27pm

Unfortunately I don’t have any screen shots on hand,

but you can just take the hadamard product of the binary subject x job matrix and a broadcast of the project times. you can then do a row sum to get the time spent on the concept.

anon10827405 · February 20, 2024, 8:37pm

Would you mind explaining a bit more for a simpleton like myself?

curt.kennedy · February 21, 2024, 1:32am

The SQL queries seem nebulous to me. I’d have to see some examples. But breaking the resume into chunks, and overall is good.

This is close to my approach, but correlating information overlap in input to targets, not element-wise with embeddings.

I’d have to see an example, not following the binary stuff.

One of the things I forget to mention with my approach, is that an intermediate product is the set of highly correlated keywords, non-correlated keywords, and stop-words. Essentially three (3) disjoint sets.

So highly correlated keywords are used in search. BUT … the non-correlated rare keywords could be of interest. So things like universities, specific past companies, etc, not specifically called out in your data would form this set.

So ranking everything based on it’s information content, and high information overlapping (the search) and high information non-overlapping (things you are probably interested in) would drive follow-on searching strategies, perhaps even involving the LLM to reframe things, or extract these extra important things that aren’t in your target data set.

So, when you get into the details internally, you learn what you know, and also learn about what you don’t know, or have mapped.

This could be useful to know and leverage.

Example:

“I am an AI engineer that went to Princeton and worked at Intel.”

Correlated Keywords: [“AI”, “engineer”]
Non-correlated Keywords: [“Princeton”, “Intel”]
Stop words: [“I”, “am”, “that”, “went”, “to”, “and”, “worked”, “at”]

So to leverage this, pull all sentences, or paragraphs, that contain the Non-correlated keywords, and have the LLM summarize the content for you.

Also, you can use resumes as training data for this system. So you upload each resume as a document. You label this resume, “AI engineer”, “Nurse”, “CEO”, whatever, then you correlate with this information approach, and jointly with embeddings, for the recommended classification for the new unseen resume.

Then after you gather enough new ones, you enter them into the pile, with the correct classification especially if it was wrong, to train the system for future encounters with new resumes.

anon10827405 · February 21, 2024, 2:14am

If the person is looking to capture keywords I would still look at metadata & filtering for the sole reason that they can transform the resume to their structure and perform more controlled, layered logical operations on it.

I think as a baseline of simplistic operations both keywords and SQL-like queries are quite similar. BUT, what if you want to layer additional operations like date ranges or need to capture specific keywords in a specific context?

2007 - 2014
Professional Table Tennis Player

Balled hard in numerous countries & universities like Princeton, Harvard, Ivy. Made me independent, blah blah blah

But, I think the intent here is to match job descriptions to resumes. So, yeah . Not sure if what I’m saying applies

I guess maybe this could still apply if this autonomous job-matching service(?) wanted to apply filters such as “Person has worked something law-related in the past 5 years”

Side Note: To the OP, if you are trying to build this kind of service DM me!!! I have something similar to this built!! (not the descriptions part but I have job postings aggregated from numerous sources all structured the same way!)

curt.kennedy · February 21, 2024, 2:29am

Agree, but the question is, at least for me, how do you generate keywords, and what consists of the metadata, and how is the metadata used exactly?

So what I am doing generates the keywords tuned you your documents, not using any assumptions. So you know your keywords, you also know your stop words, also tailored to your data by setting an information threshold, either in log space or linear space.

Then the other information bearing words are those that do not match to your keywords or to stop words.

So “filtering” is the correlation of these keywords, to your database, and also embeddings if you want to capture semantics. You could mix the two (embeddings (dense vectors) and keywords (sparse vectors)), or lead with one, like lead with keywords, then focus on semantics within the high keywords subset. This is also a filter.

For time gating, this could be done beforehand in the memory structure, or at runtime if the timestamp of each entry is available in memory.

You could even form “silos” of differing information corpus’ that encompass different domains or modes of information. Do this to avoid cross-contaminating across disparate branches of information. For example, information on becoming an engineer vs. information on getting a job in engineering are different silos, because they have different intent behind them.

A broader control plane is needed to classify intent and direct the information to the correct silo.

anon10827405 · February 21, 2024, 2:46am

The LLM would be able to create these keywords when transforming the uploaded resume into the structure. The keywords would be provided from the given list of resumes from the initial “wide-net” search.

2007 - 2014
Professional Table Tennis Player

Balled hard in numerous countries & universities like Princeton, Harvard, Ivy. Made me independent, blah blah blah

Could be converted to

{
  date_range: [2007, 2014],
  title: "Professional Table Tennis Player",
  tags: ["travel", "competitive", "table tennis"]
  ...
}

So really it’s whatever OP decides is worth using if they are intending to add filters.

Fair point. So really it boils down to what exactly the OP intends to do here. If keywords are sufficient, I would go for it.

If it’s what I am thinking (an autonomous matching service) then the filters would provide additional services that may be necessary for certain questions Have they worked as an Excavator in the last 5 years?. Would be as simple as viewing how many currently found resumes have excavator in tags or title (probably a lot if they are looking for machine operators), and then potentially running date ranges.

curt.kennedy · February 21, 2024, 2:59am

Yes! And this was the example I had in mind to make a “smart” system.

For example, silly example , but suppose the user uploaded a joke resume:

2003-2007 President of the United States
1990-2003 Covert Assassin, similar to Matt Damon in the Borne series
1985-1990 Social media influencer, before the internet!
??? Director of Blade Runner. I swear!

OK, great. So you probably see these all the time, classify it to your JOKE_SILO. And respond back with … here are available jobs that match your credentials … and spit out a random set of joke job titles to match from this silo

You do this instead of fuzzily match to your true data, and look like a fool.

This intent can be formed my matching (keyword/embeddings) to previously labeled resumes (the filter) and then puts you into a silo for further downstream processing.

But it all boils down to the use case.

But going back to the OP’s issue:

I see this all the time on the forum. We have users correlated against a large, unstructured set, and then get lackluster results.

To solve this, you need a filter. And I am saying filter on correlating to previously labeled things, take argmax or some average to previous. Then the user falls directly into a specific information silo (this is like a generalized metadata key), but calls into RAM a unique set of information to focus on. Then you focus on this, with whatever search, like embeddings or keywords, or both. You can also have the LLM hypothesize other queries within that silo (like HyDE) or redirect to another silo depending on the objective.

I think we are saying the same thing … but are implementing it differently. I am relying more on correlation with the past, without relying on the LLM, because I think the LLM filtering route is risky because of hallucinations.

The LLM’s role is just the “prompted speaker” role, which only outputs information retrieved for it in the prompt through all the correlation pre-filtering, and additional deeper filtering that I am talking about.

People would say my approach is risky because I haven’t labeled everything. While true, my AI/ML pipeline requires labeling of past data periodically to update the search pre-filtering. This also doesn’t involve a fine-tune either. Just pure correlation on previous ground truth.

So I am putting more faith in search by using stable tools, like multiple embedding engines combined coherently, that can take an outage of any K < N models. Plus faith in keywords, if anything, as a worst case backup if all embedding endpoints are down …

Also eases my cognitive load when encoutering fine-tunes … a recent flub of mine was exposed when I switched a fine-tune late last year to Babbage-002 by just training with the previous JSONL file, and we are getting unexpected results, that weren’t caught for a few months. So some good ole’ scar tissue here.

Diet · February 21, 2024, 9:28pm

so this is the correlation of a subject’s history with the atomized job description requirements (after separating conjunctions and all that stuff)

let’s call it M, either it matches or it doesn’t. experience contains projects, education, etc.

we then extract the time from the experience

broadcast hours times requirements.length

hadamard or elementwise multiplication H o M, you can think of it as a dropout mask or something

and then you get the row sum

and then you have experience on requirement.

makes sense?

also, happy cake day!

frz7 · September 12, 2024, 11:09am

Hi @Diet,

First, thank you to you and all the participants in this thread, it is very helpful!

I have some questions about your approach, I’m not sure I understand everything

Do you generate one json per CV, or one json for each CV section like (experience 1, … n, skills section etc) ?

Are you generating an embedding for each key concept from a CV and a job description ?

So the matrice M should like this:

	C1 (Project management)	C2 (Software development)	C3 (Agile)
J1 (project management)	1	0	0
J2 (software dev)	0	1	0
J3 (Agile)	0	0	1
J3 (Devops)	0	0	0

Thanks in advance for your clarification

Diet · September 12, 2024, 5:46pm

Welcome to the community!

It’s been a while (getting close to two years now!)

But I think you got the gist of it.

I think embeddings were used to match up the key concepts.

Jobs are broken up into different sections, like qualifications, what you’ll be doing, requirements, etc.

Each of these sections have key concepts, and the section they’re in generally just indicate whether it’s required or optional.

so it would be more like this (each job can fill multiple reqs)

Column 1	Column 2	Column 3
	J1: Tech Lead	J2: Enterprise Dev
Req1: Project Management	TRUE	_
Req2: Jira	TRUE	_
Req3: Rational Team Concert	_	TRUE
Req4: AI frameworks	TRUE	_
Req5: Jakarta EE	_	TRUE
Req6: SDLC	TRUE	TRUE

One json per job (I did it the other way round than what you’re asking) - the personal experience repository would be augmented as more jobs get filed and people remember “oh, i did that too, I guess I do have experience in that”.

I don’t think you need to split up a CV unless you’re hitting the token limit, so you might be able to get away with a single gen per person, if that’s your question. While some CVs are pretty long and complex, the only limiting factor is the output length, with a good structured prompt.

I hope I answered everything, let me know if I skipped something or you have further questions

Topic		Replies	Views
The length of the embedding contents API	48	33425	December 13, 2023
RAG is failing when the number of documents increase API	35	17189	December 17, 2024
How to feed data for completions, instead of using prompt/answer fine-tuning format? API	25	17566	December 17, 2023
How can I use Embeddings with Chat GPT 3-5 Turbo Prompting	39	48103	December 12, 2023
Train (fine-tune) a model with text from books or articles API	62	27648	November 30, 2023

Comparing Texts Using ADA Embeddings

Related topics