Embeddings do not capture negative expressions?

I am trying to use embeddings to make recommendations from user-provided preferences. I found that embeddings do not seem able to capture negative expressions, like “I don’t like…”.

I would appreciate it if anyone could suggest a solution.

Example:

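>>> import numpy as np
>>> from openai import OpenAI
>>> client = OpenAI()  # reads OPENAI_API_KEY from the environment
>>> m1 = 'this is a channel about football game'
>>> m2 = 'I like sports'
>>> m3 = 'I do not like football'

>>> a = client.embeddings.create(input=[m1, m2, m3], model="text-embedding-ada-002")
>>> a1, a2, a3 = a.data[0].embedding, a.data[1].embedding, a.data[2].embedding
>>> # ada-002 embeddings are unit length, so the dot product equals cosine similarity
>>> np.dot(a1, a2), np.dot(a1, a3)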
(0.8131731681972549, 0.8261072236929238)

As you can see from the example above, m3 expresses a dislike of football, but its similarity with the football channel is very high.

This could cause me to wrongly recommend a football channel to a user who dislikes football.

The results you are seeing are because m1 and m3 are both about football, which makes them more semantically similar than m1 is to something about sports in general.

This is expected behaviour.

2 Likes

I think the classic answer is to subtract.
So, embedding(‘sports’) - embedding(‘football’).
You will probably have to renormalize.
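A minimal sketch of the subtract-and-renormalize idea, in plain Python. The two 3-d vectors are made-up stand-ins for embedding(‘sports’) and embedding(‘football’), and the helper names are hypothetical; real ada-002 vectors have 1536 dimensions.

```python
import math

def normalize(v):
    """Scale v to unit length; refuse a zero vector, which has no direction."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def subtract_and_renormalize(a, b):
    """Return the unit vector pointing from b toward a."""
    return normalize([x - y for x, y in zip(a, b)])

# Toy stand-ins (assumption): not real embeddings.
sports = normalize([0.9, 0.3, 0.3])
football = normalize([0.9, 0.4, 0.1])

delta = subtract_and_renormalize(sports, football)
print(delta)
```

Whether the resulting direction is useful for retrieval depends on the embedding model, so treat this as something to test rather than a known-good recipe.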

1 Like

Have you tried this with ada-002?

I’m asking because the normalized delta vector could easily land outside ada-002’s small, concentrated hyper-cone, so that any new embedding would be far away from it.

The only way to get close to this would be to subtract other things, and see if it aligns with this “out of bounds” newly created vector.

Here is a rough visualization:

The new delta vector is pointing way outside of the cone in this picture. So any new things embedded will not align with this new vector, since they all live in the patch.
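To make the geometry concrete, here is a rough sketch using synthetic vectors (an assumption: these are random points clustered around one shared axis, not ada-002 output). Every pair inside the cone has a high cosine similarity, while the difference of two of them is nearly orthogonal to everything in the cone.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity; scale-invariant, so no renormalizing is needed."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

random.seed(0)
DIM = 16

def cone_vector():
    # A fixed component along the shared axis plus small noise elsewhere,
    # which keeps every vector inside a narrow cone around that axis.
    return [1.0] + [random.uniform(-0.1, 0.1) for _ in range(DIM - 1)]

vectors = [cone_vector() for _ in range(5)]

min_pairwise = min(
    cosine(vectors[i], vectors[j])
    for i in range(len(vectors))
    for j in range(i + 1, len(vectors))
)

# The delta of two cone vectors loses the shared axis entirely.
delta = [x - y for x, y in zip(vectors[0], vectors[1])]
max_delta_alignment = max(abs(cosine(delta, v)) for v in vectors)

print("lowest similarity inside the cone:", round(min_pairwise, 3))
print("best alignment with the delta:   ", round(max_delta_alignment, 3))
```

This only illustrates the concern; it says nothing about how strong the effect is for a real model.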

2 Likes

nope, haven’t tried. Valid concern, for sure.

1 Like

Are you suggesting that OpenAI embeddings capture only word-level semantics (like a word-to-word match for “football” or “sports”)?

If so, this is a little disappointing. I thought OpenAI embeddings could capture sentence-level semantics (like the meaning of a whole sentence).

That’s not what I’m suggesting at all. But, if you need to find the relevant context for someone who says they don’t like football, the fact a channel is about football is more important for them than it is to someone who says they like sports—if for no other reason than the model can ensure it doesn’t recommend that channel.

Do you understand now?

1 Like

Thanks a lot for helping!

I am a newbie in this domain. Here is my thought:

  • For relevance (always positive), what you said makes sense. m1 and m3 are definitely relevant.
  • Similarity (cosine or dot product) could be negative and in this case should be negative.

Maybe none of this matters. The important thing for me is to find a solution for my recommendation. :grinning:

Hmm, did you just sneak a deathstar into the chat?

1 Like

The keyword is “essence” of semantics.

It’s slightly counter-intuitive. I fell into this pitfall at first as well. My first attempt was “Hot” and “Cold”, thinking they would be completely different. They are, but only along one particular axis of measurement. In essence they are very similar: they both represent temperature, they both can be used as measurements, they both can be used to describe items and people, and they both can cause injury.

Imagine you drew “Hot”, and “Cold”, or “Likes”, and “Dislikes”, and then you had to create a brainstorm of all the meanings behind it. You would find that they truly share a lot of the same characteristics.

Same with “likes” and “dislikes”. Both carry the same meaning in essence. The embedding model does not perform the logic that you intuitively want it to.

What you are looking for is 2 separate classifications. One for the sport, one for the preference. This can be done with embeddings and/or LLMs.

You can set points in the embedding space and then see how these items compare. I’m not going to try to fluff the numbers: you can see that “no preference” isn’t perfect, so false positives are an issue.

You would also need to classify all the sports and perform the same comparison tests.
You could put this all together with a fine-tuned model to output {PREFERENCE-SPORT} as well, up to you. Honestly a base model would probably work fine.

But Completion will soon be gone as OpenAI covers up the ability to spit out training data verbatim.

3 Likes

With the embeddings, you should get better results if you correlate to previously known (labeled) embeddings.

So if your input is:

“I do not like football”

And your previous dislike embeddings for the label “Football - Dislikes” are:

“Football is bad”
“Football is dumb”

Etc.

Then you correlate your input to all previous labeled embeddings, and select the category corresponding to the highest correlation.

So you can do this with labeled embeddings, instead of a fine-tune, even though the fine-tune isn’t a bad idea either.

The nice thing about the embeddings classifier is that you can add or remove embeddings on the fly, whereas there is no easy “undo” operation in a fine-tune.
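A minimal sketch of such a labeled-embedding classifier. The 3-d vectors are made-up stand-ins for real embeddings of statements like “Football is bad”, and the names `labeled` and `classify` are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# (label, embedding) pairs. Toy vectors (assumption), not real embeddings.
labeled = [
    ("Football - Dislikes", [0.9, 0.1, -0.4]),
    ("Football - Dislikes", [0.8, 0.2, -0.5]),
    ("Football - Likes", [0.9, 0.1, 0.4]),
    ("Football - Likes", [0.8, 0.2, 0.5]),
]

def classify(query, labeled):
    """Return (label, score) of the labeled embedding closest to the query."""
    return max(
        ((label, cosine(query, vec)) for label, vec in labeled),
        key=lambda pair: pair[1],
    )

# Stand-in for the embedding of "I do not like football".
query = [0.85, 0.15, -0.45]
label, score = classify(query, labeled)
print(label, round(score, 3))

# Adding or removing examples is just a list edit, with no retraining step.
labeled.append(("Football - Dislikes", query))
```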

2 Likes

I do like this, in theory.

I tried a quick mash-up by throwing them all together and calculating the centroid, and it works similarly to what I suggested above (I used ChatGPT to generate 5 “likes” and 5 “dislikes” statements).

Much better for combining the sport and preference, to be fair.
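Roughly, the centroid variant averages each label’s embeddings into one point and compares the query against those points. This is a sketch with made-up 3-d vectors standing in for the generated statements; the names are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

# Toy stand-ins (assumption) for embedded "dislikes" and "likes" statements.
examples = {
    "dislikes": [[0.9, 0.1, -0.4], [0.8, 0.2, -0.5]],
    "likes": [[0.9, 0.1, 0.4], [0.8, 0.2, 0.5]],
}
centroids = {label: centroid(vecs) for label, vecs in examples.items()}

query = [0.7, 0.3, -0.6]
scores = {label: cosine(query, c) for label, c in centroids.items()}
best = max(scores, key=scores.get)
print(best, {label: round(s, 3) for label, s in scores.items()})
```

One comparison per label instead of one per example keeps lookups cheap, at the cost of blurring labels whose examples are spread out.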

How do you plan to filter out non-preferential or unrelated queries with this? They would be mixed in with everything else.

1 Like

The “None” case could be inferred by thresholding the correlation.

So if max is less than 0.8 (for example), then declare the “None” case. Just pick a good threshold based on the data you are seeing.

With enough labeled embeddings, you would expect something to pop above 0.8 (or whatever the threshold should be).

With enough labeled data, over time, you can push the threshold up, to say, 0.9.

So this is a data driven approach that needs adjusting based on how much labeled data you have.
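A sketch of the thresholding, again with made-up 3-d vectors. The 0.8 default is only the example value from above, not a recommended setting, and the function name is hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy labeled embeddings (assumption), not real model output.
labeled = [
    ("Football - Dislikes", [0.9, 0.1, -0.4]),
    ("Football - Dislikes", [0.8, 0.2, -0.5]),
    ("Football - Likes", [0.9, 0.1, 0.4]),
    ("Football - Likes", [0.8, 0.2, 0.5]),
]

def classify_with_threshold(query, labeled, threshold=0.8):
    """Return the best label, or None when no score reaches the threshold."""
    best_label, best_score = None, -1.0
    for label, vec in labeled:
        score = cosine(query, vec)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else None

on_topic = [0.85, 0.15, -0.45]   # close to the "dislikes" examples
off_topic = [-0.1, 1.0, 0.0]     # points somewhere else entirely

print(classify_with_threshold(on_topic, labeled))
print(classify_with_threshold(off_topic, labeled))
```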

1 Like

Nice. Yes this does work quite nicely :heart_eyes:

I like this a lot. Much more dynamic.

1 Like

Yes, also this embedding classifier approach costs less than running a fine-tune.

Also, this approach can even be utilized without web access if you run small embedding models locally.

It is also easily scalable to run in parallel (shard each embedding set and run the shards in parallel).

So more dynamic + less FLOPS (smaller hardware footprint) + local options for remote applications + easily scalable.

Also you can hybridize this with multiple embedding engines at the same time and combine the rankings with RRF. You can also run a non-inference based keyword version as absolute worst case backup if all your inference paths are down.
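For reference, a small sketch of Reciprocal Rank Fusion: each document scores the sum of 1 / (k + rank) over every ranking it appears in. The constant k = 60 is the commonly used default; the rankings and document ids here are invented for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical output of two embedding engines and one keyword search.
rankings = [
    ["a", "b", "c"],   # embedding engine 1
    ["b", "a", "d"],   # embedding engine 2
    ["b", "c", "a"],   # keyword fallback
]

fused = reciprocal_rank_fusion(rankings)
print(fused)
```

Because only ranks are used, engines with very different score scales can be combined without calibration, and a missing engine simply contributes nothing.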

So in ops, your system is not reliant on only one embedding engine. It can run several at once, so an outage of any single embedding engine won’t take you down, and if they all fail you can still fall back on your worst-case local non-inference keyword backup. You get as much redundancy as you want and massive uptimes.

So lots more advantages than meets the eye :face_with_monocle:

1 Like

And the powerful analytics from a transparent galactic box of queries already in their respective position :raised_hands: Any edge-cases could be easily routed to an LLM worst case.

I’m an embedding believer. All the way.

What do you think of RSF (Relative Score Fusion)?

In developing these two algorithms, we carried out some internal benchmarks testing recall on a standard (FIQA) dataset. According to our internal benchmarks, the relativeScoreFusion algorithm showed a ~6% improvement in recall over the default rankedFusion method.

Which apparently (haven’t tried this) works well with this cool newish feature of “Autocut”:

In 1.20 we introduced the AutoCut feature, which can intelligently retrieve groups of objects from a search. AutoCut relies on there being natural “clusters” (groups of objects with close scores).

AutoCut works well with relativeScoreFusion, which often results in natural clusters that autocut can detect.

1 Like

It makes sense. It basically keeps the relative spacing of the scores intact instead of just ordering by unit integer distances (so use 1, 0.33, 0 and not 1, 2, 3).

Also should mention, one really cool thing about RSF over RRF is that when you have a case where the embeddings are all similar, but the keyword leg has more differentiation, the keyword leg will determine the winner, which is a trait that I really like!

So you don’t get messed up results if one thing is relatively close, across the board, in its ranking stream.

Also, I like that there is no reciprocal, or division, it’s just a straight up weighted sum. So if you partition the weightings to sum up to 1, you get an absolute score from 0 to 1, which is nice! :ok_hand:
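A sketch of how I read that description: min-max normalize each leg’s scores to the range 0 to 1, then take a weighted sum with weights that add up to 1. This is an assumption about the general idea, not Weaviate’s actual implementation, and the scores are invented.

```python
def min_max(scores):
    """Rescale a {doc_id: score} dict to [0, 1]; all zeros if there is no spread."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}

def relative_score_fusion(legs, weights):
    """Weighted sum of normalized scores; a doc missing from a leg adds 0."""
    fused = {}
    for leg, weight in zip(legs, weights):
        for doc_id, score in min_max(leg).items():
            fused[doc_id] = fused.get(doc_id, 0.0) + weight * score
    return fused

# "a" and "b" are nearly tied in the vector leg; the keyword leg separates them.
vector_scores = {"a": 0.90, "b": 0.89, "c": 0.50}
keyword_scores = {"a": 2.0, "b": 8.0, "c": 1.0}

fused = relative_score_fusion([vector_scores, keyword_scores], [0.5, 0.5])
winner = max(fused, key=fused.get)
print(winner, {doc_id: round(s, 3) for doc_id, s in fused.items()})
```

Here the near-tie between “a” and “b” in the vector leg survives normalization, so the keyword leg decides the winner, which rank-only fusion cannot express.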

Intuitively, it’s a better fusion algorithm, so I’d have to try it out! :rofl:

1 Like

I find that every time I try to use actual scores for fusion, I quickly run into a counter-example where it does the reverse of what I want. Sigh. That Weaviate AutoCut feature sounds pretty cool, though…