How embeddings work

OpenAI's embeddings are a black box and not much documentation is available. I did some testing between Cohere and OpenAI embeddings with the three pieces of content below, and I found that Cohere gives me better control over the similarity score. It would be great to hear other opinions on this; maybe I am not using the OpenAI embeddings correctly. The three texts I have are:

Text 1:
Romwe Women's Plus Size Short Sleeve Surplice Deep V Belted Ruched Mini Party Bodycon Dress
95% Polyester, 5% Spandex
Tie closure
High stretchy material with good softness, comfortable to wear
The party bodycon dress feature with wrap v neck, batwing sleeve, self tie waist and ruched detail
Good choise for party, cocktail, evening, prom, nightout, club and work
The elastic material hugs your figure perfectly and the bodycon cut creates a seductive silhouette
Please refer to the size measurement in image before ordering
Text 2:
Romwe Women's Plus Size Casual Drawstring Twist Front Cut Out V Neck Short Sleeve Summer Sexy Bodycon Dress
100% Polyester
Pull On closure
High stretchy, soft and comfortable
Cut out, drawstring, v neck, high waist mini dresss for women
Good choise for party, cocktail, club, date, work, holiday, casual and formal wear
Keep this formal dress with high heels and additional jewellery for a chic look
Please refer to the size measurement in image before ordering
Text 3:
COOFANDY Men's Muscle Fit Button Down Dress Shirt Long Sleeve
50% Cotton, 48% Polyester, 2% Spandex
Imported
Button closure
Machine Wash
【Wrinkle-Free】 High quality woven fabric, lightweight and breathable, wrinkle free dress shirts with a clean look, keeps your body dry and comfortable all day.
【Soft Cotton Fabric】This long sleeve shirts are light and comfortable to wear. Elastic fabric fits perfectly on all body type and allows greater mobility in any direction with no restriction, making you enjoy activewear levels of comfort and mobility.
【Fashionable Design】Male dress shirts always come in a variety of types. Classic solid/plaid one never goes wrong. Slim fit stylish dress shirt with classic turndown collar, button up closure, long sleeve and metal contrast buttons makes you more handsome and attractive.
【Occasions】You can pair this long sleeve button down shirts with chinos/jeans for casual daily wear, or match the stretchable shirt with dress pants for classy look. This smart shirt is essential in mens wardrobe and greats for all season, Suitable for office, business, date, night out, club, travel and casual daily wear.
【Garment Care】Machine washable. ❤The fabric of this plaid dress shirt differs from one with solid color, which is more elastic. Please refer following size chart in the product description to choose best fit for you

I then used both Cohere and OpenAI to embed them and stored the vectors in Supabase. When I run a cosine similarity search with the question “List slim fit long sleeved shirts for men”, I get the results below:

OpenAI embedding
1st Text - 0.874352702871908
2nd Text - 0.874352702871908
3rd Text - 0.866211994130881

Cohere embedding
1st Text - 0.753884081524262
2nd Text - 0.762143687765075
3rd Text - 0.822315609664169

I have a threshold of 0.79, so with Cohere I am getting the right retrieval.
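
For reference, the scores above can be reproduced roughly like this. This is a minimal sketch that computes cosine similarity locally instead of going through Supabase, and the model names text-embedding-ada-002 and embed-english-v2.0 are assumptions, so substitute whatever you actually use:

```python
import numpy as np
import cohere
from openai import OpenAI

# The three product descriptions from above (truncated here for brevity).
texts = [
    "Romwe Women's Plus Size Short Sleeve Surplice Deep V Belted Ruched Mini Party Bodycon Dress ...",
    "Romwe Women's Plus Size Casual Drawstring Twist Front Cut Out V Neck Short Sleeve Summer Sexy Bodycon Dress ...",
    "COOFANDY Men's Muscle Fit Button Down Dress Shirt Long Sleeve ...",
]
query = "List slim fit long sleeved shirts for men"

def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# OpenAI embeddings (reads OPENAI_API_KEY from the environment).
oa_client = OpenAI()
oa = oa_client.embeddings.create(model="text-embedding-ada-002", input=texts + [query])
oa_vecs = [d.embedding for d in oa.data]
oa_query = oa_vecs.pop()
print("OpenAI:", [round(cosine(v, oa_query), 6) for v in oa_vecs])

# Cohere embeddings.
co = cohere.Client("YOUR_COHERE_API_KEY")
ch = co.embed(texts=texts + [query], model="embed-english-v2.0")
ch_vecs = list(ch.embeddings)
ch_query = ch_vecs.pop()
print("Cohere:", [round(cosine(v, ch_query), 6) for v in ch_vecs])
```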

That level of detail is not what embeddings are good at.
Embeddings are great at capturing the general area of a topic, but that’s across all possible topics.
A “men’s shirt” and a “women’s shirt” are pretty close in embedding space, compared to “race car fuel types” or “the history of turtles.”

If you want an exact match, you should use a database.
You can even pre-process the text with an LLM to extract the appropriate keywords, and then run an exact keyword match against the database.
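
A minimal sketch of that idea, assuming a hypothetical SQLite products table with title and description columns, and assuming the chat model returns a clean JSON array of keywords (the table, prompt, and model choice are placeholders, not a reference implementation):

```python
import json
import sqlite3
from openai import OpenAI

client = OpenAI()

def extract_keywords(question: str) -> list[str]:
    """Ask the model to turn a free-form question into exact search keywords."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Extract product search keywords from the question. "
                        "Reply with a JSON array of lowercase strings only."},
            {"role": "user", "content": question},
        ],
    )
    # Assumes the model follows the instruction and returns pure JSON.
    return json.loads(resp.choices[0].message.content)

# Hypothetical products table: (id, title, description).
conn = sqlite3.connect("products.db")

def keyword_search(question: str):
    keywords = extract_keywords(question)  # e.g. ["slim fit", "long sleeve", "men", "shirt"]
    if not keywords:
        return []
    clauses = " AND ".join("lower(title || ' ' || description) LIKE ?" for _ in keywords)
    params = [f"%{kw}%" for kw in keywords]
    return conn.execute(f"SELECT id, title FROM products WHERE {clauses}", params).fetchall()

print(keyword_search("List slim fit long sleeved shirts for men"))
```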


Thanks for the response. But I see that if I use Cohere embeddings, I get much better results and a good spread of scores. I tested this over a few more documents and it is pretty consistent.

Yes, different embedding models will give different results. I’m sure there are other use cases where OpenAI embeddings will perform better than Cohere.
This is the art of engineering: Evaluate the available tools, and choose the best cost/performance solution for the problem at hand!

Are you sure you meant this? Vectors can be in a database? :sweat_smile: :thinking:

Yes, if you want an exact match, you should use a traditional database! It’s absolutely the right answer for that problem.

If you use vectors, you can still use a database, of course. Postgres has a vector index available through the pgvector extension, for example.

But the whole point of the comment was that, if you’re looking for exact matches (“royal green color skirt”) and particular ranges (“price between $40 and $60”), a traditional database will give you what you want, but an embedding vector match will not.
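
For instance, here is a minimal sketch of that kind of query, assuming a hypothetical Postgres/Supabase products table with color, price, and a pgvector embedding column (table, column, and connection names are made up). The exact match and range filter are plain SQL; the vector index only comes in if you also want to rank the surviving rows by similarity:

```python
import psycopg2

# Hypothetical Postgres/Supabase connection and products table
# (id, title, color, price, embedding vector(1536)).
conn = psycopg2.connect("dbname=shop user=postgres")

# Placeholder for the query embedding, computed elsewhere (e.g. with ada-002).
query_embedding = [0.0] * 1536
embedding_literal = "[" + ",".join(map(str, query_embedding)) + "]"

sql = """
    SELECT id, title, price
    FROM products
    WHERE color = %s                      -- exact match: easy for SQL
      AND price BETWEEN %s AND %s         -- range filter: easy for SQL
    ORDER BY embedding <=> %s::vector     -- optional: rank the exact matches
    LIMIT 10;                             -- by pgvector cosine distance
"""
with conn.cursor() as cur:
    cur.execute(sql, ("royal green", 40, 60, embedding_literal))
    for row in cur.fetchall():
        print(row)
```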


I still think it would be better to say “keyword match” or “traditional string search” … the DB part is moot. But ok …