Help with embeddings and semantic search

Hello,
I have an issue. I’m using the ChromaDB database, where I store the embeddings of my products.

When I perform a semantic search using numbers, it doesn’t find the product I’m looking for.

Has anyone had a similar problem and been able to solve it?

Can you give a more detailed MWE of the issue you’re having?

Specifically, an example of something you’re searching, what it’s returning, and what you expect it to return.

It would also help to know which embedding model you’re using, the similarity scores of your top matches, and the scores of your ideal matches.

Also, what is the context of the numbers you’re searching? Are they document IDs, page numbers, section numbers, or some other type of identifier, or are they just numeric values inside a document?

My initial thought is you may want to use some type of hybrid search where you first filter down to the most likely relevant documents via keywords then use semantics to find the appropriate embeddings from within that smaller subset.
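As a rough illustration of the two-stage idea above, here is a minimal sketch in plain Python. The document layout (`text` and `vector` keys), the toy two-dimensional vectors, and the function names are all assumptions made up for the example; in practice the vectors would come from your embedding model and the keyword stage from your database's own filtering.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(query_keywords, query_vector, docs, top_k=3):
    """Filter by keyword first, then rank the survivors by similarity."""
    keywords = [k.lower() for k in query_keywords]
    # Stage 1: keep only documents whose text mentions at least one keyword.
    candidates = [
        d for d in docs if any(k in d["text"].lower() for k in keywords)
    ]
    # Stage 2: rank the smaller candidate set by cosine similarity.
    ranked = sorted(
        candidates,
        key=lambda d: cosine(query_vector, d["vector"]),
        reverse=True,
    )
    return ranked[:top_k]


# Toy data: hypothetical products with made-up 2-d "embeddings".
docs = [
    {"id": "a", "text": "Pink water bottle", "vector": [0.9, 0.1]},
    {"id": "b", "text": "Pink phone case", "vector": [0.2, 0.8]},
    {"id": "c", "text": "Blue water bottle", "vector": [0.95, 0.05]},
]

results = hybrid_search(["pink"], [1.0, 0.0], docs)
print([d["id"] for d in results])
```

Document "c" is the closest vector overall, but it never reaches the ranking stage because it fails the keyword filter.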

I have a list of products that have various features, and one of them is the price. When I perform a search based on price, it doesn’t provide products with prices that are approximately or exactly what is being requested.

I did this by creating embeddings with LangChain for each product and saving them in Chroma DB.

Are you only going to search by price, or by text as well? I don’t think an embedding DB makes sense for a similarity search based on a numeric value. Is your data totally unstructured? It could make more sense to try to leverage LLMs to go from unstructured to structured data, rather than using embedding lookup.

If you are trying to search products by their properties you are looking for a typical SQL-like database.

As @RonaldGRuckus said, embedding models capture semantics and meaning. Numerical comparison is in the domain of traditional computing.

So you would take in a query of “What costs close to $5 and is pink?”, and you would break this down into a database query that intersects items priced near the $5 mark with items that have pink as an attribute, so no inference or semantics is required.

The hardest thing is breaking these phrases into queries, but there are folks on this forum with extensive experience on this topic.
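For the database side of this, a minimal sketch using Python's built-in `sqlite3` module might look like the following. The `products` table, its columns, the sample rows, and the `find_near_price` helper are all hypothetical, and the price tolerance is an arbitrary choice. Turning the natural-language question into the `color`, `target`, and `tolerance` arguments is the hard part mentioned above and is not shown here.

```python
import sqlite3

# In-memory database with a hypothetical product table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, color TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        ("Pink notebook", "pink", 4.50),
        ("Pink mug", "pink", 12.00),
        ("Blue pen", "blue", 5.00),
        ("Pink eraser", "pink", 5.25),
    ],
)


def find_near_price(conn, color, target, tolerance):
    """Return (name, price) rows of the given color within +/- tolerance
    of the target price, closest price first."""
    return conn.execute(
        "SELECT name, price FROM products "
        "WHERE color = ? AND price BETWEEN ? AND ? "
        "ORDER BY ABS(price - ?), name",
        (color, target - tolerance, target + tolerance, target),
    ).fetchall()


# "What costs close to $5 and is pink?" with an assumed tolerance of $1.
matches = find_near_price(conn, "pink", 5.0, 1.0)
print(matches)
```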

I am also facing this issue. I have a golf course dataset, and most of the values are in numeric form, like longitude, layoutHoles, curseName, etc.

I’m storing the embeddings in a DataStax vector DB, but when I perform a semantic search using either numbers or text, it doesn’t show the results I’m looking for.

Could anyone tell me what I should do for numeric embeddings, so that they capture the semantic meaning without adding noise to the embeddings?

There are a lot of things that could be going wrong.

Look at the correlation values. What are they?

Depending on the exact model, you may have to throw out hits that are less than a certain threshold.
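Thresholding can be as simple as the sketch below. The hit IDs, the scores, and the 0.75 cutoff are invented for illustration; a sensible cutoff depends on the embedding model and has to be found by inspecting real scores. This assumes higher scores mean closer matches. If your vector store returns a distance instead, the comparison flips.

```python
def filter_hits(hits, min_score):
    """Keep only (doc_id, score) pairs whose similarity meets the cutoff."""
    return [(doc_id, score) for doc_id, score in hits if score >= min_score]


# Hypothetical search results as (doc_id, similarity) pairs.
hits = [("course-12", 0.83), ("course-07", 0.79), ("course-31", 0.41)]

kept = filter_hits(hits, min_score=0.75)
print(kept)
```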

Also, does your input line up with your targets? If not, you may have to transform the input, using HyDE, and then use this transformation in your correlation, instead of the raw input text.
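The HyDE step can be sketched roughly as follows: instead of embedding the raw question, you ask an LLM to write a hypothetical document that would answer it, then embed that text and search with the resulting vector. The `generate` and `embed` callables are placeholders for whatever LLM and embedding model you use; the prompt wording and the toy stand-ins below are assumptions for the example only.

```python
from typing import Callable, Sequence


def hyde_query_vector(
    question: str,
    generate: Callable[[str], str],
    embed: Callable[[str], Sequence[float]],
) -> Sequence[float]:
    """Embed a generated hypothetical answer instead of the raw question."""
    prompt = (
        "Write a short passage that answers the question.\n"
        f"Question: {question}\n"
        "Passage:"
    )
    hypothetical_doc = generate(prompt)  # LLM call in a real system
    return embed(hypothetical_doc)       # embedding-model call in a real system


# Toy stand-ins so the sketch runs without any external service.
def fake_generate(prompt: str) -> str:
    return "A golf course with 18 holes."


def fake_embed(text: str) -> Sequence[float]:
    return [float(len(text)), float(text.count("hole"))]


vector = hyde_query_vector("Which courses have 18 holes?", fake_generate, fake_embed)
# `vector` is what you would pass to the vector store as the query embedding.
```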