Search To Find Closest Match While Excluding Low Probability Results

I have a list of features and I’d like to find the feature from the list that’s the closest match with the contents of a user’s email. If there isn’t a close match, I want to know that so I can add a new feature to the list instead.

I’ve used the APIs quite a bit but never for search. It sounds like I might be able to use logprobs for this but I’m not sure how to make sense of the logprobs output for this purpose.

Am I on the right track here or is there a different approach that I should try?

It’s possible to do this with an LLM, but you may have better luck with a vector database: embed each “feature”, run a query against that database, and apply a cut-off similarity score so that you can append new suggestions if they are not “similar” enough.
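The cut-off idea can be sketched in a few lines, assuming you already have embeddings in hand (all names here are hypothetical, and a real setup would use a vector database rather than an in-memory loop):

```typescript
// Sketch: match a query embedding against stored feature embeddings
// using cosine similarity, with a cut-off below which the query is
// treated as a brand-new feature.

type Feature = { name: string; embedding: number[] };

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns the closest feature, or null if nothing clears the
// threshold (i.e. the email likely describes a new feature).
function closestFeature(
  query: number[],
  features: Feature[],
  threshold: number
): Feature | null {
  let best: Feature | null = null;
  let bestScore = -Infinity;
  for (const f of features) {
    const score = cosineSimilarity(query, f.embedding);
    if (score > bestScore) {
      bestScore = score;
      best = f;
    }
  }
  return bestScore >= threshold ? best : null;
}
```

A `null` return is your signal to add the proposed feature to the list instead of matching it.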


Thank you for the quick reply! So this sort of approach would be better instead?

I’m comfortable creating the embeddings and searching the database. The piece that I’m not sure about there is how to access the similarity score so that I can set a threshold for when an email should be matched with a feature.

I think you should include it as an option; it kind of depends on whether you expect to have a “LOT” of entries in your features db.

If you are only going to have 10-20 features then sure… you can do that with the AI, but a language model suffers many of the same failings as the human brain: long lists can suffer from a lack of attention, and things can get missed.

A vector database, I think, is the better fit, as it addresses your need directly while offering the traditional computational benefit of greater accuracy over large datasets.


I see, I’m expecting to have less than 100 entries in the features database.

So if I was querying a vector database like this (example from here), would I be able to do that by altering the similarity threshold?

    // Find the single closest feature, subject to a similarity threshold
    const { data: features, error } = await supabaseAdmin.rpc("features_search", {
      query_embedding: embedding,
      similarity_threshold: 0.01,
      match_count: 1
    });

And I could just adjust the similarity threshold until I’m finding good matches?

Exactly that. Each embedding gets placed at a location in latent space. You then create a search vector containing your new proposed feature and use a bit of math to calculate the distance between your search term and each entry in the database. If none of the distances are small, you can assume that the new feature is not present in the dataset.
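The distance check described above can be sketched like this (Euclidean distance is used purely for illustration; real vector databases typically use cosine distance and report it as a similarity score):

```typescript
// Sketch: a proposed feature counts as "new" only when no stored
// embedding lies within the cut-off distance of it.

function euclideanDistance(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return Math.sqrt(sum);
}

function isNewFeature(
  query: number[],
  stored: number[][],
  maxDistance: number
): boolean {
  // New only if every stored embedding is farther away than the cut-off.
  return stored.every((e) => euclideanDistance(query, e) > maxDistance);
}
```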

Many of the vector database packages out there abstract away the math and the db handling and just give you a nice “similarity” score, as in the example.


Brilliant, thanks a lot for the insights and guidance, great to know what to look for!


@Foxalabs hoping I can draw on your expertise for one more question. In your experience, is it better to extract the text you’re searching the database for from a larger block of text, or is it better to include the whole block of text as context for the search?

So if I’m searching my features list based on the content of emails, should I ask the AI to extract a list of features that the customer has mentioned and then search my database for those or just search based on the entire email?

My instinct is that more precise search terms, without the noise of the rest of the email content, will probably work better in terms of finding an exact match. But given that extracting the list of features removes some of the context and nuance from the customer’s request, the matches might be less accurate. Or is there no general rule here?

My concern is that the similarity score will be reduced if the email is longer and contains more irrelevant information, even if the customer’s feature request is specific enough to be matched with a feature on its own. But I’m not sure if that’s how this type of search works.

I got the search working, by the way; it’s crazy how little code you need for something like this now.

I find that the more specific you can be, the more defined the “location” of the embeddings is, so you are more likely to get a close match with similar semantic meaning in search terms.

An embedding of a large block of text can often be hit and miss as to which parts of that text contribute to the vector. Take a block of text about cats and clouds: should the vector be in the “cat” space or the “cloud” space? More likely it will be in the “cloudy cats” space… which means that when searching for “cats” you may miss an embedding because the semantics don’t align well.

Long story short, I try to keep my embedded chunks as small as possible. One method for this, which can be a little expensive but very useful, is to summarise longer text and embed the summary, since it’s a shorter, more “dense” semantic chunk; then include a link to the full source text within the embedding’s metadata. You can then perform a search and pull back the full text for context, should you want to subsequently feed it into an LLM for further processing. I realise that is not your use-case right now, but hopefully the methodology makes sense and you can pick and choose which parts of it you make use of.
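The summarise-then-embed pattern above can be sketched roughly like this, with hypothetical names and an in-memory store standing in for a real vector database:

```typescript
// Sketch: embed only a summary, but keep a pointer to the full source
// text in the metadata so it can be retrieved after a match.

type EmbeddedChunk = {
  embedding: number[];
  metadata: { sourceId: string };
};

const fullTexts = new Map<string, string>(); // sourceId -> full text
const index: EmbeddedChunk[] = [];           // searchable embeddings

function storeDocument(
  sourceId: string,
  fullText: string,
  summaryEmbedding: number[]
): void {
  fullTexts.set(sourceId, fullText);
  index.push({ embedding: summaryEmbedding, metadata: { sourceId } });
}

// After a similarity search returns a matching chunk, follow the
// metadata link back to the full source text.
function fullTextForMatch(match: EmbeddedChunk): string | undefined {
  return fullTexts.get(match.metadata.sourceId);
}
```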


Great, it sounds like extracting the list of features and then searching for them is the way to go for me then. Much appreciated!


This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.