Hybrid search for segmentation

Hi,

We want to use hybrid search for segmentation of a large number of pages.
Imagine a query asking "give me all the patients who took medication X".
The goal is to get back ALL the possible pages, not just the top-k.

If we have a big list of similarity scores calculated from hybrid search, how do we choose the cut-off/threshold?
Depending on the query, the number of correct pages could be 10, or 1M, or …

Please share any suggestions you might have.
Thank you

This would be a simple database query.

Where are you thinking the unstructured semantics would be useful here?

3 Likes

Do you think you might be better served with a SQL query? :thinking:

You’re right, top K is a super craptastic parameter - there are methods out there that don’t have a top K, but they’re a little more involved.

I’d go with a tool or tool-like approach, with a SQL or text search to look up “medication X”, and maybe use a fuzzy approach as a backup to identify what “medication X” could be if you get zero records, or if you want to add ancillary information related to the search.
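For the fuzzy backup, even something as simple as stdlib difflib gets you a long way. Just a sketch (the medication vocabulary here is a placeholder):

```python
import difflib

# Toy medication vocabulary; in practice this would come from a proper drug list.
known_meds = ["metformin", "metoprolol", "methotrexate", "lisinopril"]

query_term = "metfromin"  # misspelled term pulled out of the user's question
candidates = difflib.get_close_matches(query_term.lower(), known_meds, n=3, cutoff=0.7)
print(candidates)  # ['metformin'] -> retry the exact lookup with this spelling
```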

3 Likes

Thanks @anon10827405 and @Diet for your suggestion.

I ruined my question by giving a bad example query!

The queries cannot easily be converted into SQL calls, and even if they could, the data is not in a table-like structured form.

The queries are like “Identify patients with hypertension who take Metformin and have undergone cardiac surgery and report reduced arrhythmia without requiring medication escalation within six months.”

In this example, hypertension could appear in the text as “high blood pressure” or “189/90 BP”, etc. Metformin implies the patient is diabetic, but diabetes is not explicitly mentioned.
Also, there is no database with all possible keywords to capture the existence of conditions/medications.

That is why we wanted to add a semantic component to it.

I hope that explains the complexity.
If you have further suggestions, please let me know.

Thanks

2 Likes

Would you consider first structuring this data into a database for easier lookups? Or are you trying to find only these specific data points in a vast amount of unstructured information?

2 Likes

No, there are millions of pages of data and numerous queries, many of which are unpredictable—similar to a chat-based interaction with your PDFs in a RAG application. Converting the unstructured text into a database isn’t feasible, as the data representation varies significantly.

1 Like

If you are planning to run numerous queries on this massive amount of information, I’d argue that organization & structure are critical here. AI can help you tremendously in performing all of this.

Your data doesn’t necessarily need to follow a strict schema. Let’s find a common denominator here: people with X, or Y, or w/e. Or simply people and some medical information. This is a sufficiently abstract level of information you want.

You can scan through this massive amount of data, filtering first with simple classification techniques to decide whether a page is worth parsing. This should reduce the text quite a bit.
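Just to sketch that first pass (the screening terms here are placeholders; you’d pull them from a medical vocabulary relevant to your queries):

```python
import re

# Hypothetical screening patterns; in practice these would come from a
# vocabulary of drug names, condition synonyms, etc.
SCREEN_PATTERNS = [
    re.compile(r"\bmetformin\b", re.I),
    re.compile(r"\bhypertension\b|\bhigh blood pressure\b", re.I),
    re.compile(r"\b\d{2,3}\s*/\s*\d{2,3}\s*(bp|mmhg)\b", re.I),  # e.g. "189/90 BP"
]

def worth_parsing(page_text: str) -> bool:
    """Cheap first-pass filter: keep a page if any screening pattern matches."""
    return any(p.search(page_text) for p in SCREEN_PATTERNS)

pages = [
    "Patient reports high blood pressure, 189/90 BP, started Metformin.",
    "Invoice for facility maintenance, Q3.",
]
candidates = [p for p in pages if worth_parsing(p)]  # only the first page survives
```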

Then, with the reduced information, you can employ more powerful AI: something that can build a profile of each person and their passage.

Next, you can use something like embeddings to understand if this character profile is even related to healthcare.

So, it’s an iterative process. You have a freaking MASSIVE rock, it’s time to chip away and make it valuable. Having strong character profiles in the industry that you’re involved in may be worth building a database for. Who knows what extra gold nuggets are waiting there.

2 Likes

I think this is a very interesting idea!
Do you have any suggestion for the last step? For example, after embedding and getting similarity scores for the remaining ~500K docs left after iterative filtering, how do we set a threshold? That is, how much of this final list should we keep for a given query?

Thanks a lot @anon10827405!

I saw @Diet typing, I’ll wait for their insight as well.
Thanks

1 Like

That’s something you could use LLMs for :thinking:

There’s no free lunch here, and embedding models (which themselves are also LLMs) can only encode so many concepts in a vector simply due to a lack of attention. There are theoretical approaches that let you reconstruct a document vector based on chunked partials, but I’d classify that as experimental.

It’s possible that a reading like “189/90 BP” translates into a high-blood-pressure dimension, but it may only be “seen” and encoded as “high blood pressure” if contextual information indicates that it’s important; this also depends on the embedding model you’re using.

Alternatively, you might use this knowledge to programmatically search for documents that match a pattern like `([0-9]+)/([0-9]+)[Bb][Pp]` => ishighbp($1, $2), and take the union with your other high-BP results.
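Just to illustrate the idea (the pattern and the cutoffs are rough assumptions, not clinical rules):

```python
import re

# Rough pattern for readings like "189/90 BP"; an approximation of the idea
# above, not a validated clinical rule.
BP_RE = re.compile(r"\b(\d{2,3})\s*/\s*(\d{2,3})\s*[Bb][Pp]\b")

def is_high_bp(text: str, sys_cutoff: int = 140, dia_cutoff: int = 90) -> bool:
    """Flag a page if any detected reading exceeds the (assumed) cutoffs."""
    return any(
        int(m.group(1)) >= sys_cutoff or int(m.group(2)) >= dia_cutoff
        for m in BP_RE.finditer(text)
    )

pages = {
    "doc_17": "Long history of hypertension, well controlled on lisinopril.",
    "doc_99": "Intake reading 189/90 BP, patient asymptomatic.",
}
semantic_hits = {"doc_17"}  # whatever your embedding search returned for "hypertension"
pattern_hits = {doc_id for doc_id, text in pages.items() if is_high_bp(text)}
high_bp_docs = semantic_hits | pattern_hits  # the union mentioned above
```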

And this operational knowledge is what your system would need to “learn” - one approach is similar to what openai does with chatgpt’s “memory” system.

You asked about top K earlier; a slightly less naive approach would be to use a cosine similarity cutoff, but the right cutoff really depends on the model.
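Something along these lines, keeping everything above the threshold instead of a fixed K (the model name and the 0.35 cutoff are placeholders; you’d calibrate the number on labeled pages for whatever embedding model you end up using):

```python
from openai import OpenAI
import numpy as np

client = OpenAI()
MODEL = "text-embedding-3-small"  # placeholder; use your chosen model

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model=MODEL, input=texts)
    return np.array([d.embedding for d in resp.data])

query = "patients with hypertension who take Metformin"
pages = [
    "Pt on Metformin, 189/90 BP at intake.",
    "Facility maintenance invoice, Q3.",
]

q = embed([query])[0]
P = embed(pages)
scores = P @ q  # these embeddings are unit-length, so dot product == cosine similarity

THRESHOLD = 0.35  # placeholder; tune on a labeled sample, it varies per model
keep = [p for p, s in zip(pages, scores) if s >= THRESHOLD]
```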


It’s a super interesting project and you’ve got your work cut out for you! If you stick to it I’m fairly confident you’ll succeed! Good luck!

3 Likes

Thanks for the suggestions @Diet!
The GPT memory idea is similar to what Ronald said as well. Could be interesting to try.

Thank you

1 Like

I use a cosine similarity threshold, but that won’t cover all hybrid results.

You then have to consider how you “promote” strong keyword results and rank them alongside the semantic results during a re-ranking step.

I think the whole point of hybrid is that you are trying to promote some good keyword search results. If those results scored low on cosine similarity, it’s no good just ranking the combined set by cosine similarity, because you will surely lose some of the good keyword results below your set-size cut-off?
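One common way to merge the two lists without dropping strong keyword hits is rank-based fusion, e.g. reciprocal rank fusion. A minimal sketch:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked doc-id lists by summing 1/(k + rank); a strong keyword hit
    keeps a decent fused score even if its cosine similarity rank was mediocre."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["doc_3", "doc_8", "doc_1"]  # ranked by cosine similarity
keyword = ["doc_9", "doc_3", "doc_4"]   # ranked by BM25 / exact keyword match
fused = reciprocal_rank_fusion([semantic, keyword])
# doc_9 stays near the top even if its embedding score alone would have cut it
```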

2 Likes

The instructor library could be a massive help here. I have found it instrumental in helping me extract structure from unstructured text reliably, using Pydantic classes to inform the model about what it is supposed to be doing.
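For example (just a sketch: the schema fields and model name are placeholders, and I’m assuming the `instructor.from_openai` entry point):

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

# Hypothetical target schema; the fields are placeholders for whatever you
# actually need to pull out of each page.
class PatientRecord(BaseModel):
    conditions: list[str]   # e.g. ["hypertension"], even if written as "189/90 BP"
    medications: list[str]
    procedures: list[str]

client = instructor.from_openai(OpenAI())

record = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; pick your own model
    response_model=PatientRecord,
    messages=[{"role": "user",
               "content": "Pt 189/90 BP, on Metformin, s/p CABG 2021."}],
)
print(record.conditions, record.medications, record.procedures)
```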

1 Like