Hybrid search for segmentation

Hi,

We want to use hybrid search for segmentation of a large number of pages.
Imagine a query asking "give me all the patients who took medication X".
The goal is to get back ALL the possible pages, not just the top-k.

If we have a big list of similarity scores calculated from hybrid search, how do we choose the cut-off/threshold?
Depending on the query, the number of correct pages could be 10, or 1M, or …

Please share any suggestions you might have.
Thank you

This would be a simple database query.

Where are you thinking the unstructured semantics would be useful here?

3 Likes

Do you think you might be better served with a SQL query? :thinking:

You’re right, top K is a super craptastic parameter - there are methods out there that don’t have a top K, but they’re a little more involved.

I’d go with a tool or tool-like approach, with a SQL or text search to look up “medication X”, and maybe use a fuzzy approach as a backup to identify what “medication X” could be if you get zero records, or if you want to add ancillary information related to the search.
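For the fuzzy backup, even something as simple as stdlib difflib gets you a long way. Just a sketch (the medication vocabulary here is a placeholder):

```python
import difflib

# Toy medication vocabulary; in practice this would come from a proper drug list.
known_meds = ["metformin", "metoprolol", "methotrexate", "lisinopril"]

query_term = "metfromin"  # misspelled term pulled out of the user's question
candidates = difflib.get_close_matches(query_term.lower(), known_meds, n=3, cutoff=0.7)
print(candidates)  # ['metformin'] -> retry the exact lookup with this spelling
```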

3 Likes

Thanks @anon10827405 and @Diet for your suggestion.

I ruined my question by giving a bad example query!

The queries cannot easily be converted into SQL calls, and even if they could, the data is not in a table-like structured form.

The queries are like “Identify patients with hypertension who take Metformin and have undergone cardiac surgery and report reduced arrhythmia without requiring medication escalation within six months.”

In this example, hypertension could appear in the text as “high blood pressure” or “189/90 BP”, etc. Metformin implies the patient is diabetic, but diabetes is not explicitly mentioned.
Also, there is no database with all possible keywords to capture the existence of conditions/medications.

That is why we wanted to add a semantic component to it.

I hope that explains the complexity.
If you have further suggestions, please let me know.

Thanks

2 Likes

Would you consider first structuring this data into a database for easier lookups? Or are you trying to find only these specific data points in a vast amount of unstructured information?

2 Likes

No, there are millions of pages of data and numerous queries, many of which are unpredictable—similar to a chat-based interaction with your PDFs in a RAG application. Converting the unstructured text into a database isn’t feasible, as the data representation varies significantly.

1 Like

If you are planning to run numerous queries on this massive amount of information, I’d argue that organization & structure are critical here. AI can help you tremendously in performing all of this.

Your data doesn’t necessarily need to follow a strict schema. Let’s find a common denominator here: people with X, or Y, or w/e. Or simply people and some medical information. This is a sufficiently abstract level of information you want.

You can scan through this massive amount of data, filtering first with simple classification techniques to decide whether a page is worth parsing. This should reduce the text quite a bit.
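Just to sketch that first pass (the screening terms here are placeholders; you’d pull them from a medical vocabulary relevant to your queries):

```python
import re

# Hypothetical screening patterns; in practice these would come from a
# vocabulary of drug names, condition synonyms, etc.
SCREEN_PATTERNS = [
    re.compile(r"\bmetformin\b", re.I),
    re.compile(r"\bhypertension\b|\bhigh blood pressure\b", re.I),
    re.compile(r"\b\d{2,3}\s*/\s*\d{2,3}\s*(bp|mmhg)\b", re.I),  # e.g. "189/90 BP"
]

def worth_parsing(page_text: str) -> bool:
    """Cheap first-pass filter: keep a page if any screening pattern matches."""
    return any(p.search(page_text) for p in SCREEN_PATTERNS)

pages = [
    "Patient reports high blood pressure, 189/90 BP, started Metformin.",
    "Invoice for facility maintenance, Q3.",
]
candidates = [p for p in pages if worth_parsing(p)]  # only the first page survives
```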

Then, with the reduced information, you can employ more powerful AI: something that can build a profile of each person and their passage.

Next, you can use something like embeddings to understand if this character profile is even related to healthcare.

So, it’s an iterative process. You have a freaking MASSIVE rock, it’s time to chip away and make it valuable. Having strong character profiles in the industry that you’re involved in may be worth building a database for. Who knows what extra gold nuggets are waiting there.

2 Likes

I think this is a very interesting idea!
Do you have any suggestion for the last step? For example, after embedding and getting similarity scores for the remaining ~500K docs left after iterative filtering, how do we set a threshold? That is, how much of this final list should we keep for a given query?

Thanks a lot @anon10827405!

I saw @Diet typing, I’ll wait for their insight as well.
Thanks

1 Like

That’s something you could use LLMs for :thinking:

There’s no free lunch here, and embedding models (which themselves are also LLMs) can only encode so many concepts in a vector simply due to a lack of attention. There are theoretical approaches that let you reconstruct a document vector based on chunked partials, but I’d classify that as experimental.

It’s possible that a reading like “189/90 BP” translates into a high-blood-pressure dimension, but it may only be “seen” and encoded as “high blood pressure” if contextual information indicates that it’s important; this also depends on the embedding model you’re using.

Alternatively, you might use this knowledge to programmatically search for documents that match a pattern like `([0-9]+)/([0-9]+)[Bb][Pp]` => ishighbp($1, $2), and take the union with your other high-BP results.
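Just to illustrate the idea (the pattern and the cutoffs are rough assumptions, not clinical rules):

```python
import re

# Rough pattern for readings like "189/90 BP"; an approximation of the idea
# above, not a validated clinical rule.
BP_RE = re.compile(r"\b(\d{2,3})\s*/\s*(\d{2,3})\s*[Bb][Pp]\b")

def is_high_bp(text: str, sys_cutoff: int = 140, dia_cutoff: int = 90) -> bool:
    """Flag a page if any detected reading exceeds the (assumed) cutoffs."""
    return any(
        int(m.group(1)) >= sys_cutoff or int(m.group(2)) >= dia_cutoff
        for m in BP_RE.finditer(text)
    )

pages = {
    "doc_17": "Long history of hypertension, well controlled on lisinopril.",
    "doc_99": "Intake reading 189/90 BP, patient asymptomatic.",
}
semantic_hits = {"doc_17"}  # whatever your embedding search returned for "hypertension"
pattern_hits = {doc_id for doc_id, text in pages.items() if is_high_bp(text)}
high_bp_docs = semantic_hits | pattern_hits  # the union mentioned above
```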

And this operational knowledge is what your system would need to “learn” - one approach is similar to what openai does with chatgpt’s “memory” system.

You asked about top K earlier; a slightly less naive approach would be to use a cosine similarity cutoff, but the right cutoff really depends on the model.
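Something along these lines, keeping everything above the threshold instead of a fixed K (the model name and the 0.35 cutoff are placeholders; you’d calibrate the number on labeled pages for whatever embedding model you end up using):

```python
from openai import OpenAI
import numpy as np

client = OpenAI()
MODEL = "text-embedding-3-small"  # placeholder; use your chosen model

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model=MODEL, input=texts)
    return np.array([d.embedding for d in resp.data])

query = "patients with hypertension who take Metformin"
pages = [
    "Pt on Metformin, 189/90 BP at intake.",
    "Facility maintenance invoice, Q3.",
]

q = embed([query])[0]
P = embed(pages)
scores = P @ q  # these embeddings are unit-length, so dot product == cosine similarity

THRESHOLD = 0.35  # placeholder; tune on a labeled sample, it varies per model
keep = [p for p, s in zip(pages, scores) if s >= THRESHOLD]
```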


It’s a super interesting project and you’ve got your work cut out for you! If you stick to it I’m fairly confident you’ll succeed! Good luck!

3 Likes

Thanks for the suggestions @Diet!
The GPT memory idea is similar to what Ronald said as well. Could be interesting to try.

Thank you

1 Like

I use a cosine similarity threshold, but that won’t cover all hybrid results.

You then have to consider how you “promote” strong keyword results and rank them alongside the semantic results during a re-ranking step.

I think the whole point of hybrid is that you are trying to promote some good keyword search results. If those results scored low on cosine similarity, it’s no good just ranking the combined set by cosine similarity, because you will surely lose some of the good keyword results below your set-size cut-off?
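One common way to merge the two lists without dropping strong keyword hits is rank-based fusion, e.g. reciprocal rank fusion. A minimal sketch:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked doc-id lists by summing 1/(k + rank); a strong keyword hit
    keeps a decent fused score even if its cosine similarity rank was mediocre."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["doc_3", "doc_8", "doc_1"]  # ranked by cosine similarity
keyword = ["doc_9", "doc_3", "doc_4"]   # ranked by BM25 / exact keyword match
fused = reciprocal_rank_fusion([semantic, keyword])
# doc_9 stays near the top even if its embedding score alone would have cut it
```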

2 Likes

The instructor library could be a massive help here. I have found it instrumental in helping me extract structure from unstructured text reliably, using Pydantic classes to inform the model about what it is supposed to be doing.
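For example (just a sketch: the schema fields and model name are placeholders, and I’m assuming the `instructor.from_openai` entry point):

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

# Hypothetical target schema; the fields are placeholders for whatever you
# actually need to pull out of each page.
class PatientRecord(BaseModel):
    conditions: list[str]   # e.g. ["hypertension"], even if written as "189/90 BP"
    medications: list[str]
    procedures: list[str]

client = instructor.from_openai(OpenAI())

record = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; pick your own model
    response_model=PatientRecord,
    messages=[{"role": "user",
               "content": "Pt 189/90 BP, on Metformin, s/p CABG 2021."}],
)
print(record.conditions, record.medications, record.procedures)
```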

1 Like