How to do embedding search on a knowledge graph DB?

Hi everyone,
I’m building a disease diagnosis demo site based on OpenAI and embedding search, and I’ve run into a problem: I don’t know how to search for a user’s embedding in a knowledge graph DB. It is not a simple embedding search in a vector DB, because I need to look up each symptom mentioned in the user’s question in the graph DB, then aggregate the results to find the other symptoms of a possible disease that the user did not mention.

A typical embedding search seems to look up only one embedding in the vector DB and return a list of records, but I want to query each symptom in the graph DB and also include the relationships between symptoms. In SQL terms, finding the possible diseases would look like: select diseases from db where symptoms='fever' and symptoms='headache', and finding the symptoms not mentioned in the user’s description would look like: select symptoms from db where disease='Rubella' and symptoms not in ('Fever', 'Headache'). I am just a security engineer and know very little about AI, so I don’t know how to do this kind of embedding search in a graph DB using AI technology. Is there an open source framework for this kind of search? (For a detailed description of the process, see this page: ) Thanks!

Hi and welcome to the Developer Forum!

Just spitballing ideas here (ignoring the medical diagnosis side of things, a problematic area that needs to be handled sensitively). You can “aggregate” embeddings: one could average multiple symptom embeddings and then search on that. It would retrieve things that contain all of the elements in each symptom… might be an area to look into.
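To make the averaging idea concrete, here is a minimal sketch in plain Python. The three-dimensional vectors in `SYMPTOM_VECTORS` are made-up stand-ins for real symptom embeddings (which would come from an embedding model and have many more dimensions), and the helper names `average_embedding` and `normalize` are hypothetical.

```python
import math

# Made-up 3-dimensional vectors standing in for real symptom embeddings.
SYMPTOM_VECTORS = {
    "fever": [0.9, 0.1, 0.0],
    "headache": [0.1, 0.9, 0.0],
    "rash": [0.0, 0.2, 0.9],
}


def average_embedding(vectors):
    """Return the element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(components) / count for components in zip(*vectors)]


def normalize(vector):
    """Scale a vector to unit length so it can be used in cosine search."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


# One query vector that represents "fever" and "headache" together.
query = normalize(
    average_embedding([SYMPTOM_VECTORS["fever"], SYMPTOM_VECTORS["headache"]])
)
print(query)
```

The resulting `query` vector would then be used for the similarity search in place of a single-symptom embedding.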


Basic cosine similarity search should work. Each “Medical Condition” has a set of symptoms from which you can create a single embedding. Then any patient will present with a set of symptoms for which you can also create a single embedding.

This seems like a very simple textbook example of how to use cosine similarity. You can rank every Medical Condition from most likely to least likely to be the condition the patient has.
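As a rough illustration of that ranking step, the sketch below scores each condition against the patient vector with cosine similarity and sorts the results. The vectors in `CONDITION_VECTORS` and `patient` are invented toy values, not real embeddings, and `rank_conditions` is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_conditions(patient_vector, condition_vectors):
    """Return (condition, score) pairs sorted from most to least similar."""
    scored = [
        (name, cosine_similarity(patient_vector, vector))
        for name, vector in condition_vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented toy vectors: one embedding per condition, built from its symptoms.
CONDITION_VECTORS = {
    "Rubella": [0.6, 0.3, 0.7],
    "Influenza": [0.8, 0.6, 0.0],
    "Migraine": [0.0, 1.0, 0.0],
}
patient = [0.9, 0.4, 0.1]  # single embedding of the patient's symptom list

ranking = rank_conditions(patient, CONDITION_VECTORS)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With real embeddings, a vector database would normally do this scoring for you; the loop only shows what the ranking means.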

@wclayf thank you for the advice. I think the result of this solution would not be accurate enough. First, the bot needs to decide which diseases to list based on the user’s question; that part can use embedding similarity search. But if two or more diseases match the user’s symptoms, the bot should be able to get the other symptoms of each disease and ask the user about them to narrow down the diagnosis. So we need to be able to tell which symptoms are included in the user’s question and which are not.

@Foxabilo, thank you for the advice. I will update the question title and description soon. Would you please elaborate on what you mean by “aggregating” embeddings and averaging symptom embeddings? Or can you recommend some articles about it?

Using the embedding search, you can find the top 10 or 20 most likely conditions, then create a union of all their other symptoms that the patient hasn’t yet mentioned, and present it to the patient so they can add more to their list. Once the user updates their list of symptoms, you feed it in again and hopefully get back a smaller list. It will be accurate only to the extent that “Semantic Space” itself is “accurate”.

Yes, I can get the most likely conditions, but I don’t know how to get the union of all the other (unmentioned) symptoms. For example, with Rubella in the most likely list, the patient may say: “my body temperature is very high, I feel hot”. That is in fact fever, but recognizing it requires semantic analysis. Do I need another embedding comparison between the “fever” symptom and the patient’s embedding? Then I would need to compare the patient’s embedding with every symptom of each disease :sweat_smile:.

The patient comes in with a list of N symptoms. You create a single embedding from those N, which you search on to get the top 10 conditions. Then you take all symptoms of those 10, union them into a set, and present that to the patient as a pick list so they can pick more. Now they have N+X symptoms. Then you create an embedding for N+X, and pull up the 10 closest matches to that.

You can of course at that point perhaps feed in the medical text of all 10 conditions, and simply let the patient (or doctor) have a chat conversation with GPT including all that medical text as the “context” to try to narrow down to a final diagnosis.
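The pick-list step described above comes down to a set union followed by a set difference. Here is a small sketch, assuming each condition's symptoms are stored as canonical labels; the `CONDITION_SYMPTOMS` table is illustrative sample data and `follow_up_symptoms` is a hypothetical helper name.

```python
# Illustrative sample data: canonical symptom labels per condition.
CONDITION_SYMPTOMS = {
    "Rubella": {"fever", "headache", "rash", "swollen lymph nodes"},
    "Influenza": {"fever", "headache", "cough", "muscle aches"},
}


def follow_up_symptoms(candidate_conditions, mentioned, condition_symptoms):
    """Union the symptoms of all candidates, minus those already mentioned."""
    all_symptoms = set()
    for condition in candidate_conditions:
        all_symptoms |= condition_symptoms[condition]
    return sorted(all_symptoms - set(mentioned))


# Candidates would come from the embedding search; here they are hard-coded.
pick_list = follow_up_symptoms(
    ["Rubella", "Influenza"], {"fever", "headache"}, CONDITION_SYMPTOMS
)
print(pick_list)
```

Because the patient chooses from canonical labels, the second round does not need to interpret free text such as "I feel hot"; only the first round relies on semantic matching.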


@wclayf thanks, it is a great idea to handle the problem without introducing a more complex KG search: just let the patient choose more symptoms. It is a workable solution :+1::+1:
(But in the future I still want the bot to identify the symptoms of a given condition, because the bot may need to remind the patient about an additional X-ray check.)

Yeah, there are other creative ways to use embeddings. We can also rank them by 1) severity or 2) rareness (in this medical example of symptoms), and try searching on the top 10, 9, 8, 7, etc. symptoms. Or I think it’s also possible to weight them all into a single combined unit vector (in addition to simple averaging), where the top symptoms’ dimensions are weighted more heavily than the less important symptoms.
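One possible reading of the weighted combination is a weighted sum of the symptom vectors, rescaled to unit length. The sketch below assumes that interpretation; the function name `weighted_unit_vector` and the example weights are hypothetical, and how to derive weights from severity or rareness is left open.

```python
import math


def weighted_unit_vector(vectors, weights):
    """Weighted sum of equally sized vectors, rescaled to unit length."""
    if len(vectors) != len(weights):
        raise ValueError("need exactly one weight per vector")
    dims = len(vectors[0])
    combined = [
        sum(weight * vector[i] for vector, weight in zip(vectors, weights))
        for i in range(dims)
    ]
    norm = math.sqrt(sum(x * x for x in combined))
    if norm == 0:
        raise ValueError("combined vector has zero length")
    return [x / norm for x in combined]


# Toy 2-dimensional vectors; the second symptom is weighted more heavily.
combined = weighted_unit_vector([[1.0, 0.0], [0.0, 1.0]], [3.0, 4.0])
print(combined)
```

Setting all weights equal gives the same direction as simple averaging, so this is a generalization of that approach rather than a different search method.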

When querying a knowledge graph, you may actually have a more powerful tool at your disposal than semantic search and vector embeddings.
For example, inferring the disease (class) from its symptoms (attributes) is a straightforward and widely explored way to retrieve information that is fast, cheap and, in the case of LLMs, hallucination-free.
You could try to leverage the capabilities of the language model to create a query for the knowledge graph database and then go from there.
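As a sketch of what such a structured lookup could look like, the example below uses an in-memory SQLite table as a stand-in for a graph database. The `disease_symptom` table, its rows, and both helper functions are assumptions made for illustration. Note that a condition like `symptoms='fever' and symptoms='headache'` can never be true for a single row, so the first query groups by disease and counts matching symptoms instead.

```python
import sqlite3

# In-memory table standing in for disease -> symptom edges (sample data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE disease_symptom (disease TEXT, symptom TEXT)")
conn.executemany(
    "INSERT INTO disease_symptom VALUES (?, ?)",
    [
        ("Rubella", "fever"), ("Rubella", "headache"),
        ("Rubella", "rash"), ("Rubella", "swollen lymph nodes"),
        ("Influenza", "fever"), ("Influenza", "headache"),
        ("Influenza", "cough"),
        ("Migraine", "headache"), ("Migraine", "nausea"),
    ],
)


def diseases_with_all(conn, symptoms):
    """Diseases linked to every symptom in the (distinct) list."""
    marks = ",".join("?" for _ in symptoms)
    sql = (
        f"SELECT disease FROM disease_symptom WHERE symptom IN ({marks}) "
        "GROUP BY disease HAVING COUNT(DISTINCT symptom) = ? ORDER BY disease"
    )
    return [row[0] for row in conn.execute(sql, [*symptoms, len(symptoms)])]


def unmentioned_symptoms(conn, disease, symptoms):
    """Symptoms of one disease that the patient has not mentioned yet."""
    marks = ",".join("?" for _ in symptoms)
    sql = (
        "SELECT symptom FROM disease_symptom "
        f"WHERE disease = ? AND symptom NOT IN ({marks}) ORDER BY symptom"
    )
    return [row[0] for row in conn.execute(sql, [disease, *symptoms])]


mentioned = ["fever", "headache"]
print(diseases_with_all(conn, mentioned))
print(unmentioned_symptoms(conn, "Rubella", mentioned))
```

In this setup the language model's job would be limited to turning the patient's free text into the canonical symptom labels passed in as `mentioned`; the lookup itself is exact.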

Hi @saaspeter. My name is Gaurav Sehgal. I’m a research student at the University of Waterloo. We have developed an open-source embedded graph database called KuzuDB. Currently, we are working on combining the capabilities of knowledge graph and vector search. If possible, I would love to have a chat with you to understand your problem statement better. Your insights would be invaluable to us.

Thank you very much. I will read the KuzuDB documentation in detail later. I am glad to discuss my question with you, and my email is:, but I think I lack basic knowledge about semantic processing, so I still have no idea how to solve my problem. I will learn some NLP basics first, then try to find a solution.

Thank you all for your advice. I will read some articles about basic NLP and combine them with your advice to come up with my own solution. If I cannot find a better solution, I think I will go with @wclayf’s advice.