Factual answers from graph data

Has anyone explored answering questions from graph data?

TLDR, main challenges I’m working through:

  • How to consistently follow the edges for answers that require info from more than a single node
  • How to determine which subset of the graph is needed when the whole prompt has to fit within a 2k token limit
  • Can this be done without a fine-tuned model or do we need to go down that route?

I’ve found that GPT-3 does a reasonably good job of answering questions from a JSON list of nodes/edges in a completion prompt, as long as you provide example data+questions+answers and then pass the actual data+question. However, it doesn’t always follow the correct edges when forming answers. I’m guessing that with more examples covering different cases I could get it there, but I quickly run out of room within the token window.
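For anyone who wants to try the same thing, here is a minimal sketch of how such a few-shot prompt could be assembled. The `Graph:` / `Q:` / `A:` layout and the `nodes`/`edges` field names are my own assumptions, not a required format, and the actual completion call is left out.

```python
import json

def build_prompt(examples, graph, question):
    """Assemble a few-shot completion prompt from example graphs and the real query.

    The "Graph:/Q:/A:" layout is an assumed convention, not a required format.
    """
    parts = []
    for ex in examples:
        # Each example shows the model a graph, a question, and the expected answer.
        parts.append("Graph: " + json.dumps(ex["graph"], sort_keys=True))
        parts.append("Q: " + ex["question"])
        parts.append("A: " + ex["answer"])
        parts.append("")  # blank line between examples
    # The real data and question go last, ending on "A:" for the model to complete.
    parts.append("Graph: " + json.dumps(graph, sort_keys=True))
    parts.append("Q: " + question)
    parts.append("A:")
    return "\n".join(parts)
```

Every example costs tokens, so the number of cases you can demonstrate is bounded by whatever the real graph fragment leaves free in the window.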

Is this a case where a fine-tuned model is needed, or are there other ideas to help reinforce understanding of the relationships?

The other challenge is how to get the right data from a graph that is too large to fit in a 2k token window. I built a fun little set of pre-processing steps that search for nodes relevant to the query, filter the graph by node types mapped to a classifier, and then add in the appropriate edges. It does a reasonable job of paring down the graph to a fragment that better fits within the limit, but I’m curious whether others have better ideas for approaching this as well.
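Roughly, those pre-processing steps could look like the sketch below. It is a simplified stand-in, not my exact pipeline: `allowed_types` plays the role of the classifier output, the keyword match is a naive word overlap, and `estimate_tokens` uses an assumed 4-characters-per-token rule of thumb rather than a real tokenizer.

```python
import json
import re

def estimate_tokens(text):
    # Crude assumption: about 4 characters per token; a real tokenizer is more accurate.
    return len(text) // 4

def prune_graph(nodes, edges, query, allowed_types):
    """Return the fragment of the graph that looks relevant to the query."""
    terms = set(re.findall(r"\w+", query.lower()))
    by_id = {n["id"]: n for n in nodes}
    # Step 1: seed nodes whose label shares at least one word with the query.
    seeds = {n["id"] for n in nodes
             if terms & set(re.findall(r"\w+", n["label"].lower()))}
    keep = set(seeds)
    # Step 2: add direct neighbours of the seeds, but only of the allowed node types.
    for e in edges:
        s, t = e["source"], e["target"]
        if s in seeds and by_id[t]["type"] in allowed_types:
            keep.add(t)
        if t in seeds and by_id[s]["type"] in allowed_types:
            keep.add(s)
    # Step 3: keep only the edges whose two endpoints both survived.
    kept_edges = [e for e in edges if e["source"] in keep and e["target"] in keep]
    return {"nodes": [by_id[i] for i in sorted(keep)], "edges": kept_edges}
```

Calling `estimate_tokens(json.dumps(fragment))` on the result gives a rough check of whether the fragment leaves enough room for the examples and the question.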

I’ve also tried converting the graph edges/nodes to NL and then asking questions directly against that output. This approach seems to show some promise as well, and I will continue to explore it. There are similar challenges around making sure the relationships are respected, but it seems like I could also do this in smaller batches (source node + edge + target node as a “document”) and run it all through the answers endpoint?
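As a rough illustration of the "source node + edge + target node as a document" idea, here is a small sketch. It assumes each edge carries a `relation` string such as `works_for` that reads naturally once underscores become spaces; real edge labels would probably need per-relation templates. Sending the documents to an endpoint is not shown.

```python
def triples_to_documents(nodes, edges):
    """Turn each (source, relation, target) edge into one short NL sentence."""
    labels = {n["id"]: n["label"] for n in nodes}
    docs = []
    for e in edges:
        # Assumed convention: relation names like "works_for" become "works for".
        relation = e["relation"].replace("_", " ")
        docs.append(f"{labels[e['source']]} {relation} {labels[e['target']]}.")
    return docs
```

Each sentence stays self-contained, so documents can be batched or searched independently, though a question that spans two edges still needs both sentences to be retrieved together.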


Fascinating idea.
We worked on a similar paradigm in our application BookMapp, but we are constructing the graph of relatedness, whereas you are approaching this from the other direction.

Is the graph more or less balanced in general? If it is, it makes sense to have a notion of level 1 nodes (starting from top), level 2 nodes and so forth. Assuming that can be done, would it be a possibility to perform repeated classifications at a given node level?

Example:

  1. whether something is a person or a non-person at top level.
  2. if it is a non-person, is it living or non-living.
  3. if living, is it a plant or an animal.
  4. if animal, is it a mammal, amphibian, reptile and so forth.
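The repeated classification above could be sketched like this. The `TAXONOMY` table just mirrors the four steps of the example, and `choose` is a hypothetical callable standing in for whatever classifier (a model call, a rule, a lookup) picks one option at each level.

```python
# Levels taken from the example above; any category missing from the table is a leaf.
TAXONOMY = {
    "root": ["person", "non-person"],
    "non-person": ["living", "non-living"],
    "living": ["plant", "animal"],
    "animal": ["mammal", "amphibian", "reptile"],
}

def classify_path(item, choose, taxonomy=TAXONOMY, start="root"):
    """Classify `item` level by level and return the chosen categories in order.

    `choose(item, options)` is a hypothetical classifier returning one option.
    """
    path = []
    current = start
    while current in taxonomy:
        options = taxonomy[current]
        picked = choose(item, options)
        if picked not in options:
            raise ValueError(f"classifier returned {picked!r}, expected one of {options}")
        path.append(picked)
        current = picked
    return path
```

Because each step only offers a handful of options, every individual classification prompt stays small, which may help with the token limit.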

I would be very excited and curious to know what comes out of your exploration.

Hey, this is really interesting. My start-up is working with Gov data in the UK. If you fancy a chat about this, let me know.