I’m looking to create a system that can decipher the relationships between various compliance standards and frameworks (e.g. ISO 27001, EU GDPR, PCI DSS, NIS CAF etc.). The goal is to identify when implementing a control for one standard could also meet the criteria of another, facilitating intelligent cross-mapping to enhance compliance efficiency.
While I’m still not sure exactly how best to approach this (any input very welcome), the obvious logical first step seems to be dataset preparation. How can I review and extract all compliance obligations from extensive (sometimes 90+ pages long) PDF documents? Especially when the taxonomy varies wildly between standards.
I feel like a graph database with embeddings would be a good fit for this task. You could use something like GPT to split the documents into their component controls, then embed each one for tasks like grouping and attribute matching.
My first step would be to transform a chunk of the PDFs into a more digestible format like markdown, then run it through both ChatGPT and an embedding model to see how well each handles it.
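As a minimal sketch of that first step, assuming the raw text has already been pulled out of the PDF (e.g. with pypdf or pdfplumber), a crude heuristic can promote numbered clauses to markdown headings. The regex and the clause format are assumptions; real standards documents will need more rules:

```python
import re

def to_markdown(raw_text: str) -> str:
    """Rough normalization of extracted PDF text into markdown.

    Assumes text extraction has already happened; treats numbered
    clauses like "A.5.1 Policies for ..." as section headings.
    """
    lines = []
    for line in raw_text.splitlines():
        stripped = line.strip()
        if not stripped:
            lines.append("")
        # Promote clause numbers (e.g. "A.5.1 ...", "7.1 ...") to headings.
        elif re.match(r"^[A-Z]?\.?\d+(\.\d+)*\s+\S", stripped):
            lines.append(f"## {stripped}")
        else:
            lines.append(stripped)
    return "\n".join(lines)

sample = "A.5.1 Policies for information security\nA policy set shall be defined."
print(to_markdown(sample))
```

The markdown output is then ready to chunk and send to the LLM or the embedding model.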
Then maybe ask GPT to identify commonalities between the documents that could be stored as attributes?
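Once each control is embedded, the cross-standard matching idea can be sketched with plain cosine similarity. The control names, the 3-d vectors, and the 0.95 threshold below are all toy stand-ins for real embedding-model output:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-d vectors standing in for real embeddings of control texts.
controls = {
    "ISO 27001 A.9.2 access control":  [0.9, 0.1, 0.1],
    "PCI DSS 7.1 restrict access":     [0.8, 0.2, 0.1],
    "GDPR Art.33 breach notification": [0.1, 0.9, 0.2],
}

# Pair up controls whose similarity clears a threshold.
names = list(controls)
pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
         if cosine(controls[a], controls[b]) > 0.95]
print(pairs)  # the two access-control clauses pair up; the GDPR one doesn't
```

With real embeddings you would replace the toy vectors with model output and tune the threshold empirically.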
I just did a quick scan of Neo4j case studies, and this one might be relevant?
The simplest approach might be to submit 30ish pages at a time with a prompt like: “Please extract into a bullet point list all the facts from the following text:\n\n${content}”
Then you build up (with a little manual work on your part) a list of features for each standard. Finally, submit one last prompt containing each bullet-point list as a separate section, and ask the LLM to find and list all similarities that appear across multiple sets.
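The batching step above can be sketched like this. The prompt template comes from the thread; the page-based batch size is a crude proxy for token counting, and the actual API call (OpenAI or otherwise) is left out:

```python
PROMPT = ("Please extract into a bullet point list all the facts "
          "from the following text:\n\n{content}")

def batched_prompts(pages, batch_size=30):
    """Group page texts into ~30-page batches and wrap each in the prompt."""
    for i in range(0, len(pages), batch_size):
        yield PROMPT.format(content="\n".join(pages[i:i + batch_size]))

pages = [f"page {n}" for n in range(1, 71)]   # 70 dummy page texts
prompts = list(batched_prompts(pages))
print(len(prompts))  # 70 pages at 30 per batch -> 3 prompts
```

Each yielded prompt would then be sent to the LLM, and the returned bullet lists accumulated per standard.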
At least doing this exercise might let you know how to proceed further even if it’s not a full solution to what you want.
I’d be interested to see how this one progresses. I’m also just starting to delve into the world of vector DBs and how they can be used efficiently to digest data.
Vector databases are super cool. For tinkering and learning I’d recommend Weaviate. They have all the cool functions, combining data vectorization with generative AI, they make it all very intuitive and easy to get going, and they do the heavy lifting so you can focus on tinkering. It’s truly a BEAST of a vector database.
I’d go with Pinecone instead, for their amazing documentation. Weaviate’s documentation is very barebones; no hand-holding at all.
What I was referring to is a Graph Database which is a bit different to a Vector Database.
I freaking love Neo4j. I think graph databases are the future. The visualization and cypher language are a perfect combination for finding and visualizing insights.
For a couple of projects I’m required to use SQL-likes and I don’t mind them, but Cypher is just so much nicer to use.
MATCH (p:Product)-[:CATEGORY]->(:ProductCategory)-[:PARENT*0..]
->(:ProductCategory {name:"Dairy Products"})
RETURN p.name
VS
SELECT p.ProductName
FROM Product AS p
JOIN ProductCategory pc ON p.CategoryID = pc.CategoryID
WHERE pc.CategoryName = 'Dairy Products'
UNION
SELECT p.ProductName
FROM Product AS p
JOIN ProductCategory pc1 ON p.CategoryID = pc1.CategoryID
JOIN ProductCategory pc2 ON pc1.ParentID = pc2.CategoryID
WHERE pc2.CategoryName = 'Dairy Products'
UNION
SELECT p.ProductName
FROM Product AS p
JOIN ProductCategory pc3 ON p.CategoryID = pc3.CategoryID
JOIN ProductCategory pc4 ON pc3.ParentID = pc4.CategoryID
JOIN ProductCategory pc5 ON pc4.ParentID = pc5.CategoryID
WHERE pc5.CategoryName = 'Dairy Products';
I like to use graph databases to structure, organize, and visualize semantic data like social media posts.
I actually just gathered all the data I could, then asked ChatGPT to generate as many Cypher queries as possible that might surface some powerful insights. Super freaking intuitive language: () = node, -[]-> = relationship.
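Tying this back to the compliance use case: once controls are loaded as nodes, linking the overlapping ones is one parameterized Cypher query run through the neo4j Python driver. The `Control` label, `OVERLAPS` relationship, and connection details below are hypothetical names, not from the thread:

```python
# Hypothetical schema: (:Control {id}) nodes, linked by [:OVERLAPS]
# when two controls from different standards cover the same obligation.
LINK_QUERY = """
MATCH (a:Control {id: $a}), (b:Control {id: $b})
MERGE (a)-[:OVERLAPS]->(b)
"""

def run_link(session, a_id, b_id):
    """Execute the linking query via a neo4j driver session."""
    session.run(LINK_QUERY, a=a_id, b=b_id)

# Against a live database this would look like:
#   from neo4j import GraphDatabase
#   driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "..."))
#   with driver.session() as s:
#       run_link(s, "ISO27001-A.9.2", "PCIDSS-7.1")
```

Using query parameters (`$a`, `$b`) rather than string interpolation keeps the query plan cacheable and avoids injection issues.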