I’m looking to create a system that can decipher the relationships between various compliance standards and frameworks (e.g. ISO 27001, EU GDPR, PCI DSS, NIS CAF etc.). The goal is to identify when implementing a control for one standard could also meet the criteria of another, facilitating intelligent cross-mapping to enhance compliance efficiency.
While I’m still not sure exactly how best to approach this (any input very welcome), the obvious logical first step seems to be dataset preparation. How can I review and extract all compliance obligations from extensive (sometimes 90+ pages long) PDF documents? Especially when the taxonomy varies wildly between standards.
I feel like a graph database with embeddings would be a good fit for this task. You could use something like GPT to split the documents into their component controls, then embed each one for tasks like grouping and attribute matching.
My first step would be to transform a chunk of the PDFs into a more digestible format like markdown, then run it through both ChatGPT and an embedding model to see how well each handles it.
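As a minimal sketch of that first step, assuming the raw text has already been pulled out of the PDF (e.g. with pypdf or pdfplumber), a crude heuristic can promote numbered clauses to markdown headings. The regex and the clause format are assumptions; real standards documents will need more rules:

```python
import re

def to_markdown(raw_text: str) -> str:
    """Rough normalization of extracted PDF text into markdown.

    Assumes text extraction has already happened; treats numbered
    clauses like "A.5.1 Policies for ..." as section headings.
    """
    lines = []
    for line in raw_text.splitlines():
        stripped = line.strip()
        if not stripped:
            lines.append("")
        # Promote clause numbers (e.g. "A.5.1 ...", "7.1 ...") to headings.
        elif re.match(r"^[A-Z]?\.?\d+(\.\d+)*\s+\S", stripped):
            lines.append(f"## {stripped}")
        else:
            lines.append(stripped)
    return "\n".join(lines)

sample = "A.5.1 Policies for information security\nA policy set shall be defined."
print(to_markdown(sample))
```

The markdown output is then ready to chunk and send to the LLM or the embedding model.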
Then maybe ask GPT to identify commonalities between the documents that could be stored as attributes?
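Once each control is embedded, the cross-standard matching idea can be sketched with plain cosine similarity. The control names, the 3-d vectors, and the 0.95 threshold below are all toy stand-ins for real embedding-model output:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-d vectors standing in for real embeddings of control texts.
controls = {
    "ISO 27001 A.9.2 access control":  [0.9, 0.1, 0.1],
    "PCI DSS 7.1 restrict access":     [0.8, 0.2, 0.1],
    "GDPR Art.33 breach notification": [0.1, 0.9, 0.2],
}

# Pair up controls whose similarity clears a threshold.
names = list(controls)
pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
         if cosine(controls[a], controls[b]) > 0.95]
print(pairs)  # the two access-control clauses pair up; the GDPR one doesn't
```

With real embeddings you would replace the toy vectors with model output and tune the threshold empirically.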
I just did a quick scan of Neo4j case studies, and this one might be relevant?
The simplest approach might be to submit 30ish pages at a time with a prompt like: “Please extract into a bullet point list all the facts from the following text:\n\n${content}”
Then you build up (with a little manual work on your part) a list of features for each standard. Finally, submit one last prompt containing each bullet-point list as a separate section, and ask the LLM to find and list all similarities that appear across multiple sets.
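The batching step above can be sketched like this. The prompt template comes from the thread; the page-based batch size is a crude proxy for token counting, and the actual API call (OpenAI or otherwise) is left out:

```python
PROMPT = ("Please extract into a bullet point list all the facts "
          "from the following text:\n\n{content}")

def batched_prompts(pages, batch_size=30):
    """Group page texts into ~30-page batches and wrap each in the prompt."""
    for i in range(0, len(pages), batch_size):
        yield PROMPT.format(content="\n".join(pages[i:i + batch_size]))

pages = [f"page {n}" for n in range(1, 71)]   # 70 dummy page texts
prompts = list(batched_prompts(pages))
print(len(prompts))  # 70 pages at 30 per batch -> 3 prompts
```

Each yielded prompt would then be sent to the LLM, and the returned bullet lists accumulated per standard.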
At least doing this exercise might let you know how to proceed further even if it’s not a full solution to what you want.
I’d be interested to see how this one progresses. I’m also just starting to delve into the world of vector DBs and how they can be used efficiently to digest data.
Vector databases are super cool. For tinkering and learning I’d recommend Weaviate. They have all the cool functions, combining data vectorization with generative AI, they make it all very intuitive and easy to get going, and they do the heavy lifting so you can focus on tinkering. It’s truly a BEAST of a vector database.
I’d go with Pinecone instead, for their amazing documentation. Weaviate’s documentation is very barebones; no hand-holding at all.
What I was referring to is a Graph Database which is a bit different to a Vector Database.
I freaking love Neo4j. I think graph databases are the future. The visualization and cypher language are a perfect combination for finding and visualizing insights.
For a couple of projects I’m required to use SQL-likes and I don’t mind them, but Cypher is just so much nicer to use.
MATCH (p:Product)-[:CATEGORY]->(:ProductCategory)-[:PARENT*0..]
->(:ProductCategory {name:"Dairy Products"})
RETURN p.name
VS
SELECT p.ProductName
FROM Product AS p
JOIN ProductCategory pc ON p.CategoryID = pc.CategoryID
WHERE pc.CategoryName = 'Dairy Products'
UNION
SELECT p.ProductName
FROM Product AS p
JOIN ProductCategory pc1 ON p.CategoryID = pc1.CategoryID
JOIN ProductCategory pc2 ON pc1.ParentID = pc2.CategoryID
WHERE pc2.CategoryName = 'Dairy Products'
UNION
SELECT p.ProductName
FROM Product AS p
JOIN ProductCategory pc3 ON p.CategoryID = pc3.CategoryID
JOIN ProductCategory pc4 ON pc3.ParentID = pc4.CategoryID
JOIN ProductCategory pc5 ON pc4.ParentID = pc5.CategoryID
WHERE pc5.CategoryName = 'Dairy Products';
I like to use graph databases to structure, organize, and visualize semantic data like social media posts.
I actually just gathered all the data I could, then asked ChatGPT to generate as many Cypher queries as possible that might surface some powerful insights. Super freaking intuitive language: () = node, -[]-> = relationship.
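Tying this back to the compliance use case: once controls are loaded as nodes, linking the overlapping ones is one parameterized Cypher query run through the neo4j Python driver. The `Control` label, `OVERLAPS` relationship, and connection details below are hypothetical names, not from the thread:

```python
# Hypothetical schema: (:Control {id}) nodes, linked by [:OVERLAPS]
# when two controls from different standards cover the same obligation.
LINK_QUERY = """
MATCH (a:Control {id: $a}), (b:Control {id: $b})
MERGE (a)-[:OVERLAPS]->(b)
"""

def run_link(session, a_id, b_id):
    """Execute the linking query via a neo4j driver session."""
    session.run(LINK_QUERY, a=a_id, b=b_id)

# Against a live database this would look like:
#   from neo4j import GraphDatabase
#   driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "..."))
#   with driver.session() as s:
#       run_link(s, "ISO27001-A.9.2", "PCIDSS-7.1")
```

Using query parameters (`$a`, `$b`) rather than string interpolation keeps the query plan cacheable and avoids injection issues.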