(Unofficial) Weekend Project / Hackathon 1: JFK Files

Let’s go yay

Well, the first thing I did was create a file downloader script:

which downloaded all the PDF files into one folder on my local machine.
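The downloader boils down to something like this sketch - the listing URL and the link pattern are my assumptions, since the actual script isn't shown:

```python
import re
import pathlib
import urllib.request

# Hypothetical listing page - the real release URL is an assumption here
BASE_URL = "https://www.archives.gov/research/jfk/release-2025"

def pdf_links(html: str) -> list[str]:
    """Extract all .pdf hrefs from a listing page's HTML."""
    return re.findall(r'href="([^"]+\.pdf)"', html)

def download_all(html: str, out_dir: str = "jfk_pdfs") -> None:
    """Download every linked PDF into one local folder."""
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    for link in pdf_links(html):
        name = link.rsplit("/", 1)[-1]
        urllib.request.urlretrieve(link, out / name)
```

For a real run you'd fetch `BASE_URL` with `urllib.request.urlopen`, pass the HTML to `download_all`, and maybe throttle the requests.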


Then I set up a simple pipeline like this:

docker-compose:

minio + rabbitmq + n8n + neo4j + postgresql
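The compose file looks roughly like this - image tags, ports, and credentials are assumptions, not the author's actual config:

```yaml
services:
  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    ports: ["9000:9000", "9001:9001"]
  rabbitmq:
    image: rabbitmq:3-management
    ports: ["5672:5672", "15672:15672"]
  n8n:
    image: n8nio/n8n
    ports: ["5678:5678"]
  neo4j:
    image: neo4j:5
    ports: ["7474:7474", "7687:7687"]
  postgres:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: example
```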

  1. Setting up RabbitMQ (AMQP messaging)
  • set up an exchange
  • created a queue named IncomingQueue and bound it to that exchange
  2. Setting up MinIO (S3) storage
  • added a bucket called incoming
  • added an event notification on file PUT events, which sends an AMQP message to the IncomingQueue every time a file is put into the bucket
  3. Creating an n8n workflow

  4. Creating two Python services
  • one is invoked via a POST request with a file location and bucket - it takes the PDF from MinIO, splits it into single pages, creates a PNG for each page, and sends all PNGs to the incoming bucket (which triggers another file PUT event message)

  • the other grabs the result of an image analysis (using GPT-4o) and, with a little data transformation, creates nodes in Neo4j.
    The Neo4j importer also has Tesseract installed, so it stores a GeoJSON from the OCR and uses spaCy (Named Entity Recognition) to extract persons, dates, and locations.
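The entity-to-graph transformation could look like this sketch - the node labels (`Page`, `Person`, `Date`, `Location`) and the `MENTIONS` relationship are my assumptions about the schema, not the actual importer code:

```python
def entities_to_cypher(doc_id: str, ents: list[tuple[str, str]]) -> list[tuple[str, dict]]:
    """Turn spaCy-style (text, label) entity tuples into parameterized
    Cypher statements that link each entity node to its source page."""
    label_map = {"PERSON": "Person", "DATE": "Date", "GPE": "Location", "LOC": "Location"}
    stmts = [("MERGE (p:Page {id: $id})", {"id": doc_id})]
    for text, label in ents:
        node = label_map.get(label)
        if node is None:  # skip entity types the pipeline doesn't store
            continue
        stmts.append((
            f"MATCH (p:Page {{id: $id}}) "
            f"MERGE (e:{node} {{name: $name}}) "
            f"MERGE (p)-[:MENTIONS]->(e)",
            {"id": doc_id, "name": text},
        ))
    return stmts
```

Each `(statement, params)` pair would then be run through the official Neo4j Python driver, e.g. `session.run(stmt, **params)`.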

Then I started a test run (without NER and Tesseract) and asked GPT-4o for a document classification and text extraction across ~1000 pages.

The center is the folder in which the files are located.

1000 pages because that should be enough for statistical relevance.

With that I had a list of different document types:

Then I used an OCR test tool I made a couple of weeks ago (with a few o3-mini prompts) to run a few files and see how well Tesseract's GeoJSON extraction works on them.
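I'm reading "GeoJSON extraction" as word bounding boxes serialized as GeoJSON polygons in pixel space - a sketch of that transform, assuming input shaped like pytesseract's `image_to_data(..., output_type=Output.DICT)` result:

```python
def boxes_to_geojson(data: dict) -> dict:
    """Convert Tesseract word boxes (parallel 'left'/'top'/'width'/'height'/'text'
    lists) into a GeoJSON FeatureCollection, using pixel coordinates as the plane."""
    features = []
    for i, word in enumerate(data["text"]):
        if not word.strip():  # Tesseract emits empty rows for layout elements
            continue
        x, y = data["left"][i], data["top"][i]
        w, h = data["width"][i], data["height"][i]
        # Closed polygon ring around the word's bounding box
        ring = [[x, y], [x + w, y], [x + w, y + h], [x, y + h], [x, y]]
        features.append({
            "type": "Feature",
            "geometry": {"type": "Polygon", "coordinates": [ring]},
            "properties": {"text": word},
        })
    return {"type": "FeatureCollection", "features": features}
```

Storing positions this way makes the later PostGIS grouping step straightforward, since PostGIS can ingest GeoJSON geometries directly.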

Well, a few minutes ago I finally got spaCy working (stupid en_core_web_sm model *§%§$&§%&! - that really took me some time to install).

But NER seems to work now.
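For reference, the spaCy part boils down to roughly this (the model has to be installed first via `python -m spacy download en_core_web_sm`; the label filter is my assumption about which entity types the pipeline keeps):

```python
# With spaCy installed, extraction looks like:
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   ents = [(e.text, e.label_) for e in nlp(page_text).ents]

# spaCy labels for persons, dates, and locations (GPE = geopolitical entity)
WANTED = {"PERSON", "DATE", "GPE", "LOC"}

def keep_wanted(ents: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep only the entity types the pipeline stores."""
    return [(text, label) for text, label in ents if label in WANTED]
```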

Data preparation is the main task…

I’ll make a CustomGPT soon. This way everyone can chat with the files.


Why did I do this? This looks overengineered af!

There is a PDF with 3000+ pages, and the pipeline handles it easily.

Ahh, and what are the other circles around that file? Those are the NER extractions.

Here is the prompt that I am using atm:

You are an intelligence analyst reviewing newly released JFK documents.  
Your task is to analyze the given page and evaluate how likely it is to provide new or previously unknown insights related to key unresolved questions about the JFK assassination.

Return the following as a single JSON object, which must include the extracted text in full, the document_type, and a likelihood score for each of the 20 categories that we want to analyse.

Each point should receive a likelihood score between 1 (no relevance) and 100 (high relevance). Also fill a "highlights" field with stuff that stands out.

Something worth 100 would be a clear sentence like "Oswald was in Mexico" or "he wasn't alone".

If there is no really interesting and new stuff - e.g. something that might only be important in the greater scheme, like manuals, descriptions, or other bureaucratic material - then it would be more in the 3 to 5 range. A 20 would already mean there is some sort of new information clearly described. And that sentence needs to be repeated in highlights.

We have ~80k documents, and there can't be more than 20 highlights that are really highlights. So the probability that any given page is very, very important is very low.

Expected output format:

{
  "text": "[Original input text, truncated to 500 characters]",
  "document_type": "[Same as input]",
  "Unknown contacts or meetings": [1-100],
  "Secret agreements": [1-100],
  "Unpublished speeches or writings": [1-100],
  "Unknown foreign operations": [1-100],
  "Covert operations": [1-100],
  "Financial transactions": [1-100],
  "Security threats": [1-100],
  "Internal conflicts": [1-100],
  "Unknown health issues": [1-100],
  "Connections to the Mafia": [1-100],
  "Media strategies": [1-100],
  "Unreleased photos or videos": [1-100],
  "Reactions to international crises": [1-100],
  "Personal correspondence": [1-100],
  "Unknown legislative initiatives": [1-100],
  "Relationships with other heads of state": [1-100],
  "Secret surveillance or wiretapping": [1-100],
  "Unspoken speeches": [1-100],
  "Internal polling or public opinion": [1-100],
  "Unknown notes or annotations": [1-100],
  "highlights": "[Key quotes, names, dates, or findings]"
}
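Since model output doesn't always conform to a requested schema, a small validator on the reply is cheap insurance - a sketch, with only two of the twenty category keys spelled out:

```python
import json

CATEGORIES = [
    "Unknown contacts or meetings",
    "Secret agreements",
    # ... the remaining 18 category keys from the prompt above
]

def validate_response(raw: str) -> dict:
    """Parse the model's JSON reply and sanity-check the score ranges."""
    obj = json.loads(raw)
    for key in ("text", "document_type", "highlights"):
        if key not in obj:
            raise ValueError(f"missing field: {key}")
    for cat in CATEGORIES:
        score = obj.get(cat)
        if not isinstance(score, int) or not 1 <= score <= 100:
            raise ValueError(f"bad score for {cat!r}: {score!r}")
    return obj
```

Pages that fail validation can simply be re-queued through RabbitMQ for another attempt.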

Would really appreciate some help in terms of “what am I even looking for???”

I have no clue - except that I know how to grab whatever information anyone wants to look for.

Here is a little diagram that shows the overall process.

flowchart LR

%% =============== Local PDF Collection ============
subgraph LocalPdfCollection["<div style='background-color:black; color:yellow; font-weight:bold; padding:6px;'>Local PDF Collection</div>"]
    direction TB
    A[FolderWithPdfFiles]
end

%% =============== Minio ============
subgraph Minio["<div style='background-color:black; color:yellow; font-weight:bold; padding:6px;'>Minio</div>"]
    direction TB
    B[Minio Bucket: 'incoming']
    C[PUT Event => AMQP message]
end

%% =============== RabbitMQ ============
subgraph RabbitMQ["<div style='background-color:black; color:yellow; font-weight:bold; padding:6px;'>RabbitMQ</div>"]
    direction TB
    D[Exchange]
    E[Queue: 'IncomingQueue']
end

%% =============== n8n ============
subgraph n8n["<div style='background-color:black; color:yellow; font-weight:bold; padding:6px;'>n8n</div>"]
    direction TB
    F[Workflow controlling data flow]
end

%% =============== Python Services ============
subgraph PythonServices["<div style='background-color:black; color:yellow; font-weight:bold; padding:6px;'>Python Services</div>"]
    direction TB
    P1[Python Service: PDF → PNG]
    P2[Python Service: Analysis & Transform]
end

%% =============== Analysis Part ============
subgraph AnalysisPart["<div style='background-color:black; color:yellow; font-weight:bold; padding:6px;'>Analysis Part</div>"]
    direction TB
    O1[Tesseract OCR]
    O2[OpenAI image analysis]
    O3[Spacy NER]
    O4[PostGIS grouping on PostgreSQL]
    O5[Cypher in Neo4j]
    O1 --> O2
    O2 --> O3
    O3 --> O4
    O4 --> O5
end

%% =============== Faiss Index ============
subgraph FaissIndex["<div style='background-color:black; color:yellow; font-weight:bold; padding:6px;'>Faiss Index</div>"]
    direction TB
    FX[Faiss for embeddings]
end

%% =============== RAG Chat ============
subgraph RagChat["<div style='background-color:black; color:yellow; font-weight:bold; padding:6px;'>RAG Chat</div>"]
    direction TB
    RC[Python Service: GPT-based RAG Chat]
end

%% =============== Databases ============
subgraph Databases["<div style='background-color:black; color:yellow; font-weight:bold; padding:6px;'>Databases</div>"]
    direction TB
    PG[PostgreSQL]
    N4J[Neo4j]
end

%% Flows
A --> B
B --> C
C --> D
D --> E
E --> F
F --> P1
F --> P2
P1 --> B

P2 --> AnalysisPart
AnalysisPart <--> PG
AnalysisPart <--> N4J

AnalysisPart --> FX
FX <--> RC
N4J --> RC