(Unofficial) Weekend Project / Hackathon 1: JFK Files

PaulBellow · March 21, 2025, 10:09pm

Who can dev the most interesting thing with OpenAI tech (API) + recently released JFK files?

Ready, steady, go!

jochenschultz · March 22, 2025, 2:10pm

Let’s go yay

Well, the first thing I did was write a file downloader script, which put all the PDF files into one folder on my local machine.
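A minimal sketch of what such a downloader could look like, not the actual script from this post. It assumes the index page lists the files as plain `<a href="….pdf">` links; the function names and the idea of passing the index URL in by hand are illustrative assumptions.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class PdfLinkParser(HTMLParser):
    """Collects absolute URLs of all <a> links that end in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.lower().endswith(".pdf"):
            url = urljoin(self.base_url, href)
            if url not in self.links:  # keep first occurrence only
                self.links.append(url)


def extract_pdf_links(html, base_url):
    """Return de-duplicated absolute PDF URLs found in an HTML page."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links


def download_all(urls, dest_dir):
    """Download every URL into dest_dir, skipping files that already exist."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for url in urls:
        target = dest / Path(urlparse(url).path).name
        if target.exists():
            continue
        with urlopen(url) as response:
            target.write_bytes(response.read())
```

To use it, fetch the index page yourself, pass its HTML and URL to `extract_pdf_links`, and hand the result to `download_all`. Skipping existing files makes the script safe to restart after a dropped connection.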

Then I set up a simple pipeline like this:

docker-compose:

minio + rabbitmq + n8n + neo4j + postgresql

setting up rabbitmq (AMQP messenger)

set up an exchange
create a Queue named IncomingQueue and bind it to that exchange
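The exchange, queue and binding can also be written down as data instead of clicked together. This is a rough sketch that builds a definitions document in the shape the RabbitMQ management plugin exports, as far as I understand that format. Only `IncomingQueue` comes from the setup above; the exchange name and routing key in the usage note are made up.

```python
import json


def build_definitions(exchange, queue, routing_key, vhost="/"):
    """Return a definitions document with one exchange, one queue and one binding."""
    return {
        "exchanges": [{
            "name": exchange, "vhost": vhost, "type": "direct",
            "durable": True, "auto_delete": False, "internal": False,
            "arguments": {},
        }],
        "queues": [{
            "name": queue, "vhost": vhost,
            "durable": True, "auto_delete": False, "arguments": {},
        }],
        "bindings": [{
            # The binding connects the exchange (source) to the queue (destination).
            "source": exchange, "vhost": vhost,
            "destination": queue, "destination_type": "queue",
            "routing_key": routing_key, "arguments": {},
        }],
    }
```

For example, `json.dumps(build_definitions("incoming-exchange", "IncomingQueue", "incoming"), indent=2)` gives a file that can be kept next to the docker-compose setup.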

setting up minio (S3) storage

added a bucket called incoming
added an event (file PUT event) that sends an AMQP message to the IncomingQueue every time a file is put into the bucket
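On the consuming side, the message body has to be unpacked to find out which file was uploaded. A small sketch, assuming the body follows the S3-style notification layout (a `Records` list with `eventName` and `s3.bucket.name` / `s3.object.key`) and that object keys arrive URL-encoded; the function name is hypothetical.

```python
import json
from urllib.parse import unquote_plus


def parse_put_event(body):
    """Return (bucket, key) pairs for every created object in one event message."""
    event = json.loads(body)
    created = []
    for record in event.get("Records", []):
        # Ignore deletes and other event types.
        if not record.get("eventName", "").startswith("s3:ObjectCreated:"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Assumption: keys are URL-encoded, with spaces sent as "+".
        key = unquote_plus(record["s3"]["object"]["key"])
        created.append((bucket, key))
    return created
```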

creating a n8n workflow

creating 2 python services

one is invoked via a POST request with a file location and bucket: it takes the PDF from minio, splits it into single pages, creates a PNG for each page and sends all PNGs to the incoming bucket (which triggers another file PUT event message)
and one grabs the result of an image analysis (using gpt4o) and, with a little data transformation, creates nodes in neo4j
the neo4j importer also has tesseract and spacy installed, so it stores a geojson from the OCR and uses spacy (Named Entity Recognition) to extract persons, dates and locations
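Because the page PNGs go back into the same bucket, every upload has to be routed by file type, otherwise the PDF splitter would be called on its own output. A sketch of that decision and of a page naming scheme; the service names and the `page-0001.png` pattern are assumptions for illustration, not the real workflow.

```python
from pathlib import PurePosixPath


def page_key(pdf_key, page_number):
    """Object key for one rendered page: 'a/doc.pdf', 1 -> 'a/doc/page-0001.png'."""
    pdf = PurePosixPath(pdf_key)
    return str(pdf.with_suffix("") / f"page-{page_number:04d}.png")


def route(key):
    """Pick the service that should handle a newly uploaded object."""
    suffix = PurePosixPath(key).suffix.lower()
    if suffix == ".pdf":
        return "pdf_to_png"   # split into pages and upload the PNGs
    if suffix == ".png":
        return "analysis"     # image analysis and graph import
    return "ignore"           # anything else is left alone
```

Zero-padded page numbers keep the pages in order when the bucket is listed alphabetically, which matters for a PDF with 3000+ pages.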

Then I started a test run (without NER and tesseract) and asked gpt4o for a document classification and text extraction on ~1000 pages.

The center is the folder in which the files are located.

I picked 1000 because that should be enough for statistical relevance.
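As a rough check on the sample size, here is the usual normal-approximation margin of error for an estimated share (for example "what fraction of pages are memos"). It assumes the pages are a random sample of the whole collection, which pages taken from a single folder may not be.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of the 95% normal-approximation interval for a proportion.

    p=0.5 is the worst case; z=1.96 corresponds to 95% confidence.
    """
    return z * math.sqrt(p * (1 - p) / n)
```

For n = 1000 this comes out at about 0.031, so a document type seen on 10% of the sampled pages would be estimated to within roughly ±3 percentage points.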

With that I had a list of different document types:

Then I took an OCR test tool that I made a couple of weeks ago with a few o3-mini prompts and ran a few files through it to see how well the geojson extraction with tesseract works on them.
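One way to get from tesseract to geojson is to go through its TSV output, which has one row per layout element with `left`, `top`, `width`, `height`, `conf` and `text` columns. The sketch below is an assumption about how such a conversion could work, not the tool mentioned above; coordinates stay in image pixels rather than longitude/latitude.

```python
import csv
import io


def tsv_to_geojson(tsv_text, min_conf=0.0):
    """Convert Tesseract TSV output into a GeoJSON FeatureCollection of word boxes."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    features = []
    for row in reader:
        text = (row.get("text") or "").strip()
        # Level 5 rows are single words; lower levels are pages, blocks and lines.
        if row["level"] != "5" or not text or float(row["conf"]) < min_conf:
            continue
        x, y = int(row["left"]), int(row["top"])
        w, h = int(row["width"]), int(row["height"])
        # Closed ring around the word box, in pixel coordinates.
        ring = [[x, y], [x + w, y], [x + w, y + h], [x, y + h], [x, y]]
        features.append({
            "type": "Feature",
            "geometry": {"type": "Polygon", "coordinates": [ring]},
            "properties": {"text": text, "conf": float(row["conf"]),
                           "page": int(row["page_num"])},
        })
    return {"type": "FeatureCollection", "features": features}
```

With the words stored as polygons, grouping them into lines, stamps or signature blocks becomes a spatial query.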

Well, and a few minutes ago I finally got spacy working (stupid en_core_web_sm model *§%§$&§%&“”! - that really took me some time to install)

But NER seems to work now.
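Once NER returns entities as (text, label) pairs, turning them into graph nodes is mostly bookkeeping. A sketch that builds parameterised Cypher statements; the `Page` node, the `MENTIONS` relationship and the label mapping are made-up schema choices, and the labels `PERSON`, `DATE`, `GPE` and `LOC` are the ones the English spacy models use.

```python
# Map NER labels to node labels; everything else is skipped.
LABEL_TO_NODE = {"PERSON": "Person", "DATE": "Date",
                 "GPE": "Location", "LOC": "Location"}


def entities_to_cypher(page_id, entities):
    """Build (query, params) pairs that link a page to its named entities."""
    statements = []
    seen = set()
    for text, label in entities:
        node_label = LABEL_TO_NODE.get(label)
        name = " ".join(text.split())  # collapse OCR line breaks and double spaces
        if node_label is None or not name or (node_label, name) in seen:
            continue
        seen.add((node_label, name))
        # Node labels cannot be query parameters, so they come from the
        # fixed mapping above; the values themselves are passed as parameters.
        query = (
            "MERGE (p:Page {id: $page_id}) "
            f"MERGE (e:{node_label} {{name: $name}}) "
            "MERGE (p)-[:MENTIONS]->(e)"
        )
        statements.append((query, {"page_id": page_id, "name": name}))
    return statements
```

Using MERGE instead of CREATE means the same person mentioned on many pages ends up as one node with many incoming edges, which is what makes the graph useful.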

Data preparation is the main task…

I’ll make a CustomGPT soon. This way everyone can chat with the files.

Why did I do this? This looks overengineered af!

There is a pdf with 3000+ pages and the pipeline can easily handle that.

ahh and what are the other circles around that file (with the NER extraction)

Here is the prompt that I am using atm:

You are an intelligence analyst reviewing newly released JFK documents.  
Your task is to analyze the given page and evaluate how likely it is to provide new or previously unknown insights related to key unresolved questions about the JFK assassination.

Return the following as a single JSON object. 
It must include the extracted text in full, the document_type and the likelihood for each of the 20 categories that we want to analyse.

Each point should receive a likelihood score between 1 (no relevance) and 100 (high relevance). Also fill a "highlights" field with stuff that stands out.

Something that is worth 100 would be a clear sentence like Oswald was in Mexico or he wasn't alone.

If there is no really interesting and new stuff, e.g. something that might be important only in the greater scheme, like manuals or descriptions or other bureaucratic stuff, then it would be more in the 3 to 5 range. 20 would already mean there is some sort of new information clearly described. And the sentence needs to be repeated in highlights.

We have ~80k documents. There can't be more than 20 highlights that are really highlights. So the probability that something is very very very important is very low.

Expected output format:

{
  "text": "[Original input text, truncated to 500 characters]",
  "document_type": "[Same as input]",
  "Unknown contacts or meetings": [1-100],
  "Secret agreements": [1-100],
  "Unpublished speeches or writings": [1-100],
  "Unknown foreign operations": [1-100],
  "Covert operations": [1-100],
  "Financial transactions": [1-100],
  "Security threats": [1-100],
  "Internal conflicts": [1-100],
  "Unknown health issues": [1-100],
  "Connections to the Mafia": [1-100],
  "Media strategies": [1-100],
  "Unreleased photos or videos": [1-100],
  "Reactions to international crises": [1-100],
  "Personal correspondence": [1-100],
  "Unknown legislative initiatives": [1-100],
  "Relationships with other heads of state": [1-100],
  "Secret surveillance or wiretapping": [1-100],
  "Unspoken speeches": [1-100],
  "Internal polling or public opinion": [1-100],
  "Unknown notes or annotations": [1-100],
  "highlights": "[Key quotes, names, dates, or findings]"
}
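Since the model does not always stick to the format, it helps to check each reply before it goes into the database. A minimal validator sketch for the output format above; the category names are copied from the prompt, while the function name and the error messages are illustrative.

```python
import json

CATEGORIES = [
    "Unknown contacts or meetings", "Secret agreements",
    "Unpublished speeches or writings", "Unknown foreign operations",
    "Covert operations", "Financial transactions", "Security threats",
    "Internal conflicts", "Unknown health issues", "Connections to the Mafia",
    "Media strategies", "Unreleased photos or videos",
    "Reactions to international crises", "Personal correspondence",
    "Unknown legislative initiatives", "Relationships with other heads of state",
    "Secret surveillance or wiretapping", "Unspoken speeches",
    "Internal polling or public opinion", "Unknown notes or annotations",
]


def validate_analysis(raw):
    """Parse the model's reply and return (data, problems)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return None, ["top-level value is not an object"]
    problems = []
    for field in ("text", "document_type", "highlights"):
        if field not in data:
            problems.append(f"missing field: {field}")
    for name in CATEGORIES:
        score = data.get(name)
        # Scores must be whole numbers from 1 to 100 (bool is excluded explicitly).
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 100:
            problems.append(f"bad score for {name!r}: {score!r}")
    return data, problems
```

Pages with a non-empty `problems` list can be sent back through the queue for a retry instead of silently ending up in the graph with missing scores.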

Would really appreciate some help in terms of “what am I even looking for???”

Have no clue - except for that I know how to grab whatever information anyone wants to look for.

Here is a little diagram that shows the overall process.

flowchart LR

%% =============== Local PDF Collection ============
subgraph LocalPdfCollection["<div style='background-color:black; color:yellow; font-weight:bold; padding:6px;'>Local PDF Collection</div>"]
    direction TB
    A[FolderWithPdfFiles]
end

%% =============== Minio ============
subgraph Minio["<div style='background-color:black; color:yellow; font-weight:bold; padding:6px;'>Minio</div>"]
    direction TB
    B[Minio Bucket: 'incoming']
    C[PUT Event => AMQP message]
end

%% =============== RabbitMQ ============
subgraph RabbitMQ["<div style='background-color:black; color:yellow; font-weight:bold; padding:6px;'>RabbitMQ</div>"]
    direction TB
    D[Exchange]
    E[Queue: 'IncomingQueue']
end

%% =============== n8n ============
subgraph n8n["<div style='background-color:black; color:yellow; font-weight:bold; padding:6px;'>n8n</div>"]
    direction TB
    F[Workflow controlling data flow]
end

%% =============== Python Services ============
subgraph PythonServices["<div style='background-color:black; color:yellow; font-weight:bold; padding:6px;'>Python Services</div>"]
    direction TB
    P1[Python Service: PDF → PNG]
    P2[Python Service: Analysis & Transform]
end

%% =============== Analysis Part ============
subgraph AnalysisPart["<div style='background-color:black; color:yellow; font-weight:bold; padding:6px;'>Analysis Part</div>"]
    direction TB
    O1[Tesseract OCR]
    O2[OpenAI image analysis]
    O3[Spacy NER]
    O4[PostGIS grouping on PostgreSQL]
    O5[Cypher in Neo4j]
    O1 --> O2
    O2 --> O3
    O3 --> O4
    O4 --> O5
end

%% =============== Faiss Index ============
subgraph FaissIndex["<div style='background-color:black; color:yellow; font-weight:bold; padding:6px;'>Faiss Index</div>"]
    direction TB
    FX[Faiss for embeddings]
end

%% =============== RAG Chat ============
subgraph RagChat["<div style='background-color:black; color:yellow; font-weight:bold; padding:6px;'>RAG Chat</div>"]
    direction TB
    RC[Python Service: GPT-based RAG Chat]
end

%% =============== Databases ============
subgraph Databases["<div style='background-color:black; color:yellow; font-weight:bold; padding:6px;'>Databases</div>"]
    direction TB
    PG[PostgreSQL]
    N4J[Neo4j]
end

%% Flows
A --> B
B --> C
C --> D
D --> E
E --> F
F --> P1
F --> P2
P1 --> B

P2 --> AnalysisPart
AnalysisPart <--> PG
AnalysisPart <--> N4J

AnalysisPart --> FX
FX <--> RC
N4J --> RC
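For the Faiss box in the diagram, the retrieval step of the RAG chat boils down to "find the stored page vectors closest to the question vector". The sketch below is a brute-force stand-in in plain Python, not Faiss itself; it ranks by cosine similarity, which is what an exact inner-product index returns when all vectors are normalised. The names and the tiny 2-d vectors are for illustration only.

```python
import math


def top_k(query, vectors, k=3):
    """Return the k (id, score) pairs with the highest cosine similarity to query."""
    def norm(v):
        # Fall back to 1.0 so an all-zero vector does not divide by zero.
        return math.sqrt(sum(x * x for x in v)) or 1.0

    q_norm = norm(query)
    scored = []
    for doc_id, vec in vectors.items():
        dot = sum(a * b for a, b in zip(query, vec))
        scored.append((doc_id, dot / (q_norm * norm(vec))))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]
```

The ids returned here would then be looked up in Neo4j or PostgreSQL to fetch the page text that goes into the chat prompt.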

PaulBellow · March 22, 2025, 6:01pm

Me thinks you’re gonna win by default!

I’ll take a closer look a bit later when I have some more time.

Looks impressive, though! Maybe look for Umbrella Guy? Grassy knoll?

jochenschultz · March 22, 2025, 6:04pm

The simple keyword searches wouldn’t need AI. Just OCR and sed + awk. You could probably do that with a one-liner on the CLI.

Hey, let’s ask chatgpt to solve this with one line of code

update

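for f in *.pdf; do
  echo "Processing: $f"
  # Render every page to temp_page-<n>.png (pdftoppm is part of poppler-utils).
  if pdftoppm -png "$f" temp_page; then
    for img in temp_page-*.png; do
      page="${img#temp_page-}"
      # OCR the page and report case-insensitive matches.
      tesseract "$img" stdout 2>/dev/null | grep -i "lee harvey oswald" && echo "  Match in $f (page ${page%.png})"
    done
  fi
  # -f keeps rm quiet when no pages were rendered.
  rm -f temp_page-*.png
done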
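One catch with grep on OCR output is that it works line by line, so a name that wraps across two lines is missed. A small sketch of a search that ignores case and treats any whitespace, including line breaks, as a word separator; it assumes the OCR text per page is already available as a dict, and the function name is made up.

```python
import re


def find_phrase(pages, phrase):
    """Return ids of pages whose OCR text contains phrase, ignoring case and line breaks."""
    words = [re.escape(word) for word in phrase.split()]
    # Allow any run of whitespace (spaces, tabs, newlines) between the words.
    pattern = re.compile(r"\s+".join(words), re.IGNORECASE)
    return [page_id for page_id, text in pages.items() if pattern.search(text)]
```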

what I am looking for is stuff like “why was this stupid looking command required to be signed by a cia agent with a very high position”…

unusual stuff…

*ok technically OCR is also AI

jochenschultz · March 22, 2025, 8:02pm

The system is highly dynamic. It can be used for medical research, financial analysis, trading… I’m pretty sure it will solve some real-world problems.

Daller · March 22, 2025, 10:00pm

I find the method especially interesting. The time is certainly not wasted: even if the topic doesn’t interest everyone, the technique and the knowledge will always be there. Done manually, it would definitely be too much work and time, so why not use an LLM for what it can do? There are also 1 or 2 people using their GPUs to kill zombies…
I’ve wanted to do this kind of data analysis for different things for decades. Now, finally, it’s possible.

phyde1001 · March 23, 2025, 1:52am

The real challenge is finding the right problems and the connections to them…

This is indeed the point I have been trying to get over with my macros… And indeed what OpenAI have been implementing:

In the age of AI everything is a one liner… The Challenge instead becomes finding the right problems and unravelling them from the mess they are in.

I guess what I’m suggesting is that a real challenge might not be showing how to delve into an abstract dataset but linking up datasets with real-world positive impact use-cases.

I mean I hope this isn’t seen as ‘self-promotional’ in any way, maybe this is a longer term plan of OpenAI GPTs or maybe I missed something else…

But delving into HS Codes and Tariffs data or Climate Data or Conflict Data, maybe these types of projects are too challenging for the OpenAI forum but opening up real world data sources alongside the easy code, isn’t that where progress starts.

I don’t come here to knock what you are doing but to challenge it. These are the sorts of things I expected on this forum when I considered it long before OpenAI existed.

While not all ‘Developers’ might be coders, multi-faceted ‘projects’ have the potential to engage a more diverse audience with OpenAI API providing the backbone to pull it all together… Make it modular, use Leader, Regular and Member offered talents and time and have OpenAI back it with compute if it’s something they can back… It fits OpenAI goals of promoting their API right?

And yes I use my GPUs to kill zombies but more importantly socialize… While I use my brain to think about stuff… Back when I started gaming online I set my GPU to problems like this though soon realized that I was raindrops in the sea… People first… Invest in skills and compute)

liam.dxaviergs · March 23, 2025, 3:20am

I always thought there never was an assassin. I always thought it was the Secret Service agent dropping his rifle 20-odd metres away from the car, and the whole thing was a cover-up to hide the incompetence and preserve face internationally.

jochenschultz · March 23, 2025, 5:29am

Who knows but maybe they just mined bitcoin with your GPU.

razvan.i.savin · March 23, 2025, 6:22am

They didn’t stop, and trained LLMs → and greedy Zucky wants to use users’ phones (for the greater good)

phyde1001 · March 23, 2025, 8:23am

That would have been smart, this was 5 years before Bitcoin

jochenschultz · March 23, 2025, 8:26am

I am sure that bitcoin because of its useless energy consumption is directly responsible for millions of climate change related deaths in the upcoming famines.

From my perspective there is nothing smart in it. It was and is and will ever be a scam scheme.

However - there is one thing worse than that: posting bullshit in a thread about jfk files.

Why don’t you post your own ideas on how to solve it technically or ask for the deletion of your account?

I truly believe there are exactly two types of posts acceptable in this community:

posts about technical solutions, problems and help in that regard
domain specific support (e.g. when there is something like this jfk file analysis you can add your domain specific knowledge about the topic)

jochenschultz · March 23, 2025, 8:37am

Yo wtf. Seriously what is wrong with you?

phyde1001 · March 23, 2025, 8:39am

I’m sorry Jochen, like I said I wasn’t trying to knock what you were doing, I just thought it would be nice to see a practical solution from your expertise.

I only ever used one forum before and members and regulars would post challenging ideas and the leaders would discuss them with the members.

This is more of a top down forum I think, there were only Leaders, Regulars and 2 blocked members posts in the thread so I joined in.

I read the ‘(Unofficial) Weekend Project’ slash Hackathon as that…So posted some ideas above.

I did contribute with the ChatGPT PDF Image Translator above…

jochenschultz · March 23, 2025, 8:41am

Then send me a private message or open a new topic like “my 2 cents about the hackathon thread”

phyde1001 · March 23, 2025, 8:42am

Well and everyone else’s of course… Weekend Projects…

jochenschultz · March 23, 2025, 8:45am

Yes, of course everyone else. Why would you start an offtopic discussion on everything that pops up?

Or even better, search for a post that someone started that relates to what you want to say. How about a separate category for guys like you with the title “unrelated stuff”?

And they kicked you out too?

phyde1001 · March 23, 2025, 8:46am

The title suggests this is on topic

Weekend Project Hackathon 1:

Are these top down decided projects or user contributed forum discussion?

Historically I have been encouraged to start threads like this… Now you’re saying I’m not welcome to join in?
