(Unofficial) Weekend Project / Hackathon 1: JFK Files

Who can dev the most interesting thing with OpenAI tech (API) + recently released JFK files?

Ready, steady, go!

5 Likes

Let's go yay

Well, the first thing I did was write a file downloader script, which put all the pdf files into one folder on my local machine.
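The downloader script itself is not shown in the thread. A minimal sketch of what such a script could look like, assuming the pdfs are all linked from one index page. The index URL, the `jfk_pdfs` folder and the function names are placeholders, not the actual script:

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import unquote, urljoin, urlparse
from urllib.request import urlopen


class PdfLinkParser(HTMLParser):
    """Collects the href of every <a> tag that points at a .pdf file."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and urlparse(href).path.lower().endswith(".pdf"):
            # Resolve relative links against the index page URL.
            self.links.append(urljoin(self.base_url, href))


def find_pdf_links(html: str, base_url: str) -> list[str]:
    """Return absolute URLs of all pdf links found in the given HTML."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links


def download_all(index_url: str, target_dir: str = "jfk_pdfs") -> None:
    """Download every pdf linked from index_url into target_dir (hypothetical names)."""
    out = Path(target_dir)
    out.mkdir(parents=True, exist_ok=True)
    with urlopen(index_url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    for url in find_pdf_links(html, index_url):
        # Use the last path segment of the URL as the local file name.
        dest = out / unquote(urlparse(url).path.rsplit("/", 1)[-1])
        if dest.exists():  # skip files already fetched in an earlier run
            continue
        with urlopen(url) as resp:
            dest.write_bytes(resp.read())
```

Skipping files that already exist makes the script safe to re-run after an interrupted download.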


Then I set up a simple pipeline like this:

docker-compose:

minio + rabbitmq + n8n + neo4j + postgresql

  1. setting up rabbitmq (AMQP message broker)
  • set up an exchange
  • created a queue named IncomingQueue and bound it to that exchange
  2. setting up minio (S3) storage
  • added a bucket called incoming bucket
  • added an event (file PUT event) which sends an AMQP message to the IncomingQueue every time a file is put into the bucket
  3. creating an n8n workflow

  4. creating 2 python services
  • one is invoked via a POST request with a file location and bucket: it takes the pdf from minio, splits it into single pages, creates a png for each page and sends all pngs to the incoming bucket (which triggers another file PUT event message)

  • the other grabs the result of an image analysis (using gpt4o) and, with a little data transformation, creates nodes in neo4j
    the neo4j importer also has tesseract installed, so it stores a geojson from the OCR, and spacy (Named Entity Recognition) to extract persons, dates and locations
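How the n8n workflow is wired is not shown here. As a rough sketch of the first step any consumer of IncomingQueue has to do, the snippet below reads the bucket and object key out of a PUT event message and decides which service should handle it. It assumes minio's S3-style notification layout (`Records[].s3.bucket.name`, `Records[].s3.object.key`, with the key URL-encoded). The extension-based routing rule and the service names `pdf_to_png` and `analysis` are assumptions for illustration only:

```python
import json
from urllib.parse import unquote_plus


def parse_put_event(body: bytes) -> list[tuple[str, str]]:
    """Return (bucket, object_key) pairs from an S3-style notification message."""
    event = json.loads(body)
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3-style events.
        pairs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return pairs


def route(key: str) -> str:
    """Pick the service for an object (assumed rule: decide by file extension)."""
    lower = key.lower()
    if lower.endswith(".pdf"):
        return "pdf_to_png"  # hypothetical name for the page-splitting service
    if lower.endswith(".png"):
        return "analysis"  # hypothetical name for the gpt4o + neo4j service
    return "ignore"
```

Because the page pngs go back into the same bucket, the same queue carries both kinds of events, so some rule like `route` is needed to keep pdfs and pngs apart.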

Then I started a test run (without NER and tesseract) and asked gpt4o for a document classification and text extraction on ~1000 pages

The center is the folder in which the files are located.

1000 because that should be enough for statistical relevance.
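As a back-of-the-envelope check on that number: if the ~1000 pages were picked at random (an assumption, the thread does not say how they were chosen), the worst-case margin of error for the share of any document type is about ±3 percentage points at 95% confidence. A small sketch using the standard normal approximation:

```python
import math


def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval for a proportion.

    p=0.5 is the worst case; z=1.96 corresponds to 95% confidence.
    """
    return z * math.sqrt(p * (1 - p) / n)


print(f"{margin_of_error(1000):.3f}")  # 0.031
```

Quadrupling the sample only halves the margin, so 1000 pages is a reasonable trade-off for a first classification pass.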

With that I had a list of different document types:

Then I used an OCR test thing I made a couple of weeks ago with a few o3-mini prompts to run a few files and see how well the geojson extraction of tesseract works on them.

Well, and a few minutes ago I finally got spacy working (stupid en_core_web_sm model *§%§$&§%&“”! - that really took me some time to install)

But NER seems to work now.
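For reference, a minimal sketch of the NER step, assuming the model was installed with `python -m spacy download en_core_web_sm`. The grouping of spaCy labels into persons, dates and locations and the function names are illustrative choices, not the importer's actual code:

```python
from collections import defaultdict
from functools import lru_cache

# spaCy entity labels mapped to the three groups the pipeline cares about.
LABEL_GROUPS = {
    "PERSON": "persons",
    "DATE": "dates",
    "GPE": "locations",  # countries, cities, states
    "LOC": "locations",  # other locations
}


def group_entities(entities):
    """Group (text, label) pairs into persons, dates and locations, dropping duplicates."""
    grouped = defaultdict(list)
    for text, label in entities:
        group = LABEL_GROUPS.get(label)
        if group and text not in grouped[group]:
            grouped[group].append(text)
    return dict(grouped)


@lru_cache(maxsize=1)
def load_model():
    """Load the spaCy model once; imported lazily so grouping works without spaCy."""
    import spacy

    return spacy.load("en_core_web_sm")


def extract_entities(page_text: str):
    """Run NER on the OCR text of one page."""
    doc = load_model()(page_text)
    return group_entities((ent.text, ent.label_) for ent in doc.ents)
```

Calling `extract_entities` on the OCR text of a page returns a dict with up to three keys, which maps naturally onto node types in neo4j.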

Data preparation is the main task…

I'll make a CustomGPT soon. This way everyone can chat with the files.


Why did I do this? This looks overengineered af!

There is a pdf with 3000+ pages and the pipeline can easily handle that.

ahh and what are the other circles around that file (with the NER extraction)

Here is the prompt that I am using atm:

You are an intelligence analyst reviewing newly released JFK documents.  
Your task is to analyze the given page and evaluate how likely it is to provide new or previously unknown insights related to key unresolved questions about the JFK assassination.

Return the following as a single JSON object. 
Which must include the extracted text in full, the document_type and likelihood of the 20 categories that we want to analyse.

Each point should receive a likelihood score between 1 (no relevance) and 100 (high relevance). Also fill a "highlights" field with stuff that stands out.

Something that is worth 100 would be a clear sentence like Oswald was in Mexico or he wasn't alone.

If there is no really interesting and new stuff e.g. something that might be important in the greater scheme only like manuals or descriptions or other bureaucratic stuff then it would be more in the 3 to 5 range. 20 would already mean there is some sort of new information clearly described. And the sentence needs to be repeated in highlights.

We have ~80k documents. There can't be more than 20 highlights that are really highlights. So the probability that something is very very very important is very low.

Expected output format:

{
  "text": "[Original input text, truncated to 500 characters]",
  "document_type": "[Same as input]",
  "Unknown contacts or meetings": [1-100],
  "Secret agreements": [1-100],
  "Unpublished speeches or writings": [1-100],
  "Unknown foreign operations": [1-100],
  "Covert operations": [1-100],
  "Financial transactions": [1-100],
  "Security threats": [1-100],
  "Internal conflicts": [1-100],
  "Unknown health issues": [1-100],
  "Connections to the Mafia": [1-100],
  "Media strategies": [1-100],
  "Unreleased photos or videos": [1-100],
  "Reactions to international crises": [1-100],
  "Personal correspondence": [1-100],
  "Unknown legislative initiatives": [1-100],
  "Relationships with other heads of state": [1-100],
  "Secret surveillance or wiretapping": [1-100],
  "Unspoken speeches": [1-100],
  "Internal polling or public opinion": [1-100],
  "Unknown notes or annotations": [1-100],
  "highlights": "[Key quotes, names, dates, or findings]"
}
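Since the model's reply is only expected to follow this format, it helps to validate it before writing anything to the database. A minimal sketch, assuming the reply is a plain JSON string. `parse_scores` and `top_categories` are hypothetical helper names, and the default threshold of 20 follows the prompt's own "20 would already mean new information" rule:

```python
import json

CATEGORIES = [
    "Unknown contacts or meetings", "Secret agreements",
    "Unpublished speeches or writings", "Unknown foreign operations",
    "Covert operations", "Financial transactions", "Security threats",
    "Internal conflicts", "Unknown health issues", "Connections to the Mafia",
    "Media strategies", "Unreleased photos or videos",
    "Reactions to international crises", "Personal correspondence",
    "Unknown legislative initiatives", "Relationships with other heads of state",
    "Secret surveillance or wiretapping", "Unspoken speeches",
    "Internal polling or public opinion", "Unknown notes or annotations",
]


def parse_scores(raw: str) -> dict:
    """Parse the model's JSON reply and return clamped scores for all 20 categories."""
    data = json.loads(raw)
    scores = {}
    for name in CATEGORIES:
        value = data.get(name)
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"missing or non-numeric score for {name!r}")
        # Clamp into the 1-100 range the prompt asks for.
        scores[name] = min(100, max(1, int(value)))
    return {
        "text": data.get("text", ""),
        "document_type": data.get("document_type", ""),
        "highlights": data.get("highlights", ""),
        "scores": scores,
    }


def top_categories(scores: dict, threshold: int = 20) -> list:
    """Return (category, score) pairs at or above the threshold, highest first."""
    hits = [(name, score) for name, score in scores.items() if score >= threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

Raising on a missing category makes a malformed reply visible right away, so the page can be retried and does not end up in the graph as a silent gap.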

Would really appreciate some help in terms of “what am I even looking for???”

I have no clue, except that I know how to grab whatever information anyone wants to look for.

Here is a little diagram that shows the overall process.

flowchart LR

%% =============== Local PDF Collection ============
subgraph LocalPdfCollection["<div style='background-color:black; color:yellow; font-weight:bold; padding:6px;'>Local PDF Collection</div>"]
    direction TB
    A[FolderWithPdfFiles]
end

%% =============== Minio ============
subgraph Minio["<div style='background-color:black; color:yellow; font-weight:bold; padding:6px;'>Minio</div>"]
    direction TB
    B[Minio Bucket: 'incoming']
    C[PUT Event => AMQP message]
end

%% =============== RabbitMQ ============
subgraph RabbitMQ["<div style='background-color:black; color:yellow; font-weight:bold; padding:6px;'>RabbitMQ</div>"]
    direction TB
    D[Exchange]
    E[Queue: 'IncomingQueue']
end

%% =============== n8n ============
subgraph n8n["<div style='background-color:black; color:yellow; font-weight:bold; padding:6px;'>n8n</div>"]
    direction TB
    F[Workflow controlling data flow]
end

%% =============== Python Services ============
subgraph PythonServices["<div style='background-color:black; color:yellow; font-weight:bold; padding:6px;'>Python Services</div>"]
    direction TB
    P1[Python Service: PDF → PNG]
    P2[Python Service: Analysis & Transform]
end

%% =============== Analysis Part ============
subgraph AnalysisPart["<div style='background-color:black; color:yellow; font-weight:bold; padding:6px;'>Analysis Part</div>"]
    direction TB
    O1[Tesseract OCR]
    O2[OpenAI image analysis]
    O3[Spacy NER]
    O4[PostGIS grouping on PostgreSQL]
    O5[Cypher in Neo4j]
    O1 --> O2
    O2 --> O3
    O3 --> O4
    O4 --> O5
end

%% =============== Faiss Index ============
subgraph FaissIndex["<div style='background-color:black; color:yellow; font-weight:bold; padding:6px;'>Faiss Index</div>"]
    direction TB
    FX[Faiss for embeddings]
end

%% =============== RAG Chat ============
subgraph RagChat["<div style='background-color:black; color:yellow; font-weight:bold; padding:6px;'>RAG Chat</div>"]
    direction TB
    RC[Python Service: GPT-based RAG Chat]
end

%% =============== Databases ============
subgraph Databases["<div style='background-color:black; color:yellow; font-weight:bold; padding:6px;'>Databases</div>"]
    direction TB
    PG[PostgreSQL]
    N4J[Neo4j]
end

%% Flows
A --> B
B --> C
C --> D
D --> E
E --> F
F --> P1
F --> P2
P1 --> B

P2 --> AnalysisPart
AnalysisPart <--> PG
AnalysisPart <--> N4J

AnalysisPart --> FX
FX <--> RC
N4J --> RC
7 Likes

Methinks you're gonna win by default! :wink:

I'll take a closer look a bit later when I have some more time.

Looks impressive, though! Maybe look for Umbrella Guy? Grassy knoll?

4 Likes

The simple keyword searches wouldn't need AI. Just OCR and sed + awk... could probably do that with a one-liner on the CLI.

Hey, let's ask chatgpt to solve this with one line of code :rofl:

update

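for f in *.pdf; do
  echo "Processing: $f"
  # Render every page to temp_page-<n>.png; only OCR if rendering succeeded.
  if pdftoppm -png "$f" temp_page; then
    for img in temp_page-*.png; do
      # OCR the page and search its text case-insensitively.
      if tesseract "$img" stdout 2>/dev/null | grep -i "lee harvey oswald"; then
        page=${img#temp_page-}
        echo "  Match in $f (page ${page%.png})"
      fi
    done
  fi
  # -f avoids an error when no page images were produced.
  rm -f temp_page-*.png
done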

what I am looking for is stuff like “why was this stupid looking command required to be signed by a cia agent with a very high position”…

unusual stuff…

*ok technically OCR is also AI :face_with_peeking_eye:

2 Likes

The system is highly dynamic. It can be used for medical research, financial analysis, trading, ... pretty sure it will solve some real-world problems.

5 Likes

I find the method especially interesting. The time is certainly not wasted: even if the topic doesn't interest everyone, the technique and the knowledge will always be there. If one had to do it manually, it would definitely be too much work and time, so why not use an LLM for what it can do. There are also 1 or 2 people using their GPUs to kill zombies…
I've wanted to do this kind of data analysis for different things for decades. Now, finally, it's possible.

3 Likes

The real challenge is finding the right problems and the connections to them…

This is indeed the point I have been trying to get across with my macros… And indeed what OpenAI have been implementing:

In the age of AI everything is a one-liner… The challenge instead becomes finding the right problems and unravelling them from the mess they are in.

I guess what I'm suggesting is that a real challenge might not be showing how to delve into an abstract dataset but linking up datasets with real-world positive impact use-cases.

I mean I hope this isn't seen as 'self-promotional' in any way, maybe this is a longer term plan of OpenAI GPTs or maybe I missed something else…

But delving into HS Codes and Tariffs data or Climate Data or Conflict Data, maybe these types of projects are too challenging for the OpenAI forum, but opening up real world data sources alongside the easy code, isn't that where progress starts?

I don't come here to knock what you are doing but to challenge it. These are the sorts of things I expected on this forum when I considered it long before OpenAI existed.

While not all 'Developers' might be coders, multi-faceted 'projects' have the potential to engage a more diverse audience with OpenAI API providing the backbone to pull it all together… Make it modular, use Leader, Regular and Member offered talents and time and have OpenAI back it with compute if it's something they can back… It fits OpenAI goals of promoting their API right?

And yes I use my GPUs to kill zombies but more importantly socialize… While I use my brain to think about stuff… Back when I started gaming online I set my GPU to problems like this though soon realized that I was a raindrop in the sea… People first… Invest in skills and compute

always thought there was never an assassin, always thought it was the secret service agent dropping his rifle 20-odd metres away from the car and the whole thing was a cover-up to hide the incompetence and preserve face internationally.

1 Like

Who knows but maybe they just mined bitcoin with your GPU.

They didn't stop and trained LLMs → and greedy Zucky wants to use user phones :grin: (for the greater good :money_mouth_face:)

That would have been smart, this was 5 years before Bitcoin :smiley:

I am sure that bitcoin because of its useless energy consumption is directly responsible for millions of climate change related deaths in the upcoming famines.

From my perspective there is nothing smart in it. It was and is and will ever be a scam scheme.

However - there is one thing worse than that: posting bullshit in a thread about jfk files.

Why don't you post your own ideas on how to solve it technically or ask for the deletion of your account?

I truly believe there are exactly two types of posts acceptable in this community:

  1. posts about technical solutions, problems and help in that regard
  2. domain specific support (e.g. when there is something like this jfk file analysis you can add your domain specific knowledge about the topic)

Yo wtf. Seriously what is wrong with you?

I'm sorry Jochen, like I said I wasn't trying to knock what you were doing, I just thought it would be nice to see a practical solution from your expertise.

I only ever used one forum before and members and regulars would post challenging ideas and the leaders would discuss them with the members.

This is more of a top down forum I think, there were only Leaders, Regulars and 2 blocked members posts in the thread so I joined in.

I read the '(Unofficial) Weekend Project' slash Hackathon as that… So posted some ideas above.

I did contribute with the ChatGPT PDF Image Translator above…

Then send me a private message or open a new topic like ā€œmy 2 cents about the hackathon threadā€

Well and everyone else's of course… Weekend Projects…

Yes, of course everyone else. Why would you start an offtopic discussion on everything that pops up?

Or even better search for a post that someone started that relates to what you want to say. How about an own category for guys like you with the title “unrelated stuff”?

And they kicked you out too?

The title suggests this is on topic

Weekend Project Hackathon 1:

Are these top down decided projects or user contributed forum discussion?

Historically I have been encouraged to start threads like this… Now you're saying I'm not welcome to join in?