[help] PDF Parsing strategies (maintaining context)

pmshadow · July 4, 2023, 11:16pm

Hi, I am trying to parse several PDF files while preserving the overall structure of each file for context, such as “Headers”, “Subheaders”, “normal text”, etc.

I have searched thoroughly for Python libraries and examples, but no library is ‘perfect’.

I am interested in extracting text from PDF files such as companies' financial releases.
Example 1 (“easy”): https://api.mziq.com/mzfilemanager/v2/d/c8182463-4b7e-408c-9d0f-42797662435e/b3afaa43-7b25-1c5d-20de-16d7924c0200?origin=1

Example 2 (“hard”): https://api.mziq.com/mzfilemanager/v2/d/13154776-9416-4fce-8c46-3e54d45b03a3/d3355eff-7730-5a46-193b-5acb976033a8?origin=1

Does anyone have a “gold standard” for processing this sort of file who could help a brother out?

Thanks,

EricGT · July 4, 2023, 11:36pm

Define perfect.

Did you search this site for PDF?

I know there is one topic that is very similar to this and has a few suggestions, one of which I noted.

See:

Accurately read PDF files?

agsillvaa · July 5, 2023, 12:40am

Hello! Good evening.

How are you?

Your idea sounds quite interesting.

Please give more detail about it, as I could not fully understand it.

Regards,
agsillvaa

pmshadow · July 5, 2023, 2:24am

Perfect in the sense that it preserves the hierarchy of elements in the text.
For instance, a book chapter titled “how to find the answer for problem x” is probably more important than a line of body text reading “how to find the answer for problem x”.

Thanks for sending the other topics. I had seen them but was looking for something “fresher”.

pmshadow · July 5, 2023, 2:26am

See my post below. Basically, I am looking for a way to process PDFs that preserves the overall structure of the text and is the “recommended” one.

If we completely disregard the structure of the text, we lose a lot of information. Cheers
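
To make the "hierarchy of elements" idea concrete, here is a minimal sketch of one common heuristic: label each piece of text by its font size relative to the body text. It is only an illustration under stated assumptions, not a recommended or "gold standard" method. It assumes the text has already been extracted as `(text, font_size)` pairs (for example, PyMuPDF reports a `size` for each span in `page.get_text("dict")`); the function name `label_spans` and the sample data are hypothetical.

```python
from collections import Counter


def label_spans(spans):
    """Label (text, font_size) pairs as 'header', 'subheader', or 'text'.

    Heuristic (an assumption, not a standard): the font size that covers
    the most characters is the body size; the largest size above it is a
    header, and any other size above it is a subheader.
    """
    if not spans:
        return []

    # Weight each font size by how many characters use it.
    weights = Counter()
    for text, size in spans:
        weights[round(size, 1)] += len(text)
    body_size = weights.most_common(1)[0][0]

    # Distinct sizes above the body size, largest first.
    larger = sorted(
        {round(size, 1) for _, size in spans if round(size, 1) > body_size},
        reverse=True,
    )

    labeled = []
    for text, size in spans:
        size = round(size, 1)
        if size <= body_size:
            label = "text"
        elif size == larger[0]:
            label = "header"
        else:
            label = "subheader"
        labeled.append((label, text))
    return labeled


# Hypothetical sample: the same phrase as a chapter title and inside body text.
sample = [
    ("how to find the answer for problem x", 18.0),
    ("Background", 13.0),
    ("This paragraph mentions how to find the answer for problem x "
     "in passing, as ordinary body text.", 10.0),
]

for label, text in label_spans(sample):
    print(f"{label:9} | {text}")
```

Font size alone will likely struggle on the "hard" example, where layout, bold weight, or tables carry the structure, so treat this as a starting point rather than a solution.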
