[help] PDF Parsing strategies (maintaining context)

Hi, I am trying to parse several PDF files while preserving each file's overall structure for context: headers, subheaders, normal text, and so on.

I have searched thoroughly for Python libraries and examples, but no library is “perfect”.

I am interested in extracting text from PDF files such as companies' financial releases.

Example 1 (“easy”): https://api.mziq.com/mzfilemanager/v2/d/c8182463-4b7e-408c-9d0f-42797662435e/b3afaa43-7b25-1c5d-20de-16d7924c0200?origin=1

Example 2 (“hard”): https://api.mziq.com/mzfilemanager/v2/d/13154776-9416-4fce-8c46-3e54d45b03a3/d3355eff-7730-5a46-193b-5acb976033a8?origin=1

Does anyone have a “gold standard” approach for processing this sort of file who could help a brother out?

Thanks,

Define perfect.


Did you search this site for PDF?

I know there is one topic that is very similar to this and has a few suggestions, one of which I noted.

See:

Accurately read PDF files?

Hello! Good evening.

How are you?

Your idea sounds quite interesting.

Please describe it in more detail, as I could not fully understand it.

Regards,
agsillvaa

Perfect in the sense that it preserves the hierarchy of elements in the text.
For instance, a book chapter titled “how to find the answer for problem x” is probably more important than a body-text line that reads “how to find the answer for problem x”.

Thanks for pointing me to the other topic. I had seen it, but I was looking for something “fresher”.

See my text here below. Basically, it is some way of processing PDFs that keeps the overall structure of the text and is the “recommended” one.

If we disregard the structure of the text entirely, we lose a lot of information. Regards
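
To make the idea of "preserving hierarchy" concrete, here is a minimal sketch of one common heuristic: treat the most frequent font size as body text and rank anything larger as a heading. It is not a gold standard, only an illustration. It assumes that some extractor has already produced `(text, font_size)` pairs in reading order (for example, PyMuPDF's `page.get_text("dict")` exposes a `size` and `text` for each span). The function name `label_spans` and the sample data are hypothetical, and a layout like the "hard" example would likely need more signals (bold flags, position, tables).

```python
from collections import Counter


def label_spans(spans):
    """Label each (text, font_size) pair as 'h1', 'h2', ... or 'text'.

    Heuristic (an assumption, not a standard): the font size that covers
    the most characters is body text; larger sizes are headings, with the
    largest size mapped to 'h1'.
    """
    if not spans:
        return []

    # Weight each font size by how many characters use it.
    weights = Counter()
    for text, size in spans:
        weights[round(size, 1)] += len(text)
    body_size = weights.most_common(1)[0][0]

    # Sizes above the body size become heading levels, largest first.
    heading_sizes = sorted((s for s in weights if s > body_size), reverse=True)
    level = {s: f"h{i}" for i, s in enumerate(heading_sizes, start=1)}

    # Anything at or below the body size (e.g. footnotes) stays 'text'.
    return [(level.get(round(size, 1), "text"), text) for text, size in spans]


# Made-up sample spans standing in for real extractor output.
spans = [
    ("Quarterly Results", 18.0),
    ("Revenue", 14.0),
    ("Revenue grew compared with the previous quarter.", 10.0),
    ("Costs", 14.0),
    ("Operating costs were stable over the period.", 10.0),
]
labeled = label_spans(spans)
for label, text in labeled:
    print(f"{label}: {text}")
```

Keeping the label next to each piece of text means a later step can group body text under its nearest heading instead of flattening everything into one string.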