Hi, I am trying to parse several PDF files while preserving the overall structure of each file for context, such as “Headers”, “Subheaders”, “normal text”, etc.
I have searched thoroughly for Python libraries and examples, but no library is ‘perfect’.
I am interested in extracting text from PDF files such as financial releases from companies.
Example 1 (“easy”): https://api.mziq.com/mzfilemanager/v2/d/c8182463-4b7e-408c-9d0f-42797662435e/b3afaa43-7b25-1c5d-20de-16d7924c0200?origin=1
Example 2 (“hard”): https://api.mziq.com/mzfilemanager/v2/d/13154776-9416-4fce-8c46-3e54d45b03a3/d3355eff-7730-5a46-193b-5acb976033a8?origin=1
Does anyone have a “gold standard” for processing this sort of file who could help a brother out?
Thanks,
EricGT
Define perfect.
Did you search this site for PDF?
I know there is one topic that is very similar to this and has a few suggestions, one of which I noted.
See:
Accurately read PDF files?
Hello! Good evening.
How are you?
Your idea seems quite interesting.
Please give more detail about it, since I could not fully understand it.
Regards,
agsillvaa
Perfect in the sense that it preserves the hierarchy of elements in the text.
For instance, a book chapter titled “how to find the answer for problem x” is probably more important than an ordinary line of text that reads “how to find the answer for problem x”.
Thanks for sending the other topic. I had seen it but was looking for something “fresher”.
See my text below. Basically, it is some way of processing PDFs that keeps the overall structure of the text and is the “recommended” one.
If we completely disregard the structure of the text, we lose a lot of information. Regards