Accurately read PDF files?

Short answer:

Try


TL;DR

I know the link mentions HTML5, but in reality it is much more than that.

See my related StackOverflow answer for a different question that gives more details.

Years ago I started parsing PDF files and quickly learned that they are not data files but programs based on PostScript. A PDF file is better understood as an archive file, think zip file, that contains a directory of PostScript, resources, etc. Halfway through the project I found some open source software that would at least correctly extract all of the data I needed, but the output still had to go through a transformation step that I had to write myself.
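To make the archive analogy a bit more concrete, here is a minimal Python sketch that lists the numbered objects found in a PDF's raw bytes. It assumes the objects are stored uncompressed in the `N G obj ... endobj` form; the function name `list_objects` and the tiny `SAMPLE` are made up for illustration, and it is not a real parser.

```python
import re

# Matches the header of an indirect object: "<number> <generation> obj".
OBJ_HEADER = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")


def list_objects(pdf_bytes):
    """Return (object number, generation) pairs found in raw PDF bytes.

    Hypothetical helper: a plain byte scan, not a parser. Objects packed
    into compressed object streams will not show up here.
    """
    return [(int(n), int(g)) for n, g in OBJ_HEADER.findall(pdf_bytes)]


# Hand-written fragment that only imitates the layout of a PDF body.
SAMPLE = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [3 0 R] /Count 1 >>\nendobj\n"
    b"3 0 obj\n<< /Type /Page /Parent 2 0 R >>\nendobj\n"
)

print(list_objects(SAMPLE))
```

On a real file you would pass in the bytes read with `open(path, "rb")`. Expect this to miss a lot on newer PDFs, which is part of why I ended up leaning on existing software for the extraction.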

Over the years it seems the true test of any PDF-to-text converter is getting tables to convert correctly. Remember that many tables are not text but some form of PostScript, and it seems there are as many ways to make a table using PostScript as there are beverage options in a grocery store.
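For tables, the transformation step tends to work on positioned text fragments, since nothing in the file is labeled as a table. Here is a rough Python sketch, assuming some extractor has already given you `(x, y, text)` fragments. The tuple layout, the `tolerance` parameter, and the name `rows_from_fragments` are all hypothetical, and this is not what any particular tool does.

```python
def rows_from_fragments(fragments, tolerance=2.0):
    """Group (x, y, text) fragments into rows by similar y, then order by x.

    Hypothetical helper: assumes default PDF user space, where y grows
    upward, so a larger y means higher on the page.
    """
    rows = []  # each entry: [reference_y, [(x, text), ...]]
    # Sort by descending y so rows come out top to bottom.
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y) <= tolerance:
            # Close enough to the current row's baseline: same row.
            rows[-1][1].append((x, text))
        else:
            rows.append([y, [(x, text)]])
    # Within each row, order the cells left to right.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Made-up fragments, deliberately out of order with slightly uneven baselines.
fragments = [
    (200.0, 700.2, "Price"), (50.0, 700.0, "Item"),
    (50.0, 680.1, "Tea"), (200.0, 680.0, "2.50"),
]

print(rows_from_fragments(fragments))
```

This only covers the friendly case. Wrapped cell text, merged cells, and tables drawn purely with lines all need more work, which is why tables make such a good test.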
