Open sourcing our PDF parsing library

Most open source libraries for chunking documents into pieces that can be fed to OpenAI’s models aren’t particularly good: they just split the text into N-character chunks, ignoring the document’s structure.
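To make the problem concrete, here is a minimal sketch of that naive approach. The function name `naive_chunks` and the sample text are made up for illustration and are not taken from any particular library.

```python
def naive_chunks(text: str, n: int) -> list[str]:
    """Split text into consecutive pieces of at most n characters.

    Illustrative only: this is the naive strategy, not our parser.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    # Step through the string n characters at a time, ignoring structure.
    return [text[i:i + n] for i in range(0, len(text), n)]


# Hypothetical sample: a tiny Markdown table followed by a sentence.
sample = "| Item | Price |\n| Pen | 2 USD |\n\nTotal is 2 USD."

for chunk in naive_chunks(sample, 20):
    print(repr(chunk))
```

With a chunk size of 20, the first boundary lands in the middle of the second table row, so no single chunk contains that row intact. A structure-aware parser can keep tables, headings, and paragraphs together instead.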

OpenAI has its Assistants API, but it’s a black box, which makes advanced querying difficult.

We’ve parsed millions of documents through our website and have decided to open source our solution so others can contribute. We hope you find it useful!

Features:

  • File-to-Markdown conversion
  • Tables to HTML
  • Bounding boxes for everything