Doctra: OpenAI-powered PDF parsing to Markdown, HTML & Excel

Doctra is an open-source toolkit that turns PDFs into structured data using layout analysis, OCR, and Vision LMs (VLMs). It extracts text, tables, and charts/figures, then exports Markdown, HTML, and Excel. Includes a CLI, Python API, Gradio UI, and optional page restoration to boost accuracy on noisy scans.

Install

pip install doctra

Why Doctra?

Most PDFs aren’t designed for reuse. Doctra aims to make real-world documents (scans, multi-column layouts, noisy reports) immediately usable, with no manual copy/paste. For low-quality scans, enable page restoration before OCR to improve downstream extraction.


What it does

  • Layout detection → segments pages

  • OCR → Tesseract

  • Vision LMs → interpret tables and charts into structured text/JSON

  • Page restoration (optional) → denoises/deblurs scanned pages

  • Exports → Markdown, HTML, Excel (one sheet per table)

  • Interfaces → CLI • Python API • Gradio UI

  • Providers → OpenAI, Gemini, Anthropic, OpenRouter, Ollama (more coming)
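
The stages listed above can be sketched as a simple chain. This is a self-contained toy for illustration only; the function and class names here are stand-ins, not Doctra's real API, and the real pipeline operates on page images rather than strings:

```python
from dataclasses import dataclass

@dataclass
class Block:
    kind: str      # "text", "table", "chart", or "figure"
    content: str

def detect_layout(page_text: str) -> list[Block]:
    # Stand-in for layout detection: the real step segments page images
    # into typed regions before OCR/VLM processing.
    return [Block("text", line) for line in page_text.splitlines() if line]

def export_markdown(blocks: list[Block]) -> str:
    # Text blocks become paragraphs; in a real exporter, table and chart
    # blocks would become Markdown tables or structured descriptions.
    return "\n\n".join(b.content for b in blocks)

page = "Quarterly report\nRevenue grew 12%"
md = export_markdown(detect_layout(page))
```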


Example outputs

  • Tables → Markdown/HTML + Excel workbook

  • Charts → concise descriptions + structured data when recoverable

  • Text → clean Markdown/HTML (headings, paragraphs, lists)


Who is it for?

  • Data engineers turning reports into datasets

  • Researchers/analysts extracting tables/charts from papers

  • Back-office teams processing invoices, statements, contracts

  • Builders who want a quick CLI/UI and a Python API to integrate


Looking for feedback :sparkles:

I’d love help pressure-testing the pipeline on:

  • Messy scans (skew, noise, low DPI)

  • Complex tables (merged headers, multi-line cells)

  • Charts/figures that should yield structured outputs

  • Ergonomics (CLI flags, Python options, Gradio UX)

If you can share non-sensitive PDFs (or synthetic examples), I’ll try them and report results. Issues/PRs also welcome!

  • :star: If Doctra helps you, a star would mean a lot!

Note on data/privacy: Please avoid posting sensitive PDFs. Use providers consistent with their terms and your data policies.

Question (as someone who has built something like this a couple of times for different projects): how do you deal with images in the PDF? Do you ignore them, or do you send them to a vision model?

Good luck, I’ll even give you a sliced page for consumption, larger than the input of any vision model!


Here’s paid competition with a head start, though not with the power of AI: Conversion Gallery | Mathpix.

We analyze the page layout to classify each image block as a table, a chart, or a generic figure. Tables and charts are sent to a vision model to recover structure and data (exported to Markdown/HTML/Excel), while generic figures are exported to a figures folder.
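
That routing can be sketched roughly as below. This is a self-contained toy under assumed names (`route_block`, `describe_with_vlm` are hypothetical); the real pipeline dispatches image crops, not strings:

```python
from pathlib import Path

def describe_with_vlm(block_content: str) -> str:
    # Stand-in for a vision-model call that recovers table/chart structure.
    return f"structured:{block_content}"

def route_block(label: str, content: str, figures_dir: Path) -> str:
    """Dispatch a classified image block as described above."""
    if label in ("table", "chart"):
        # Tables and charts go to the VLM for structure/data recovery.
        return describe_with_vlm(content)
    # Generic figures are simply saved to a figures folder.
    out = figures_dir / "figure_001.png"
    return f"saved:{out.name}"

figures = Path("figures")
results = [route_block(lbl, c, figures)
           for lbl, c in [("table", "t1"), ("chart", "c1"), ("figure", "f1")]]
```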

Appreciate the challenge! I’ll run Doctra on your sliced pages and share the outputs here.


Very cool! Are you thinking of creating an MCP server for the ChatGPT app store, or is the plan to just keep it as an open-source repo?

Not planning a dedicated MCP server for the ChatGPT app store right now. Doctra is intentionally provider-agnostic (OpenAI, Anthropic, Gemini, Ollama, OpenRouter), and I want to keep the core repo that way.

That said, I am going to expose Doctra via MCP using FastMCP.
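
A minimal sketch of what such a FastMCP wrapper could look like. The `parse_pdf` body here is a placeholder, not Doctra's real entry point, and the import is guarded so the sketch degrades gracefully if `fastmcp` isn't installed:

```python
def parse_pdf(path: str, output_format: str = "markdown") -> str:
    """Placeholder for Doctra's pipeline: layout -> OCR -> VLM -> export."""
    return f"parsed {path} as {output_format}"

try:
    from fastmcp import FastMCP  # pip install fastmcp
except ImportError:
    FastMCP = None

if FastMCP is not None:
    mcp = FastMCP("doctra")

    @mcp.tool()
    def parse_pdf_tool(path: str, output_format: str = "markdown") -> str:
        """Expose PDF parsing as an MCP tool."""
        return parse_pdf(path, output_format)

    # To serve the tool over stdio:
    # mcp.run()
```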