Doctra is an open-source toolkit that turns PDFs into structured data using layout analysis, OCR, and Vision LMs (VLMs). It extracts text, tables, and charts/figures, then exports Markdown, HTML, and Excel. It includes a CLI, a Python API, a Gradio UI, and optional page restoration to improve accuracy on noisy scans.
Install
pip install doctra
Why Doctra?
Most PDFs aren’t designed for reuse. Doctra aims to make real-world documents (scans, multi-column layouts, noisy reports) immediately usable, with no manual copy/paste. For low-quality scans, enable page restoration before OCR to improve downstream extraction.
What it does
- Layout detection → segments pages
- OCR → Tesseract
- Vision LMs → interpret tables and charts into structured text/JSON
- Page restoration (optional) → denoises/deblurs scanned pages
- Exports → Markdown, HTML, Excel (one sheet per table)
- Interfaces → CLI • Python API • Gradio UI
- Providers → OpenAI, Gemini, Anthropic, OpenRouter, Ollama (more coming)
Example outputs
- Tables → Markdown/HTML + Excel workbook
- Charts → concise descriptions + structured data when recoverable
- Text → clean Markdown/HTML (headings, paragraphs, lists)
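To give a feel for the "Tables → Markdown" output, here is a minimal sketch of rendering an already-extracted table as a Markdown table. This is not Doctra's API or its actual exporter: the function name `table_to_markdown` and the sample data are hypothetical, chosen only for illustration, and it uses nothing beyond the Python standard library.

```python
def table_to_markdown(header, rows):
    """Render a header and rows as a GitHub-style Markdown table.

    Hypothetical helper for illustration only; not part of Doctra.
    """
    def fmt(cells):
        # Escape pipes so cell text cannot break the column layout.
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"

    # Header row, then the separator row, then one line per data row.
    lines = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up sample data standing in for a table recovered from a PDF.
    print(table_to_markdown(["Quarter", "Revenue"], [["Q1", 120], ["Q2", 135]]))
```

Real tables with merged headers or multi-line cells need more handling than this; the sketch only covers a flat header and flat rows.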
Who is it for?
- Data engineers turning reports into datasets
- Researchers/analysts extracting tables/charts from papers
- Back-office teams processing invoices, statements, contracts
- Builders who want a quick CLI/UI and a Python API to integrate
Looking for feedback 
I’d love help pressure-testing the pipeline on:
- Messy scans (skew, noise, low DPI)
- Complex tables (merged headers, multi-line cells)
- Charts/figures that should yield structured outputs
- Ergonomics (CLI flags, Python options, Gradio UX)
If you can share non-sensitive PDFs (or synthetic examples), I’ll try them and report results. Issues/PRs also welcome!
If Doctra helps you, a star would mean a lot!
Note on data/privacy: Please avoid posting sensitive PDFs, and use model providers in line with their terms and your own data policies.


