Using gpt-4 API to Semantically Chunk Documents

We’ve been following this discussion on semantic chunking with great interest — we’ve encountered similar challenges ourselves.

To address them, we developed a more robust tool for semantic chunking called PageIndex, which we’ve open-sourced (GitHub repo: github.com/VectifyAI/PageIndex). PageIndex is a hierarchical indexing system that transforms large documents — such as financial reports, regulatory filings, or textbooks — into semantic trees optimized for reasoning-based RAG.

Key features:

Hierarchical Structure: Organizes lengthy PDFs into LLM-friendly trees — like a smart table of contents.

Precise Referencing: Each node includes a summary and exact physical page numbers.

Natural Segmentation: Nodes align with document sections, preserving context — no arbitrary chunking.
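To make the three features above concrete, here is a minimal sketch of what one node of such a tree might look like. The `Node` class, its field names, and the sample report are all hypothetical and chosen for illustration; this is not PageIndex's actual output schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One document section (hypothetical schema, not PageIndex's real format)."""
    title: str
    summary: str
    start_page: int  # first physical page of the section
    end_page: int    # last physical page of the section, inclusive
    children: List["Node"] = field(default_factory=list)


def outline(node: Node, depth: int = 0) -> List[str]:
    """Render the tree as indented table-of-contents lines."""
    lines = [f"{'  ' * depth}{node.title} (pp. {node.start_page}-{node.end_page})"]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


# Made-up example document; titles and page ranges are illustrative only.
report = Node("Annual Report", "Full-year results and disclosures.", 1, 120, [
    Node("Risk Factors", "Material risks to the business.", 10, 39),
    Node("Financial Statements", "Balance sheet, income statement, cash flows.", 40, 85, [
        Node("Income Statement", "Revenue, expenses, and net income.", 45, 52),
    ]),
])

print("\n".join(outline(report)))
```

Each node carries its own summary and page range, so a retrieved section can be cited by page without re-reading the whole PDF.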

Beyond semantic chunking, we found that these semantic trees also support reasoning-based RAG: an LLM navigates the tree to identify relevant nodes, rather than relying solely on semantic similarity as typical RAG systems do with vector DBs. We’ve used PageIndex with reasoning-based RAG for financial document analysis and saw significant improvements in retrieval accuracy (98.7% on FinanceBench), without using vector DBs.
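A rough sketch of the navigation idea follows. It is self-contained and is not PageIndex's implementation: `Section`, `navigate`, and `keyword_selector` are hypothetical names, and the keyword-overlap selector is only a stand-in for the step where an LLM would read the child summaries and decide which branches to open.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Section:
    """Hypothetical tree node: a titled section with a summary and page range."""
    title: str
    summary: str
    pages: Tuple[int, int]  # (first_page, last_page)
    children: List["Section"] = field(default_factory=list)


# A selector sees the query and a node's children and returns the ones worth opening.
Selector = Callable[[str, List[Section]], List[Section]]


def _words(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_selector(query: str, candidates: List[Section]) -> List[Section]:
    """Stand-in for the LLM call: keep children sharing a word with the query."""
    q = _words(query)
    return [c for c in candidates if q & _words(c.title + " " + c.summary)]


def navigate(root: Section, query: str, select: Selector) -> List[Section]:
    """Walk down from the root, returning the deepest sections the selector picks."""
    hits, frontier = [], [root]
    while frontier:
        node = frontier.pop()
        chosen = select(query, node.children) if node.children else []
        if chosen:
            frontier.extend(chosen)
        else:
            hits.append(node)  # leaf, or nothing below looks relevant
    return hits


# Made-up example tree; titles and page ranges are illustrative only.
doc = Section("Annual Report", "Full-year results and disclosures.", (1, 120), [
    Section("Risk Factors", "Material risks to the business.", (10, 39)),
    Section("Financial Statements", "Balance sheet, income statement, cash flows.", (40, 85), [
        Section("Balance Sheet", "Assets, liabilities, and equity.", (41, 44)),
        Section("Income Statement", "Revenue, expenses, and net income.", (45, 52)),
    ]),
])

for hit in navigate(doc, "net income", keyword_selector):
    print(hit.title, hit.pages)
```

In a real setup the selector would presumably wrap a prompt containing the query and the child titles and summaries. Keeping it as a plain callable means the traversal logic stays the same whichever model is plugged in.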

Would love any feedback — especially thoughts on reasoning-based RAG, or ideas for where PageIndex could be applied!
