Using gpt-4 API to Semantically Chunk Documents

rejojer · April 1, 2025, 12:58pm

We’ve been following this discussion on semantic chunking with great interest — we’ve encountered similar challenges ourselves.

To address them, we developed a more robust tool for semantic chunking called PageIndex, which we’ve open-sourced (GitHub repo: github.com/VectifyAI/PageIndex). PageIndex is a hierarchical indexing system that transforms large documents — such as financial reports, regulatory filings, or textbooks — into semantic trees optimized for reasoning-based RAG.

Key features:

• Hierarchical Structure: Organizes lengthy PDFs into LLM-friendly trees — like a smart table of contents.

• Precise Referencing: Each node includes a summary and exact physical page numbers.

• Natural Segmentation: Nodes align with document sections, preserving context — no arbitrary chunking.
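
To make the features above concrete, here is a minimal sketch of what one node of such a tree could look like. It is an illustration under assumptions, not PageIndex's actual schema: the names `SectionNode` and `render_outline` and the field layout are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SectionNode:
    """One document section (hypothetical layout, not the PageIndex schema)."""
    title: str
    summary: str
    start_page: int  # first physical page of the section
    end_page: int    # last physical page of the section
    children: List["SectionNode"] = field(default_factory=list)


def render_outline(node: SectionNode, depth: int = 0) -> List[str]:
    """Return the tree as indented table-of-contents lines."""
    lines = [f"{'  ' * depth}{node.title} (pp. {node.start_page}-{node.end_page})"]
    for child in node.children:
        lines.extend(render_outline(child, depth + 1))
    return lines


# Made-up example document.
report = SectionNode(
    "Annual Report", "Whole document", 1, 100,
    [SectionNode("Risk Factors", "Key risks", 5, 20)],
)
print("\n".join(render_outline(report)))
```

Because each node carries its own page range, any node an LLM picks can be mapped straight back to the pages it covers.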

Beyond semantic chunking, we found that these semantic trees also enable reasoning-based RAG: an LLM navigates the tree to identify relevant nodes, rather than relying solely on semantic similarity as typical RAG systems do with vector DBs. We’ve used PageIndex with reasoning-based RAG for financial document analysis and saw significant improvements in retrieval accuracy (98.7% on FinanceBench), without using a vector DB.
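
The navigation idea can be sketched roughly as follows. This is a simplified illustration, not how PageIndex is implemented: `Node`, `navigate`, and `keyword_chooser` are hypothetical names, and the keyword matcher is only a stand-in for the LLM call that would decide which child sections look relevant.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """Hypothetical tree node: a section with a summary and page range."""
    title: str
    summary: str
    start_page: int
    end_page: int
    children: List["Node"] = field(default_factory=list)


# Assumed interface: given a question and candidate nodes, return the relevant ones.
Chooser = Callable[[str, List[Node]], List[Node]]


def keyword_chooser(question: str, candidates: List[Node]) -> List[Node]:
    """Stand-in for an LLM: keep nodes that share a word with the question."""
    words = set(question.lower().split())
    return [
        n for n in candidates
        if words & set((n.title + " " + n.summary).lower().split())
    ]


def navigate(root: Node, question: str, choose: Chooser) -> List[Node]:
    """Descend from the root, following only children the chooser selects."""
    if not root.children:
        return [root]
    hits: List[Node] = []
    for child in choose(question, root.children):
        hits.extend(navigate(child, question, choose))
    return hits


# Made-up example tree.
tree = Node("Annual Report", "Full report", 1, 100, [
    Node("Financial Statements", "balance sheet income revenue", 10, 60, [
        Node("Income Statement", "revenue and expenses", 10, 30),
        Node("Balance Sheet", "assets and liabilities", 31, 60),
    ]),
    Node("Risk Factors", "market and regulatory risk", 61, 100),
])

for hit in navigate(tree, "what was total revenue", keyword_chooser):
    print(hit.title, hit.start_page, hit.end_page)
```

In a real setup, `choose` would prompt an LLM with the question plus the candidate titles and summaries. No embeddings or vector DB appear anywhere in the loop.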

Would love any feedback — especially thoughts on reasoning-based RAG, or ideas for where PageIndex could be applied!
