By “methodology”, I am referring to one of the methods described in the videos I posted here: Using gpt-4 API to Semantically Chunk Documents - #112 by SomebodySysop
- Level 1: Character Splitting - Simple static character chunks of data
- Level 2: Recursive Character Text Splitting - Recursive chunking based on a list of separators
- Level 3: Document Specific Splitting - Various chunking methods for different document types (PDF, Python, Markdown)
- Level 4: Semantic Splitting - Embedding-walk-based chunking
- Level 5: Agentic Splitting - Experimental method of splitting text with an agent-like system. Worth considering if you believe that token costs will trend toward $0.00
- Bonus Level: Alternative Representation Chunking + Indexing - Derivative representations of your raw text that will aid in retrieval and indexing
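
To make the first two levels concrete, here is a minimal sketch of Level 1 and Level 2 in plain Python. This is my own illustration, not the code from the videos: the function names `char_split` and `recursive_split`, the default separator list, and the `overlap` parameter are all assumptions chosen for the example.

```python
from typing import List, Sequence


def char_split(text: str, chunk_size: int, overlap: int = 0) -> List[str]:
    """Level 1 (sketch): fixed-size character chunks with optional overlap."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def recursive_split(
    text: str,
    chunk_size: int,
    separators: Sequence[str] = ("\n\n", "\n", " "),  # assumed order: coarse to fine
) -> List[str]:
    """Level 2 (sketch): split on the coarsest separator first, then
    recurse with finer separators on any piece that is still too large."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to Level 1 hard cuts.
        return char_split(text, chunk_size)

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    print(char_split("abcdefghij", 4))
    print(recursive_split("aa bb\n\ncc dd ee", 5))
```

The difference shows up in where the cuts land: Level 1 cuts at a fixed offset regardless of content, while Level 2 prefers paragraph breaks, then line breaks, then spaces, and only falls back to hard cuts when nothing else fits. Levels 3 through 5 keep that same idea and swap in smarter boundary detection (document structure, embedding distance, or an LLM).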