Using the GPT-4 API to Semantically Chunk Documents

So, here is where I am so far. I’ve come up with an instruction prompt and an output format that give me a JSON file that, in theory, will let me create my sub-chunks. You will note that the main difference between this and @sergeliatko 's output is that his includes the content already chunked. For my solution to do that, you’d need to be working with a context window of around 1M tokens (assuming you are working with fairly large documents).

But, as I’ve stated before, I see my solution as more of a “last mile” effort: semantically sub-chunking the chunks that were produced by an initial semantic-chunking pass.
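That two-pass idea can be sketched as a small orchestration loop. This is only a shape, not a working pipeline: `get_structure` and `chunk_from_structure` are hypothetical callables standing in for the API call and the local JSON-driven splitter.

```python
def sub_chunk_pipeline(coarse_chunks, get_structure, chunk_from_structure):
    """Hypothetical 'last mile' pass: each coarse chunk from the initial
    semantic chunker is sent to the model (get_structure) for a structural
    JSON, then split locally (chunk_from_structure)."""
    sub_chunks = []
    for chunk in coarse_chunks:
        structure = get_structure(chunk)  # one API call per coarse chunk
        sub_chunks.extend(chunk_from_structure(chunk, structure))
    return sub_chunks
```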

Task: Analyze the following document and generate a nested JSON representation of its hierarchical structure.

Document:

(Paste the full text of the document here)

Output Format:

The JSON output should be an array of objects, where each object represents a segment of the document’s structure and has the following properties:

  • title: The title of the segment, including any preceding numbering or lettering (e.g., “I. Preamble”, “A. Recognition”).
  • title_path: The full hierarchical path to the segment, incorporating the titles of all parent segments, with each level separated by " - " (e.g., “III. General Provisions - 1. Recognition and Scope of Agreement”).
  • level: The level of the segment in the hierarchy (1 for main sections, 2 for subsections, etc.).
  • has_children: “Y” if the segment has child segments, “N” otherwise.
  • token_count: The approximate number of tokens in the segment’s text.
  • first_few_words: The first few words of the segment’s text (at least 5 words).
  • last_few_words: The last few words of the segment’s text (at least 5 words).
  • children (optional): If the segment has child segments, this property should contain a nested array of child segment objects following the same format.

Additional Instructions:

  • Use a reliable tokenization method to determine token counts.
  • Ensure accurate identification of segment boundaries based on the document’s headings and structure.
  • Pay attention to compound words and special characters during tokenization.
  • Maintain the hierarchical relationships between segments by nesting child elements within their parent segments.

Example Output:

(Provide a small example of the expected JSON output)

	[
	  {
		"title": "I. Preamble",
		"title_path": "I. Preamble", 
		"level": 1,
		"has_children": "N",
		"token_count": 23,
		"first_few_words": "PRODUCER – SAG-AFTRA CODIFIED BASIC AGREEMENT",
		"last_few_words": "\"Producer\" and collectively referred to as \"Producers.\""
	  },
	  {
		"title": "II. Witnesseth",
		"title_path": "II. Witnesseth", 
		"level": 1,
		"has_children": "N",
		"token_count": 13,
		"first_few_words": "WITNESSETH: In consideration of the mutual",
		"last_few_words": "agreements hereinafter contained, it is agreed as follows:"
	  },
	  {
		"title": "III. General Provisions",
		"title_path": "III. General Provisions",
		"level": 1,
		"has_children": "Y",
		"token_count": 15432,
		"first_few_words": "1. RECOGNITION AND SCOPE OF AGREEMENT",
		"last_few_words": "to coding; It shall study and review the appropriateness of the Section relating to per diem rates;",
		"children": [
		  {
			"title": "1. Recognition and Scope of Agreement",
			"title_path": "III. General Provisions - 1. Recognition and Scope of Agreement", 
			"level": 2,
			"has_children": "Y",
			"token_count": 438,
			"first_few_words": "A. Recognition The Union is recognized",
			"last_few_words": "Part II shall apply to background actors employed in the New York Zone.", 
			"children": [
			  {
				"title": "A. Recognition",
				"title_path": "III. General Provisions - 1. Recognition and Scope of Agreement - A. Recognition", 
				"level": 3,
				"has_children": "N",
				"token_count": 287,
				"first_few_words": "The Union is recognized by Producer as",
				"last_few_words": "and body doubles. Background actors are not considered \"performers.\""
			  },
			  {
				"title": "B. Scope", 
				"title_path": "III. General Provisions - 1. Recognition and Scope of Agreement - B. Scope", 
				"level": 3,
				"has_children": "N",
				"token_count": 151,
				"first_few_words": "(1) When Producer has its base of",
				"last_few_words": "Las Vegas, Sacramento, San Diego, San Francisco and Hawaii Zones. Only the provisions"
			  }
			]
		  }
		  // ... (and so on for other children) ... 
		] 
	  } 
	  // ...
	]
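On the receiving side, the schema above can be mirrored and sanity-checked before chunking. A sketch, not part of the prompt: the ~4-characters-per-token heuristic is only a rough cross-check of the model’s reported token_count (tiktoken would give exact GPT-4 counts), and the field names simply mirror the schema described earlier.

```python
from typing import TypedDict

class Segment(TypedDict, total=False):
    """Mirror of the JSON schema requested in the prompt (children is optional)."""
    title: str
    title_path: str
    level: int
    has_children: str        # "Y" or "N"
    token_count: int
    first_few_words: str
    last_few_words: str
    children: list["Segment"]

def token_count_plausible(text: str, reported: int, tolerance: float = 0.5) -> bool:
    """Rough cross-check of a reported token_count using ~4 chars/token
    (English prose); flags counts that are wildly off."""
    estimate = max(1, len(text) // 4)
    return abs(reported - estimate) <= tolerance * max(estimate, reported)

def inconsistent_flags(segments: list[Segment]) -> list[str]:
    """Return title_paths where has_children disagrees with the children array."""
    bad = []
    for seg in segments:
        kids = seg.get("children", [])
        if seg.get("has_children") != ("Y" if kids else "N"):
            bad.append(seg.get("title_path", "?"))
        bad.extend(inconsistent_flags(kids))
    return bad
```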

I’ve not fully tested this yet, and I imagine there might be further changes, but this continues along my current thought path of making this work with a single API call, followed up by local code that reads the JSON and chunks the document.
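That follow-up step could look roughly like this. A minimal sketch under two assumptions: the first_few_words / last_few_words anchors appear verbatim in the source text, and only leaf segments (has_children "N") become chunks; the function names are my own, not an established API.

```python
import json

def iter_leaves(segments):
    """Yield leaf segments (no children) from the nested JSON, depth-first."""
    for seg in segments:
        kids = seg.get("children", [])
        if kids:
            yield from iter_leaves(kids)
        else:
            yield seg

def chunk_document(text, segments):
    """Slice the original document into chunks, using each leaf's
    first_few_words / last_few_words as boundary anchors."""
    chunks = []
    cursor = 0
    for seg in iter_leaves(segments):
        start = text.find(seg["first_few_words"], cursor)
        if start == -1:
            continue  # anchor not found verbatim; skip (or log for manual review)
        end = text.find(seg["last_few_words"], start)
        if end == -1:
            continue
        end += len(seg["last_few_words"])
        chunks.append({"title_path": seg["title_path"], "text": text[start:end]})
        cursor = end
    return chunks
```

With `segments = json.loads(api_response)`, each returned chunk carries its title_path, so the hierarchy survives into the chunk metadata even though the model never returned the chunk text itself.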
