Using the GPT-4 API to Semantically Chunk Documents

So, here is where I am so far. I’ve come up with an instruction prompt and an output format that give me a JSON file that, in theory, will let me create my sub-chunks. You will note that the main difference between this and @sergeliatko 's output is that his includes the content already chunked. For my solution to do that, you’d need to be working with a context window of around 1M tokens (assuming you are working with fairly large documents).

But, as I’ve stated before, I see my solution as more of a “last mile” effort: semantically sub-chunking the chunks that were produced by an initial semantic-chunking pass.
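That two-pass idea can be sketched as a small orchestration loop. This is only a shape, not a working pipeline: `get_structure` and `chunk_from_structure` are hypothetical callables standing in for the API call and the local JSON-driven splitter.

```python
def sub_chunk_pipeline(coarse_chunks, get_structure, chunk_from_structure):
    """Hypothetical 'last mile' pass: each coarse chunk from the initial
    semantic chunker is sent to the model (get_structure) for a structural
    JSON, then split locally (chunk_from_structure)."""
    sub_chunks = []
    for chunk in coarse_chunks:
        structure = get_structure(chunk)  # one API call per coarse chunk
        sub_chunks.extend(chunk_from_structure(chunk, structure))
    return sub_chunks
```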

Task: Analyze the following document and generate a nested JSON representation of its hierarchical structure.

Document:

(Paste the full text of the document here)

Output Format:

The JSON output should be an array of objects, where each object represents a segment of the document’s structure and has the following properties:

  • title: The title of the segment, including any preceding numbering or lettering (e.g., “I. Preamble”, “A. Recognition”).
  • title_path: The full hierarchical path to the segment, incorporating the titles of all parent segments, with each level separated by " - " (e.g., “III. General Provisions - 1. Recognition and Scope of Agreement”).
  • level: The level of the segment in the hierarchy (1 for main sections, 2 for subsections, etc.).
  • has_children: “Y” if the segment has child segments, “N” otherwise.
  • token_count: The approximate number of tokens in the segment’s text.
  • first_few_words: The first few words of the segment’s text (at least 5 words).
  • last_few_words: The last few words of the segment’s text (at least 5 words).
  • children (optional): If the segment has child segments, this property should contain a nested array of child segment objects following the same format.

Additional Instructions:

  • Use a reliable tokenization method to determine token counts.
  • Ensure accurate identification of segment boundaries based on the document’s headings and structure.
  • Pay attention to compound words and special characters during tokenization.
  • Maintain the hierarchical relationships between segments by nesting child elements within their parent segments.

Example Output:

(Provide a small example of the expected JSON output)

	[
	  {
		"title": "I. Preamble",
		"title_path": "I. Preamble", 
		"level": 1,
		"has_children": "N",
		"token_count": 23,
		"first_few_words": "PRODUCER – SAG-AFTRA CODIFIED BASIC AGREEMENT",
		"last_few_words": "\"Producer\" and collectively referred to as \"Producers.\""
	  },
	  {
		"title": "II. Witnesseth",
		"title_path": "II. Witnesseth", 
		"level": 1,
		"has_children": "N",
		"token_count": 13,
		"first_few_words": "WITNESSETH: In consideration of the mutual",
		"last_few_words": "agreements hereinafter contained, it is agreed as follows:"
	  },
	  {
		"title": "III. General Provisions",
		"title_path": "III. General Provisions",
		"level": 1,
		"has_children": "Y",
		"token_count": 15432,
		"first_few_words": "1. RECOGNITION AND SCOPE OF AGREEMENT",
		"last_few_words": "to coding; It shall study and review the appropriateness of the Section relating to per diem rates;",
		"children": [
		  {
			"title": "1. Recognition and Scope of Agreement",
			"title_path": "III. General Provisions - 1. Recognition and Scope of Agreement", 
			"level": 2,
			"has_children": "Y",
			"token_count": 438,
			"first_few_words": "A. Recognition The Union is recognized",
			"last_few_words": "Part II shall apply to background actors employed in the New York Zone.", 
			"children": [
			  {
				"title": "A. Recognition",
				"title_path": "III. General Provisions - 1. Recognition and Scope of Agreement - A. Recognition", 
				"level": 3,
				"has_children": "N",
				"token_count": 287,
				"first_few_words": "The Union is recognized by Producer as",
				"last_few_words": "and body doubles. Background actors are not considered \"performers.\""
			  },
			  {
				"title": "B. Scope", 
				"title_path": "III. General Provisions - 1. Recognition and Scope of Agreement - B. Scope", 
				"level": 3,
				"has_children": "N",
				"token_count": 151,
				"first_few_words": "(1) When Producer has its base of",
				"last_few_words": "Las Vegas, Sacramento, San Diego, San Francisco and Hawaii Zones. Only the provisions"
			  }
			]
		  }
		  // ... (and so on for other children) ... 
		] 
	  } 
	  // ...
	]
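On the receiving side, the schema above can be mirrored and sanity-checked before chunking. A sketch, not part of the prompt: the ~4-characters-per-token heuristic is only a rough cross-check of the model’s reported token_count (tiktoken would give exact GPT-4 counts), and the field names simply mirror the schema described earlier.

```python
from typing import TypedDict

class Segment(TypedDict, total=False):
    """Mirror of the JSON schema requested in the prompt (children is optional)."""
    title: str
    title_path: str
    level: int
    has_children: str        # "Y" or "N"
    token_count: int
    first_few_words: str
    last_few_words: str
    children: list["Segment"]

def token_count_plausible(text: str, reported: int, tolerance: float = 0.5) -> bool:
    """Rough cross-check of a reported token_count using ~4 chars/token
    (English prose); flags counts that are wildly off."""
    estimate = max(1, len(text) // 4)
    return abs(reported - estimate) <= tolerance * max(estimate, reported)

def inconsistent_flags(segments: list[Segment]) -> list[str]:
    """Return title_paths where has_children disagrees with the children array."""
    bad = []
    for seg in segments:
        kids = seg.get("children", [])
        if seg.get("has_children") != ("Y" if kids else "N"):
            bad.append(seg.get("title_path", "?"))
        bad.extend(inconsistent_flags(kids))
    return bad
```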

I’ve not fully tested this yet, and I imagine there might be further changes, but this continues along my current thought path of making this work with a single API call, followed up by local code that reads the JSON and chunks the document.
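That follow-up step could look roughly like this. A minimal sketch under two assumptions: the first_few_words / last_few_words anchors appear verbatim in the source text, and only leaf segments (has_children "N") become chunks; the function names are my own, not an established API.

```python
import json

def iter_leaves(segments):
    """Yield leaf segments (no children) from the nested JSON, depth-first."""
    for seg in segments:
        kids = seg.get("children", [])
        if kids:
            yield from iter_leaves(kids)
        else:
            yield seg

def chunk_document(text, segments):
    """Slice the original document into chunks, using each leaf's
    first_few_words / last_few_words as boundary anchors."""
    chunks = []
    cursor = 0
    for seg in iter_leaves(segments):
        start = text.find(seg["first_few_words"], cursor)
        if start == -1:
            continue  # anchor not found verbatim; skip (or log for manual review)
        end = text.find(seg["last_few_words"], start)
        if end == -1:
            continue
        end += len(seg["last_few_words"])
        chunks.append({"title_path": seg["title_path"], "text": text[start:end]})
        cursor = end
    return chunks
```

With `segments = json.loads(api_response)`, each returned chunk carries its title_path, so the hierarchy survives into the chunk metadata even though the model never returned the chunk text itself.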
