So, here is where I am so far. I’ve come up with an instruction prompt and an output format that give me a JSON file that should, in theory, let me create my sub-chunks. You will note that the main difference between this and @sergeliatko ’s output is that his includes the content already chunked. For my solution to do that, you’d need to be working with a context window of around 1M tokens (assuming you are working with fairly large documents).
But, as I’ve stated before, I see my solution as more of a “last mile” effort: semantically sub-chunking the chunks that have already been semantically chunked at a coarser level – I still need to figure out a better way to say that.
Task: Analyze the following document and generate a nested JSON representation of its hierarchical structure.
Document:
(Paste the full text of the document here)
Output Format:
The JSON output should be an array of objects, where each object represents a segment of the document’s structure and has the following properties:
- title: The title of the segment, including any preceding numbering or lettering (e.g., “I. Preamble”, “A. Recognition”).
- title_path: The full hierarchical path to the segment, incorporating the titles of all parent segments, with each level separated by " - " (e.g., “III. General Provisions - 1. Recognition and Scope of Agreement”).
- level: The level of the segment in the hierarchy (1 for main sections, 2 for subsections, etc.).
- has_children: “Y” if the segment has child segments, “N” otherwise.
- token_count: The approximate number of tokens in the segment’s text.
- first_few_words: The first few words of the segment’s text (at least 5 words).
- last_few_words: The last few words of the segment’s text (at least 5 words).
- children (optional): If the segment has child segments, this property should contain a nested array of child segment objects following the same format.
Additional Instructions:
- Use a reliable tokenization method to determine token counts.
- Ensure accurate identification of segment boundaries based on the document’s headings and structure.
- Pay attention to compound words and special characters during tokenization.
- Maintain the hierarchical relationships between segments by nesting child elements within their parent segments.
Example Output:
(Provide a small example of the expected JSON output, similar to the one in the previous response)
[
  {
    "title": "I. Preamble",
    "title_path": "I. Preamble",
    "level": 1,
    "has_children": "N",
    "token_count": 23,
    "first_few_words": "PRODUCER – SAG-AFTRA CODIFIED BASIC AGREEMENT",
    "last_few_words": "\"Producer\" and collectively referred to as \"Producers.\""
  },
  {
    "title": "II. Witnesseth",
    "title_path": "II. Witnesseth",
    "level": 1,
    "has_children": "N",
    "token_count": 13,
    "first_few_words": "WITNESSETH: In consideration of the mutual",
    "last_few_words": "agreements hereinafter contained, it is agreed as follows:"
  },
  {
    "title": "III. General Provisions",
    "title_path": "III. General Provisions",
    "level": 1,
    "has_children": "Y",
    "token_count": 15432,
    "first_few_words": "1. RECOGNITION AND SCOPE OF AGREEMENT",
    "last_few_words": "to coding; It shall study and review the appropriateness of the Section relating to per diem rates;",
    "children": [
      {
        "title": "1. Recognition and Scope of Agreement",
        "title_path": "III. General Provisions - 1. Recognition and Scope of Agreement",
        "level": 2,
        "has_children": "Y",
        "token_count": 438,
        "first_few_words": "A. Recognition The Union is recognized",
        "last_few_words": "Part II shall apply to background actors employed in the New York Zone.",
        "children": [
          {
            "title": "A. Recognition",
            "title_path": "III. General Provisions - 1. Recognition and Scope of Agreement - A. Recognition",
            "level": 3,
            "has_children": "N",
            "token_count": 287,
            "first_few_words": "The Union is recognized by Producer as",
            "last_few_words": "and body doubles. Background actors are not considered \"performers.\""
          },
          {
            "title": "B. Scope",
            "title_path": "III. General Provisions - 1. Recognition and Scope of Agreement - B. Scope",
            "level": 3,
            "has_children": "N",
            "token_count": 151,
            "first_few_words": "(1) When Producer has its base of",
            "last_few_words": "Las Vegas, Sacramento, San Diego, San Francisco and Hawaii Zones. Only the provisions"
          }
        ]
      }
      // ... (and so on for other children) ...
    ]
  }
  // ...
]
I’ve not fully tested this yet, and I imagine there might be further changes, but it continues along my current thought path: make this work with a single API call, followed by local code that reads the JSON and chunks the document.
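For the “local code” half, here is a minimal sketch of what I have in mind (function names are mine, and it assumes the model reproduces first_few_words and last_few_words verbatim from the document, which is exactly the part that still needs testing):

```python
def collect_leaves(segments):
    """Depth-first walk of the nested JSON, returning only leaf segments."""
    leaves = []
    for seg in segments:
        if seg.get("has_children") == "Y":
            leaves.extend(collect_leaves(seg.get("children", [])))
        else:
            leaves.append(seg)
    return leaves

def chunk_document(document_text, structure):
    """Slice the full document into one chunk per leaf segment,
    using first_few_words / last_few_words as boundary anchors."""
    chunks = []
    cursor = 0  # search forward only, so repeated phrases match in document order
    for seg in collect_leaves(structure):
        begin = document_text.find(seg["first_few_words"], cursor)
        if begin == -1:
            continue  # model paraphrased the boundary words; log and skip
        end = document_text.find(seg["last_few_words"], begin)
        if end == -1:
            continue
        end += len(seg["last_few_words"])
        chunks.append({"title_path": seg["title_path"],
                       "text": document_text[begin:end]})
        cursor = end
    return chunks
```

A json.loads on the API response gives you the structure, and each chunk comes back tagged with its title_path, which makes handy metadata for the vector store.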