OK, this is what I have so far:
Here is a document I am testing with:
https://labor.booksai.org/sites/labor.booksai.org/files/002_SAGAFTRA%20General%20Provisions_Part1_legal_0.pdf – note the 1. 2. 3. Etc. sections
This document is 64,707 tokens, 298785 characters and 139 pages.
I export the source document to .txt file and send it to model.
Prompt:
Please give me an outline of the hierarchal structure of this document.
Return a json array of the outline with each outline segment formatted like this:
title (actual title - include preceding letter, number or roman numeral if applicable)
title path (the hierarchal title path = level 1 title - level 2 title - etc… including this title)
token count (total tokens for this segment)
first few words (first few words that begin the segment - this should be at least 5)
last few words (the last few words of the segment - this should be at least 5)
I asked it to just analyze the first few lines:
[
{
"title": "Preamble",
"title_path": "Preamble",
"token_count": 23,
"first_few_words": "PRODUCER – SAG-AFTRA CODIFIED BASIC AGREEMENT",
"last_few_words": "referred to as \"Producers.\""
},
{
"title": "Witness Eth",
"title_path": "Witness Eth",
"token_count": 14,
"first_few_words": "In consideration of the mutual agreements",
"last_few_words": "agreed as follows:"
},
{
"title": "General Provisions",
"title_path": "General Provisions",
"token_count": 15432,
"first_few_words": "1. RECOGNITION AND SCOPE OF AGREEMENT",
"last_few_words": "standards at all studios with respect to the employment of performers;"
},
{
"title": "Recognition and Scope of Agreement",
"title_path": "General Provisions - Recognition and Scope of Agreement",
"token_count": 438,
"first_few_words": "A. Recognition The Union is recognized",
"last_few_words": "provisions of Schedule X, Part II shall apply to background actors employed in the New York Zone."
},
{
"title": "Recognition",
"title_path": "General Provisions - Recognition and Scope of Agreement - Recognition",
"token_count": 287,
"first_few_words": "The Union is recognized by Producer as",
"last_few_words": "Background actors are not considered \"performers.\""
},
{
"title": "Scope",
"title_path": "General Provisions - Recognition and Scope of Agreement - Scope",
"token_count": 151,
"first_few_words": "(1) When Producer has its base of",
"last_few_words": "of Schedule X, Part I shall apply to the employment of background actors in the Los Angeles, Las Vegas, Sacramento, San Diego, San Francisco and Hawaii Zones."
},
{
"title": "Union Security",
"title_path": "General Provisions - Union Security",
"token_count": 785,
"first_few_words": "A. Every performer hereafter employed by",
"last_few_words": "to arbitration between the Union and the Producer under Section 9 of these General Provisions."
},
{
"title": "Strikes",
"title_path": "General Provisions - Strikes",
"token_count": 1406,
"first_few_words": "A. No-Strike Clause The Union agrees",
"last_few_words": "action."
},
{
"title": "No-Strike Clause",
"title_path": "General Provisions - Strikes - No-Strike Clause",
"token_count": 31,
"first_few_words": "The Union agrees that, during the",
"last_few_words": "signatory hereto."
},
{
"title": "Rights and Duties of Union Members and Producers",
"title_path": "General Provisions - Strikes - Rights and Duties of Union Members and Producers",
"token_count": 1375,
"first_few_words": "(1) If, after the expiration or other",
"last_few_words": "Producer and Union."
}
]
content_copyUse code with caution.Json
Explanation:
-
The output is a JSON array containing objects, each representing a segment from the hierarchical outline.
-
Each object has the following properties:
-
“title”: The actual title of the segment.
-
“title_path”: The hierarchical path to the segment, including the titles of all parent sections.
-
“token_count”: The total number of tokens in the segment’s text.
-
“first_few_words”: The first few words of the segment’s text.
-
“last_few_words”: The last few words of the segment’s text.
Please note that the token count is an approximation and might vary slightly depending on the specific tokenization method used.
This test demonstrates the feasibility of the approach and provides a structured way to access information about each segment, including its title, location in the hierarchy, size, and a preview of its content.
After a few adjustments, this is now what I am getting:
[
{
"title": "I. Preamble",
"title_path": "I. Preamble",
"token_count": 23,
"first_few_words": "PRODUCER – SAG-AFTRA CODIFIED BASIC AGREEMENT",
"last_few_words": "\"Producer\" and collectively referred to as \"Producers.\""
},
{
"title": "II. Witness Eth",
"title_path": "II. Witness Eth",
"token_count": 14,
"first_few_words": "In consideration of the mutual agreements",
"last_few_words": "hereinafter contained, it is agreed as follows:"
},
{
"title": "III. General Provisions",
"title_path": "III. General Provisions",
"token_count": 15432,
"first_few_words": "1. RECOGNITION AND SCOPE OF AGREEMENT",
"last_few_words": "to coding; It shall study and review the appropriateness of the Section relating to per diem rates;"
},
{
"title": "1. Recognition and Scope of Agreement",
"title_path": "III. General Provisions - 1. Recognition and Scope of Agreement",
"token_count": 438,
"first_few_words": "A. Recognition The Union is recognized",
"last_few_words": "Part II shall apply to background actors employed in the New York Zone."
},
{
"title": "A. Recognition",
"title_path": "III. General Provisions - 1. Recognition and Scope of Agreement - A. Recognition",
"token_count": 287,
"first_few_words": "The Union is recognized by Producer as",
"last_few_words": "and body doubles. Background actors are not considered \"performers.\""
},
{
"title": "B. Scope",
"title_path": "III. General Provisions - 1. Recognition and Scope of Agreement - B. Scope",
"token_count": 151,
"first_few_words": "(1) When Producer has its base of",
"last_few_words": "Las Vegas, Sacramento, San Diego, San Francisco and Hawaii Zones. Only the provisions"
},
{
"title": "2. Union Security",
"title_path": "III. General Provisions - 2. Union Security",
"token_count": 785,
"first_few_words": "A. Every performer hereafter employed by",
"last_few_words": "breach of the provisions of this Section shall be subject to arbitration between the Union and the Producer"
},
{
"title": "3. Strikes",
"title_path": "III. General Provisions - 3. Strikes",
"token_count": 1406,
"first_few_words": "A. No-Strike Clause The Union agrees",
"last_few_words": "proceedings in accordance with its by-laws against the participants in such unauthorized"
},
{
"title": "A. No-Strike Clause",
"title_path": "III. General Provisions - 3. Strikes - A. No-Strike Clause",
"token_count": 31,
"first_few_words": "The Union agrees that, during the",
"last_few_words": "strike affecting motion picture production against any Producer signatory hereto."
},
{
"title": "B. Rights and Duties of Union Members and Producers",
"title_path": "III. General Provisions - 3. Strikes - B. Rights and Duties of Union Members and Producers",
"token_count": 1375,
"first_few_words": "(1) If, after the expiration or other",
"last_few_words": "strike, unless this covenant be sooner terminated by the written consent of Producer and Union."
}
]
Then, I realized I needed to add the levels as well:
Adding Hierarchy Levels to the JSON:
[
{
"title": "I. Preamble",
"title_path": "I. Preamble",
"level": 1,
"token_count": 23,
"first_few_words": "PRODUCER – SAG-AFTRA CODIFIED BASIC AGREEMENT",
"last_few_words": "\"Producer\" and collectively referred to as \"Producers.\""
},
{
"title": "II. Witness Eth",
"title_path": "II. Witness Eth",
"level": 1,
"token_count": 13,
"first_few_words": "WITNESSETH: In consideration of the mutual",
"last_few_words": "agreements hereinafter contained, it is agreed as follows:"
},
{
"title": "III. General Provisions",
"title_path": "III. General Provisions",
"level": 1,
"token_count": 15432,
"first_few_words": "1. RECOGNITION AND SCOPE OF AGREEMENT",
"last_few_words": "to coding; It shall study and review the appropriateness of the Section relating to per diem rates;"
},
{
"title": "1. Recognition and Scope of Agreement",
"title_path": "III. General Provisions - 1. Recognition and Scope of Agreement",
"level": 2,
"token_count": 438,
"first_few_words": "A. Recognition The Union is recognized",
"last_few_words": "Part II shall apply to background actors employed in the New York Zone."
},
{
"title": "A. Recognition",
"title_path": "III. General Provisions - 1. Recognition and Scope of Agreement - A. Recognition",
"level": 3,
"token_count": 287,
"first_few_words": "The Union is recognized by Producer as",
"last_few_words": "and body doubles. Background actors are not considered \"performers.\""
},
// ... (and so on for other segments, with appropriate level values) ...
]
content_copyUse code with caution.Json
I’ve added a “level” property to each object, indicating the segment’s level in the hierarchy. The levels are assigned as follows:
-
Level 1: Main sections (Preamble, Witness Eth, General Provisions, Exhibits, Letter Agreement)
-
Level 2: Numbered sections within General Provisions (1., 2., 3., etc.)
-
Level 3: Lettered subsections within numbered sections (A., B., C., etc.)
-
(And so on for further nesting, if present)
This enhanced JSON output provides a more complete representation of the document’s structure, including the hierarchical relationships and level of each segment, along with the other helpful information.
So, there are still some organizational issues to work out, but not bad for the initial execution of the idea.