I like that. Very logical. But I come from the “always use original text” school of thought, so my biggest hurdle was accurately identifying the chunks in the hierarchy. I think I’ve done it.
JSON Output with Line Numbers (First Few Segments):
[
{
"title": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES",
"title_path": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES",
"level": 1,
"has_children": "Y",
"token_count": 3431,
"character_count": 18413,
"start_line": 1,
"end_line": 193
},
{
"title": "A. GENERAL RULES",
"title_path": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES - A. GENERAL RULES",
"level": 2,
"has_children": "Y",
"token_count": 3431,
"character_count": 18413,
"start_line": 3,
"end_line": 193
},
{
"title": "1. Parties",
"title_path": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES - A. GENERAL RULES - 1. Parties",
"level": 3,
"has_children": "Y",
"token_count": 234,
"character_count": 1239,
"start_line": 9,
"end_line": 21
},
{
"title": "a. In any grievance or arbitration concerning",
"title_path": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES - A. GENERAL RULES - 1. Parties - a. In any grievance or arbitration concerning",
"level": 4,
"has_children": "N",
"token_count": 32,
"character_count": 171,
"start_line": 11,
"end_line": 12
},
{
"title": "b. Except as provided in subparagraph a. above, only",
"title_path": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES - A. GENERAL RULES - 1. Parties - b. Except as provided in subparagraph a. above, only",
"level": 4,
"has_children": "N",
"token_count": 10,
"character_count": 64,
"start_line": 14,
"end_line": 14
},
{
"title": "d. The party commencing a claim in grievance or",
"title_path": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES - A. GENERAL RULES - 1. Parties - d. The party commencing a claim in grievance or",
"level": 4,
"has_children": "N",
"token_count": 18,
"character_count": 97,
"start_line": 17,
"end_line": 18
}
// ... (and so on for other segments) ...
]
This output includes the first few segments of the 13_Article_11_1.txt document with the following properties:
- title: The segment title, including preceding numbering/lettering.
- title_path: Full hierarchical path to the segment.
- level: Level in the hierarchy.
- has_children: Whether the segment has child segments.
- token_count: Approximate number of tokens in the segment.
- character_count: Number of characters in the segment.
- start_line: Line number where the segment begins.
- end_line: Line number where the segment ends.
Please note that the token and character counts are approximate and may vary slightly depending on the specific tokenization and counting methods used.
User
This does not work. You are not accurately reporting the line numbers. But, I have another idea.
(Attached file: article13lines.txt, 21908 tokens)
This file contains the same document, only with linenoxxxx: prepended to each line. Now, you can report back the beginning and ending lineno values for each segment instead of trying to calculate the actual line numbers. Run another test so I can see what you come back with. Remember that in your hierarchy evaluation, you ignore “linenoxxxx:”. You only use it to report back the actual begin and end line numbers for each segment.
Model
Addressing Line Number Issues and New Approach:
You’re absolutely right; I apologize for the inaccuracies in reporting line numbers previously.
The new approach you suggest, using a file with prepended line markers (“linenoxxxx:”), is a brilliant solution! This will allow me to accurately identify the starting and ending lines of each segment without relying on potentially error-prone line counting methods.
Here’s how I will adapt the process:
- Process the File with Line Markers: I will analyze the provided file article13lines.txt and extract the hierarchical structure as before, ignoring the “linenoxxxx:” prefixes.
- Extract Line Numbers: For each segment, I will extract the “linenoxxxx:” values from the start and end lines of the segment within the file.
- Include Line Numbers in JSON: The extracted line number values will be included in the JSON output as the “start_line” and “end_line” properties, respectively.
Here’s an example of the expected JSON output for the first few segments using this approach:
[
{
"title": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES",
"title_path": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES",
"level": 1,
"has_children": "Y",
"token_count": 3431,
"character_count": 18413,
"start_line": "line0001",
"end_line": "line0193"
},
{
"title": "A. GENERAL RULES",
"title_path": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES - A. GENERAL RULES",
"level": 2,
"has_children": "Y",
"token_count": 3431,
"character_count": 18413,
"start_line": "line0003",
"end_line": "line0193"
},
// ... (and so on for other segments) ...
]
I will now process the article13lines.txt file and provide you with the JSON output containing the accurate line number information.
So, for me, it is now a 4-step process:
- Export the PDF (or whatever) document to txt.
- Run a PHP program to prepend linenoxxxx: to each line.
- Send this file to the model along with instructions to create the hierarchy JSON.
- Run a PHP program on the model output to create the chunks.
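To make the prepend step concrete, here is a minimal sketch in Python rather than PHP. It is not my actual program: the function name `prepend_line_markers` and the four-digit zero padding are assumptions chosen to match the linenoxxxx: pattern described above.

```python
def prepend_line_markers(text: str, width: int = 4) -> str:
    """Prefix every line with a zero-padded marker such as 'lineno0001:'.

    The marker format and the default width of 4 are assumptions for
    illustration; a document with more than 9999 lines needs a larger width.
    """
    lines = text.splitlines()
    return "\n".join(
        f"lineno{number:0{width}d}:{line}"
        for number, line in enumerate(lines, start=1)
    )


if __name__ == "__main__":
    sample = "ARTICLE 11\n\nA. GENERAL RULES"
    print(prepend_line_markers(sample))
```

Blank lines get a marker too, so the numbering stays aligned with the exported text file.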
While that may seem like a lot, it’s only one API call, and all of it can be automated with code. And it is the MOST efficient and accurate way to extract actual chunks from the actual document (we are talking LARGE documents here) that I have come across so far.
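The last step, cutting chunks out of the original text from the model's JSON, could look roughly like the following. Again this is a hedged Python sketch, not the PHP I run: `marker_to_int` and `extract_chunks` are hypothetical names, and the code assumes the model returns valid JSON with the `title_path`, `start_line`, and `end_line` fields shown in the example output.

```python
import json
import re


def marker_to_int(marker) -> int:
    """Pull the number out of a value like 'line0003', 'lineno0003:', or 3."""
    match = re.search(r"\d+", str(marker))
    if match is None:
        raise ValueError(f"no line number found in {marker!r}")
    return int(match.group())


def extract_chunks(original_text: str, hierarchy_json: str) -> list[dict]:
    """Slice the ORIGINAL (unmarked) text using the reported line ranges."""
    lines = original_text.splitlines()
    chunks = []
    for segment in json.loads(hierarchy_json):
        start = marker_to_int(segment["start_line"])
        end = marker_to_int(segment["end_line"])
        # Reject ranges the model could not have read from the file.
        if not 1 <= start <= end <= len(lines):
            raise ValueError(f"bad range {start}-{end} for {segment['title_path']!r}")
        chunks.append({
            "title_path": segment["title_path"],
            # Markers are 1-based and inclusive, so shift the start by one.
            "text": "\n".join(lines[start - 1:end]),
        })
    return chunks
```

Because the text is sliced from the unmarked export, each chunk is the original wording, which is the whole point. Note that the `// ... (and so on for other segments) ...` placeholder in the sample output is not valid JSON, so the real model response has to come back without it for `json.loads` to accept it.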
I am now going to test this out.