OK, I have made considerable progress over the past few days. It hasn’t been easy. While the prompt I created for the API call returned the information I wanted, the output was too large for gpt-4-turbo-preview or Gemini 1.5 Pro to return in a single response. So, I had to modify my steps:
- export the PDF (or whatever) source document to .txt
- run code to prepend a line-number prefix (e.g., line0001:) to every line
- send this numbered file to the model along with instructions to create a hierarchy JSON file
- process the model's output with code to add end_line numbers and write out that JSON file
- run code on the JSON output to create the chunks
So, here’s a little more detail on each step:
- I export the source file (usually a PDF) to .txt format.
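For the PDF case, a minimal sketch of this export step might look like the following. It assumes the pypdf library and placeholder file names, not my exact script:

```python
from pypdf import PdfReader  # assumes pypdf is installed

def pdf_to_txt(pdf_path: str, txt_path: str) -> None:
    """Export each page's text from the PDF into a plain .txt file."""
    reader = PdfReader(pdf_path)
    pages = [page.extract_text() or "" for page in reader.pages]
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write("\n".join(pages))

pdf_to_txt("basic_agreement.pdf", "basic_agreement.txt")  # placeholder names
```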
- I place line numbers in the file because none of the models seem to be able to identify what line they are on, even in text files (a sketch of this numbering pass follows the sample below):
line0001:ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES
line0002:
line0003:A. GENERAL RULES
line0004:
line0005:Unless otherwise provided in this Article 11 or elsewhere in this Basic Agreement, the rules and procedures for grievance and arbitration shall be as follows:
line0006:
line0007:1. Parties
line0008:
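The numbering pass is just a few lines of code. This is a minimal sketch; the zero-padded width and file names are my own choices for illustration:

```python
def add_line_numbers(src_path: str, dst_path: str) -> None:
    """Prefix every line of the text file with a fixed-width lineNNNN: marker."""
    with open(src_path, "r", encoding="utf-8") as src:
        lines = src.readlines()
    with open(dst_path, "w", encoding="utf-8") as dst:
        for i, line in enumerate(lines, start=1):
            dst.write(f"line{i:04d}:{line}")

add_line_numbers("basic_agreement.txt", "basic_agreement_numbered.txt")
```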
- Now for the model magic: I have created a prompt that will create a semantic hierarchy of the file, giving me the following information:
[
{
"title": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES",
"level": 1,
"token_count": 2878,
"start_line": "line0001",
"has_children": "Y",
"children": [
{
"title": "A. GENERAL RULES",
"level": 2,
"token_count": 1606,
"start_line": "line0003",
"has_children": "Y",
"children": [
{
"title": "1. Parties",
"level": 3,
"token_count": 170,
"start_line": "line0007",
"has_children": "N",
"children": []
},
My prompt also instructs the model to include child segments only if the parent segment is > 600 tokens (roughly 2,500–3,000 characters). 600 tokens is my preferred chunk size, but it could be set to whatever one wishes.
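For reference, the single API call looks roughly like this. It is only a sketch: the system prompt here is a condensed stand-in for my actual prompt, and the file name is a placeholder:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("basic_agreement_numbered.txt", encoding="utf-8") as f:
    numbered_text = f.read()

# Condensed stand-in for the real hierarchy prompt.
system_prompt = (
    "You will receive a document whose lines are prefixed with lineNNNN:. "
    "Return a JSON array describing its semantic hierarchy. For each segment, "
    "include title, level, token_count, start_line, has_children, and children. "
    "Only include children when the parent segment exceeds 600 tokens."
)

response = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": numbered_text},
    ],
)

hierarchy_json = response.choices[0].message.content
```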
- Next, I run code to add the end line numbers. I originally wanted the model to do this, but it is something else the models proved to be particularly bad at. So now, I have each chunk of the document hierarchically identified with exact start and end line numbers for the chunk text. My original goal for this entire adventure (a sketch of the end-line pass follows the JSON below):
[
{
"title": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES",
"level": 1,
"token_count": 2878,
"start_line": "line0001",
"has_children": "Y",
"children": [
{
"title": "A. GENERAL RULES",
"level": 2,
"token_count": 1606,
"start_line": "line0003",
"has_children": "Y",
"children": [
{
"title": "1. Parties",
"level": 3,
"token_count": 170,
"start_line": "line0007",
"has_children": "N",
"children": [],
"end_line": "line0028"
},
{
"title": "2. Time Limits",
"level": 3,
"token_count": 283,
"start_line": "line0029",
"has_children": "N",
"children": [],
"end_line": "line0042"
},
{
"title": "3. Place of Hearing",
"level": 3,
"token_count": 136,
"start_line": "line0043",
"has_children": "N",
"children": [],
"end_line": "line0050"
},
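The end-line pass itself is simple once the hierarchy is available: each segment ends one line before its next sibling starts, and the last sibling ends where its parent ends. Here is a rough sketch of that pass, my own code rather than the exact script, assuming the nested JSON structure shown above:

```python
import json

def line_num(tag: str) -> int:
    return int(tag.replace("line", ""))

def line_tag(n: int) -> str:
    return f"line{n:04d}"

def add_end_lines(siblings: list, parent_end: int) -> None:
    """Each segment ends one line before its next sibling starts;
    the last sibling ends where its parent ends."""
    for i, node in enumerate(siblings):
        if i + 1 < len(siblings):
            end = line_num(siblings[i + 1]["start_line"]) - 1
        else:
            end = parent_end
        node["end_line"] = line_tag(end)
        if node.get("children"):
            add_end_lines(node["children"], end)

with open("hierarchy.json", encoding="utf-8") as f:       # model output from step 3
    hierarchy = json.load(f)
with open("basic_agreement_numbered.txt", encoding="utf-8") as f:
    total_lines = sum(1 for _ in f)

add_end_lines(hierarchy, total_lines)
with open("hierarchy_with_end_lines.json", "w", encoding="utf-8") as f:
    json.dump(hierarchy, f, indent=2)
```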
- Now, I only need to extract and embed the chunks. Since I now know the start and end lines, this should be pretty simple. By the way, this was also @jr.2509 's idea.
This code should also carry the title path for each chunk to indicate where it belongs in the overall hierarchy; in fact, I should actually do that in Step 4. Something like the sketch below is what I have in mind.
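This is only a rough sketch under my assumptions about the JSON above; the embedding model name is just one I picked for illustration:

```python
import json
from openai import OpenAI

client = OpenAI()

def extract_chunks(siblings: list, numbered_lines: list, title_path=()) -> list:
    """Walk the hierarchy and emit one chunk per leaf, carrying its title path."""
    chunks = []
    for node in siblings:
        path = title_path + (node["title"],)
        if node.get("children"):
            chunks.extend(extract_chunks(node["children"], numbered_lines, path))
        else:
            start = int(node["start_line"].replace("line", ""))
            end = int(node["end_line"].replace("line", ""))
            # Strip the lineNNNN: prefix when assembling the chunk text.
            body = "".join(
                line.split(":", 1)[1] for line in numbered_lines[start - 1:end]
            )
            chunks.append({"title_path": " > ".join(path), "text": body})
    return chunks

with open("hierarchy_with_end_lines.json", encoding="utf-8") as f:
    hierarchy = json.load(f)
with open("basic_agreement_numbered.txt", encoding="utf-8") as f:
    numbered_lines = f.readlines()

chunks = extract_chunks(hierarchy, numbered_lines)
for chunk in chunks:
    emb = client.embeddings.create(
        model="text-embedding-3-small",  # illustrative choice
        input=f'{chunk["title_path"]}\n\n{chunk["text"]}',
    )
    chunk["embedding"] = emb.data[0].embedding
```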
If I happen to run across a chunk that is still > 600 tokens, then I’m thinking of using @jr.2509 's approach to sub-chunk it further: Using gpt-4 API to Semantically Chunk Documents - #25 by jr.2509
The cool thing about this is that when I’m all done, I should have chunks which all contain semantically whole ideas, are all under my chunk token maximum, and are connected hierarchically to the larger source document.
Yes, it requires a lot of code, but only one API call to the model.
Sweet!