Using gpt-4 API to Semantically Chunk Documents

OK, I have made considerable progress over the past few days. It hasn’t been easy. While the prompt I created for the API call returned the information I wanted, the output was too large for gpt-4-turbo-preview or gemini 1.5 pro to return in a single response. So I had to break the process into the following steps:

  1. export the pdf (or whatever) document to .txt.
  2. run code to prepend a line-number marker (line0001:, line0002:, …) to each line.
  3. send this numbered file to the model along with instructions to create a hierarchy JSON file.
  4. process this file with code to add end_line numbers and output that JSON file.
  5. run code on the JSON output to create the chunks.

So, here’s a little more detail on each step:

  1. I export the source file (usually a pdf) to .txt format.

  2. I place line numbers in the file because none of the models seem to be able to identify what line they are on, even in text files:

line0001:ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES
line0002:
line0003:A. GENERAL RULES
line0004:
line0005:Unless otherwise provided in this Article 11 or elsewhere in this Basic Agreement, the rules and procedures for grievance and arbitration shall be as follows:
line0006:
line0007:1. Parties
line0008:
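
A minimal sketch of what this numbering step can look like in Python (the file names here are just illustrative):

# Step 2 sketch: prefix every line of the .txt export with a zero-padded
# marker like "line0001:" so the model can reference exact positions.
def number_lines(src_path: str, dst_path: str) -> None:
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for i, line in enumerate(src, start=1):
            dst.write(f"line{i:04d}:{line}")

number_lines("agreement.txt", "agreement_numbered.txt")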

  3. Now for the model magic: I have created a prompt that produces a semantic hierarchy of the file, giving me the following information:
[
  {
    "title": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES",
    "level": 1,
    "token_count": 2878,
    "start_line": "line0001",
    "has_children": "Y",
    "children": [
      {
        "title": "A. GENERAL RULES",
        "level": 2,
        "token_count": 1606,
        "start_line": "line0003",
        "has_children": "Y",
        "children": [
          {
            "title": "1. Parties",
            "level": 3,
            "token_count": 170,
            "start_line": "line0007",
            "has_children": "N",
            "children": []
          },

My prompt also instructs the model to only include children segments if the parent segment is > 600 tokens (about 2500 to 3000 characters). 600 tokens is my preferred chunk size, but it could be set to whatever one wishes.
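
For reference, here is a rough sketch of the shape of that single call (the system prompt is only a condensed illustration, not the actual prompt):

# Step 3 sketch: one call that asks the model for the hierarchy JSON.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "The user message is a document whose lines are prefixed with markers "
    "like 'line0001:'. Return a JSON array describing its semantic "
    "hierarchy. Each node needs: title, level, token_count, start_line, "
    "has_children ('Y'/'N') and children. Only include children when the "
    "parent segment exceeds 600 tokens."
)

with open("agreement_numbered.txt", "r", encoding="utf-8") as f:
    numbered_text = f.read()

response = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": numbered_text},
    ],
)
hierarchy_json = response.choices[0].message.content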

  4. Next, I run code to add the end line numbers (a sketch of this step follows the JSON below). I originally wanted the model to do this, but it is something else the models proved to be particularly bad at. So now I have each chunk of the document hierarchically identified with exact start and end line numbers for its text, which was my original goal for this entire adventure:
[
    {
        "title": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES",
        "level": 1,
        "token_count": 2878,
        "start_line": "line0001",
        "has_children": "Y",
        "children": [
            {
                "title": "A. GENERAL RULES",
                "level": 2,
                "token_count": 1606,
                "start_line": "line0003",
                "has_children": "Y",
                "children": [
                    {
                        "title": "1. Parties",
                        "level": 3,
                        "token_count": 170,
                        "start_line": "line0007",
                        "has_children": "N",
                        "children": [],
                        "end_line": "line0028"
                    },
                    {
                        "title": "2. Time Limits",
                        "level": 3,
                        "token_count": 283,
                        "start_line": "line0029",
                        "has_children": "N",
                        "children": [],
                        "end_line": "line0042"
                    },
                    {
                        "title": "3. Place of Hearing",
                        "level": 3,
                        "token_count": 136,
                        "start_line": "line0043",
                        "has_children": "N",
                        "children": [],
                        "end_line": "line0050"
                    },
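
Here is a sketch of that end-line logic (one plausible implementation, with illustrative file names): each node ends one line before the next node at the same or a shallower level starts, and the last node in a span ends where its parent span ends.

# Step 4 sketch: fill in end_line for every node in the hierarchy JSON.
import json

def line_no(marker: str) -> int:
    return int(marker.removeprefix("line"))  # "line0029" -> 29

def add_end_lines(nodes: list, end_boundary: int) -> None:
    for i, node in enumerate(nodes):
        if i + 1 < len(nodes):
            # a node ends just before its next sibling begins
            node_end = line_no(nodes[i + 1]["start_line"]) - 1
        else:
            # the last sibling ends where the parent span ends
            node_end = end_boundary
        node["end_line"] = f"line{node_end:04d}"
        if node["children"]:
            add_end_lines(node["children"], node_end)

with open("hierarchy.json", "r", encoding="utf-8") as f:
    hierarchy = json.load(f)

with open("agreement_numbered.txt", "r", encoding="utf-8") as f:
    total_lines = sum(1 for _ in f)

add_end_lines(hierarchy, total_lines)

with open("hierarchy_with_end_lines.json", "w", encoding="utf-8") as f:
    json.dump(hierarchy, f, indent=4)
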
  5. Now, I only need to extract and embed the chunks. Since I know the start and end line of each chunk, this should be pretty simple. By the way, this was also @jr.2509 's idea.

This code should also contain the title path for each chunk to indicate where it belongs in the overall hierarchy. In fact, I should actually do that in Step 4.
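
A sketch of that extraction step, including the title path (the field and file names are illustrative):

# Step 5 sketch: pull each leaf chunk's text by its start/end lines and
# record the title path so every chunk knows where it sits in the document.
import json

def line_no(marker: str) -> int:
    return int(marker.removeprefix("line"))

def extract_chunks(nodes, lines, path=()):
    chunks = []
    for node in nodes:
        title_path = path + (node["title"],)
        if node["children"]:
            chunks.extend(extract_chunks(node["children"], lines, title_path))
        else:
            start = line_no(node["start_line"]) - 1
            end = line_no(node["end_line"])
            # strip the "lineNNNN:" prefixes back off the chunk text
            text = "".join(l.split(":", 1)[1] for l in lines[start:end])
            chunks.append({"title_path": " > ".join(title_path), "text": text})
    return chunks

with open("hierarchy_with_end_lines.json", encoding="utf-8") as f:
    hierarchy = json.load(f)
with open("agreement_numbered.txt", encoding="utf-8") as f:
    lines = f.readlines()

chunks = extract_chunks(hierarchy, lines)
# each chunk dict is now ready to embed, e.g. with text-embedding-3-small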

If I happen to run across a chunk that is still > 600 tokens, then I’m thinking of using @jr.2509 's approach to further sub-chunk: Using gpt-4 API to Semantically Chunk Documents - #25 by jr.2509
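
To spot any chunks that still exceed the limit before deciding to sub-chunk, a quick token count with tiktoken works (this assumes the chunks list from the previous sketch):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the gpt-4 tokenizer
oversized = [c for c in chunks if len(enc.encode(c["text"])) > 600]
print(f"{len(oversized)} chunks still need sub-chunking")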

The cool thing about this is that when I’m all done, I should have chunks which all contain semantically whole ideas, are all under my chunk token maximum, and are connected hierarchically to the larger source document.

Yes, it requires a lot of code, but only one API call to the model.

Sweet!
