Using gpt-4 API to Semantically Chunk Documents

OK, I have made considerable progress over the past few days. It hasn’t been easy. While the prompt I created for the API call returned the information I wanted, the output was too large for gpt-4-turbo-preview or gemini 1.5 pro to return in a single response. So I had to break the process into the following steps:

  1. export the pdf (or whatever) document to .txt.
  2. run code to prepend a line-number marker (line0001:, line0002:, …) to each line.
  3. send this numbered file to the model along with instructions to create a hierarchy JSON file.
  4. process this file with code to add end_line numbers and output that JSON file.
  5. run code on the JSON output to create the chunks.

So, here’s a little more detail on each step:

  1. I export the source file (usually a pdf) to .txt format.

  2. I place line numbers in the file because none of the models seem to be able to identify what line they are on, even in text files:

line0001:ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES
line0002:
line0003:A. GENERAL RULES
line0004:
line0005:Unless otherwise provided in this Article 11 or elsewhere in this Basic Agreement, the rules and procedures for grievance and arbitration shall be as follows:
line0006:
line0007:1. Parties
line0008:
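
A minimal sketch of what this numbering step can look like in Python (the file names here are just illustrative):

# Step 2 sketch: prefix every line of the .txt export with a zero-padded
# marker like "line0001:" so the model can reference exact positions.
def number_lines(src_path: str, dst_path: str) -> None:
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for i, line in enumerate(src, start=1):
            dst.write(f"line{i:04d}:{line}")

number_lines("agreement.txt", "agreement_numbered.txt")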

  3. Now for the model magic: I have created a prompt that produces a semantic hierarchy of the file, giving me the following information:
[
  {
    "title": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES",
    "level": 1,
    "token_count": 2878,
    "start_line": "line0001",
    "has_children": "Y",
    "children": [
      {
        "title": "A. GENERAL RULES",
        "level": 2,
        "token_count": 1606,
        "start_line": "line0003",
        "has_children": "Y",
        "children": [
          {
            "title": "1. Parties",
            "level": 3,
            "token_count": 170,
            "start_line": "line0007",
            "has_children": "N",
            "children": []
          },

My prompt also instructs the model to only include children segments if the parent segment is > 600 tokens (about 2500 to 3000 characters). 600 tokens is my preferred chunk size, but it could be set to whatever one wishes.
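
For reference, here is a rough sketch of the shape of that single call (the system prompt is only a condensed illustration, not the actual prompt):

# Step 3 sketch: one call that asks the model for the hierarchy JSON.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "The user message is a document whose lines are prefixed with markers "
    "like 'line0001:'. Return a JSON array describing its semantic "
    "hierarchy. Each node needs: title, level, token_count, start_line, "
    "has_children ('Y'/'N') and children. Only include children when the "
    "parent segment exceeds 600 tokens."
)

with open("agreement_numbered.txt", "r", encoding="utf-8") as f:
    numbered_text = f.read()

response = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": numbered_text},
    ],
)
hierarchy_json = response.choices[0].message.content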

  4. Next, I run code to add the end line numbers (a sketch of this step follows the JSON below). I originally wanted the model to do this, but it is something else the models proved to be particularly bad at. So now I have each chunk of the document hierarchically identified with exact start and end line numbers for its text, which was my original goal for this entire adventure:
[
    {
        "title": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES",
        "level": 1,
        "token_count": 2878,
        "start_line": "line0001",
        "has_children": "Y",
        "children": [
            {
                "title": "A. GENERAL RULES",
                "level": 2,
                "token_count": 1606,
                "start_line": "line0003",
                "has_children": "Y",
                "children": [
                    {
                        "title": "1. Parties",
                        "level": 3,
                        "token_count": 170,
                        "start_line": "line0007",
                        "has_children": "N",
                        "children": [],
                        "end_line": "line0028"
                    },
                    {
                        "title": "2. Time Limits",
                        "level": 3,
                        "token_count": 283,
                        "start_line": "line0029",
                        "has_children": "N",
                        "children": [],
                        "end_line": "line0042"
                    },
                    {
                        "title": "3. Place of Hearing",
                        "level": 3,
                        "token_count": 136,
                        "start_line": "line0043",
                        "has_children": "N",
                        "children": [],
                        "end_line": "line0050"
                    },
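
Here is a sketch of that end-line logic (one plausible implementation, with illustrative file names): each node ends one line before the next node at the same or a shallower level starts, and the last node in a span ends where its parent span ends.

# Step 4 sketch: fill in end_line for every node in the hierarchy JSON.
import json

def line_no(marker: str) -> int:
    return int(marker.removeprefix("line"))  # "line0029" -> 29

def add_end_lines(nodes: list, end_boundary: int) -> None:
    for i, node in enumerate(nodes):
        if i + 1 < len(nodes):
            # a node ends just before its next sibling begins
            node_end = line_no(nodes[i + 1]["start_line"]) - 1
        else:
            # the last sibling ends where the parent span ends
            node_end = end_boundary
        node["end_line"] = f"line{node_end:04d}"
        if node["children"]:
            add_end_lines(node["children"], node_end)

with open("hierarchy.json", "r", encoding="utf-8") as f:
    hierarchy = json.load(f)

with open("agreement_numbered.txt", "r", encoding="utf-8") as f:
    total_lines = sum(1 for _ in f)

add_end_lines(hierarchy, total_lines)

with open("hierarchy_with_end_lines.json", "w", encoding="utf-8") as f:
    json.dump(hierarchy, f, indent=4)
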
  5. Now, I only need to extract and embed the chunks. Since I know the start and end line of each chunk, this should be pretty simple. By the way, this was also @jr.2509 's idea.

This code should also contain the title path for each chunk to indicate where it belongs in the overall hierarchy. In fact, I should actually do that in Step 4.
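
A sketch of that extraction step, including the title path (the field and file names are illustrative):

# Step 5 sketch: pull each leaf chunk's text by its start/end lines and
# record the title path so every chunk knows where it sits in the document.
import json

def line_no(marker: str) -> int:
    return int(marker.removeprefix("line"))

def extract_chunks(nodes, lines, path=()):
    chunks = []
    for node in nodes:
        title_path = path + (node["title"],)
        if node["children"]:
            chunks.extend(extract_chunks(node["children"], lines, title_path))
        else:
            start = line_no(node["start_line"]) - 1
            end = line_no(node["end_line"])
            # strip the "lineNNNN:" prefixes back off the chunk text
            text = "".join(l.split(":", 1)[1] for l in lines[start:end])
            chunks.append({"title_path": " > ".join(title_path), "text": text})
    return chunks

with open("hierarchy_with_end_lines.json", encoding="utf-8") as f:
    hierarchy = json.load(f)
with open("agreement_numbered.txt", encoding="utf-8") as f:
    lines = f.readlines()

chunks = extract_chunks(hierarchy, lines)
# each chunk dict is now ready to embed, e.g. with text-embedding-3-small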

If I happen to run across a chunk that is still > 600 tokens, then I’m thinking of using @jr.2509 's approach to further sub-chunk: Using gpt-4 API to Semantically Chunk Documents - #25 by jr.2509
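
To spot any chunks that still exceed the limit before deciding to sub-chunk, a quick token count with tiktoken works (this assumes the chunks list from the previous sketch):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the gpt-4 tokenizer
oversized = [c for c in chunks if len(enc.encode(c["text"])) > 600]
print(f"{len(oversized)} chunks still need sub-chunking")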

The cool thing about this is that when I’m all done, I should have chunks which all contain semantically whole ideas, are all under my chunk token maximum, and are connected hierarchically to the larger source document.

Yes, it requires a lot of code, but only one API call to the model.

Sweet!
