Using gpt-4 API to Semantically Chunk Documents

@jr.2509

That is great. I can’t believe how similar it is to my method. In fact, they are almost the same.

I took a few days off and am now getting back to this. I'm generating the semantic “sub-chunks” as I had envisioned. Following your methodology:

  1. For my testing, I use the ABBYY PDF tool to export to text. However, in production, my documents are automatically converted to json text using Apache Solr.

  2. I prepend a linexxxx marker to every line in the exported text file. This is done with code (see the sketch after the JSON example below).

  3. I send this exported text file (with line numbers) to the API with instructions to create a hierarchical json file in this format:

[
  {
    "title": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES",
    "level": 1,
    "token_count": 1046,
    "start_line": "line0001",
    "has_children": "Y",
    "children": [
      {
        "title": "A. GENERAL RULES",
        "level": 2,
        "token_count": 6412,
        "start_line": "line0003",
        "has_children": "Y",
        "children": [
          {
            "title": "1. Parties",
            "level": 3,
            "token_count": 335,
            "start_line": "line0007",
            "has_children": "N",
            "children": []
          },
          {
            "title": "2. Time Limits",
            "level": 3,
            "token_count": 579,
            "start_line": "line0029",
            "has_children": "N",
            "children": []
          },

Note that I also instruct the model to include a segment's children only if the segment exceeds x tokens.
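
Roughly, steps 2 and 3 look like this. This is a simplified sketch in Python using the OpenAI SDK; the file names, model name, and prompt wording here are placeholders, not my exact code:

import json
from openai import OpenAI

# Step 2 (sketch): prefix every line of the exported text with a lineNNNN marker.
# "document.txt" and "document_numbered.txt" are placeholder file names.
with open("document.txt", encoding="utf-8") as f:
    lines = f.readlines()

numbered = [f"line{i:04d} {line}" for i, line in enumerate(lines, start=1)]

with open("document_numbered.txt", "w", encoding="utf-8") as f:
    f.writelines(numbered)

# Step 3 (sketch): ask the model for a hierarchical outline of the numbered text.
# The instructions below paraphrase the idea; they are not the exact prompt.
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4-turbo",  # placeholder model name
    messages=[
        {
            "role": "system",
            "content": (
                "Segment the document into a hierarchy of titled sections. "
                "Return a JSON array where each segment has title, level, "
                "token_count, start_line, has_children, and children. Only "
                "include children when the segment exceeds the token threshold."
            ),
        },
        {"role": "user", "content": "".join(numbered)},
    ],
)
outline = json.loads(response.choices[0].message.content)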

  4. I now insert the end_lines into the json file using code (a sketch of this step follows the excerpt below):
[
    {
        "title": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES",
        "level": 1,
        "token_count": 1046,
        "start_line": "line0001",
        "has_children": "Y",
        "children": [
            {
                "title": "A. GENERAL RULES",
                "level": 2,
                "token_count": 6412,
                "start_line": "line0003",
                "has_children": "Y",
                "children": [
                    {
                        "title": "1. Parties",
                        "level": 3,
                        "token_count": 335,
                        "start_line": "line0007",
                        "has_children": "N",
                        "children": [],
                        "end_line": "line0028"
                    },
                    {
                        "title": "2. Time Limits",
                        "level": 3,
                        "token_count": 579,
                        "start_line": "line0029",
                        "has_children": "N",
                        "children": [],
                        "end_line": "line0042"
                    },
                    {
                        "title": "3. Place of Hearing",
                        "level": 3,
                        "token_count": 340,
                        "start_line": "line0043",
                        "has_children": "N",
                        "children": [],
                        "end_line": "line0050"
                    },
                    {
                        "title": "4. Award",
                        "level": 3,
                        "token_count": 139,
                        "start_line": "line0051",
                        "has_children": "N",
                        "children": [],
                        "end_line": "line0054"
                    },
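
The end_line logic is simple: each leaf segment ends on the line just before the next leaf begins, and the last one runs to the end of the file. Something along these lines (a sketch; the helper names are mine, and it only fills in end_line for leaf segments, as in the excerpt above):

def line_num(marker: str) -> int:
    """Convert a 'lineNNNN' marker back to an integer line number."""
    return int(marker.replace("line", ""))

def collect_leaves(nodes, leaves):
    """Gather leaf segments in document order (depth-first)."""
    for node in nodes:
        if node["has_children"] == "Y" and node["children"]:
            collect_leaves(node["children"], leaves)
        else:
            leaves.append(node)

def add_end_lines(outline, total_lines):
    """Set each leaf's end_line to the line before the next leaf starts."""
    leaves = []
    collect_leaves(outline, leaves)
    for current, following in zip(leaves, leaves[1:]):
        current["end_line"] = f"line{line_num(following['start_line']) - 1:04d}"
    if leaves:
        leaves[-1]["end_line"] = f"line{total_lines:04d}"
    return outline

outline = add_end_lines(outline, len(numbered))
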
  5. Finally, still using code, I extract the chunks into a json array that will be uploaded to the vector store for embedding.
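
For the extraction step, the chunks are just line-range slices of the numbered text. A rough sketch, reusing the helpers from the previous snippet (the chunk fields and output file name are placeholders):

def extract_chunks(outline, numbered_lines):
    """Slice the numbered text into one chunk per leaf segment."""
    leaves = []
    collect_leaves(outline, leaves)
    chunks = []
    for leaf in leaves:
        start = line_num(leaf["start_line"]) - 1
        end = line_num(leaf["end_line"])
        # Drop the lineNNNN prefix that was only needed for the outline pass.
        body = "".join(
            line.split(" ", 1)[1] if " " in line else line
            for line in numbered_lines[start:end]
        )
        chunks.append({"title": leaf["title"], "text": body})
    return chunks

with open("chunks.json", "w", encoding="utf-8") as f:
    json.dump(extract_chunks(outline, numbered), f, indent=2)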

Still going through the final output to make sure everything is working to plan, but so far, so good.

The only problem is that, so far, this only works with documents of 100 pages or less. That is due to model limits: the model refuses to give me a hierarchical output on files much larger than that.

As you can see, our methodologies are almost identical. I am using ABBYY instead of pdfplumber because that’s what I have – but in production my documents will automatically be exported to text files.

I am NOT using the spaCy library. Basically, I'm getting what I'm looking for, so far, without it. I do have some chunks that exceed my x-token limit, so I'll need to figure out what to do there.
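
One simple way to at least flag the oversized ones (a sketch using tiktoken against the chunks from the extraction step; the encoding name and limit are assumptions, and how to split them is still an open question):

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # assumed encoding
MAX_TOKENS = 600  # placeholder for the "x tokens" limit

oversized = [
    chunk for chunk in chunks
    if len(encoding.encode(chunk["text"])) > MAX_TOKENS
]
for chunk in oversized:
    print(f"Over limit: {chunk['title']}")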

Not sure what I’m going to do about footnotes.

Right now, I am running all standalone code just to make sure it works. Once I’m satisfied, I’ll include it in my RAG infrastructure.

Not bad for a couple of weeks' work.

FYI

Here is the input pdf: https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/13_Article_11_1.pdf

This is the final json file: s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/article11-out.json

These are the output chunks (by title): https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/output.txt
