Using gpt-4 API to Semantically Chunk Documents

@jr.2509

That is great. I can’t believe how similar it is to my method. In fact, they are almost the same.

I took a few days off and am now getting back to this. I'm generating the semantic “sub-chunks” as I had envisioned. Following your methodology:

  1. For my testing, I use the ABBYY PDF tool to export to text. However, in production, my documents are automatically converted to json text using Apache Solr.

  2. I prepend a linexxxx marker to every line in the exported text file. This is done with code (see the sketch after the JSON example below).

  3. I send this exported text file (with line numbers) to the API with instructions to create a hierarchical json file in this format:

[
  {
    "title": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES",
    "level": 1,
    "token_count": 1046,
    "start_line": "line0001",
    "has_children": "Y",
    "children": [
      {
        "title": "A. GENERAL RULES",
        "level": 2,
        "token_count": 6412,
        "start_line": "line0003",
        "has_children": "Y",
        "children": [
          {
            "title": "1. Parties",
            "level": 3,
            "token_count": 335,
            "start_line": "line0007",
            "has_children": "N",
            "children": []
          },
          {
            "title": "2. Time Limits",
            "level": 3,
            "token_count": 579,
            "start_line": "line0029",
            "has_children": "N",
            "children": []
          },

Note that I also instruct the model to include a segment's children only if the segment exceeds x tokens.
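
Roughly, steps 2 and 3 look like this. This is a simplified sketch in Python using the OpenAI SDK; the file names, model name, and prompt wording here are placeholders, not my exact code:

import json
from openai import OpenAI

# Step 2 (sketch): prefix every line of the exported text with a lineNNNN marker.
# "document.txt" and "document_numbered.txt" are placeholder file names.
with open("document.txt", encoding="utf-8") as f:
    lines = f.readlines()

numbered = [f"line{i:04d} {line}" for i, line in enumerate(lines, start=1)]

with open("document_numbered.txt", "w", encoding="utf-8") as f:
    f.writelines(numbered)

# Step 3 (sketch): ask the model for a hierarchical outline of the numbered text.
# The instructions below paraphrase the idea; they are not the exact prompt.
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4-turbo",  # placeholder model name
    messages=[
        {
            "role": "system",
            "content": (
                "Segment the document into a hierarchy of titled sections. "
                "Return a JSON array where each segment has title, level, "
                "token_count, start_line, has_children, and children. Only "
                "include children when the segment exceeds the token threshold."
            ),
        },
        {"role": "user", "content": "".join(numbered)},
    ],
)
outline = json.loads(response.choices[0].message.content)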

  4. I now insert the end_lines into the json file using code (a sketch of this step follows the excerpt below):
[
    {
        "title": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES",
        "level": 1,
        "token_count": 1046,
        "start_line": "line0001",
        "has_children": "Y",
        "children": [
            {
                "title": "A. GENERAL RULES",
                "level": 2,
                "token_count": 6412,
                "start_line": "line0003",
                "has_children": "Y",
                "children": [
                    {
                        "title": "1. Parties",
                        "level": 3,
                        "token_count": 335,
                        "start_line": "line0007",
                        "has_children": "N",
                        "children": [],
                        "end_line": "line0028"
                    },
                    {
                        "title": "2. Time Limits",
                        "level": 3,
                        "token_count": 579,
                        "start_line": "line0029",
                        "has_children": "N",
                        "children": [],
                        "end_line": "line0042"
                    },
                    {
                        "title": "3. Place of Hearing",
                        "level": 3,
                        "token_count": 340,
                        "start_line": "line0043",
                        "has_children": "N",
                        "children": [],
                        "end_line": "line0050"
                    },
                    {
                        "title": "4. Award",
                        "level": 3,
                        "token_count": 139,
                        "start_line": "line0051",
                        "has_children": "N",
                        "children": [],
                        "end_line": "line0054"
                    },
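
The end_line logic is simple: each leaf segment ends on the line just before the next leaf begins, and the last one runs to the end of the file. Something along these lines (a sketch; the helper names are mine, and it only fills in end_line for leaf segments, as in the excerpt above):

def line_num(marker: str) -> int:
    """Convert a 'lineNNNN' marker back to an integer line number."""
    return int(marker.replace("line", ""))

def collect_leaves(nodes, leaves):
    """Gather leaf segments in document order (depth-first)."""
    for node in nodes:
        if node["has_children"] == "Y" and node["children"]:
            collect_leaves(node["children"], leaves)
        else:
            leaves.append(node)

def add_end_lines(outline, total_lines):
    """Set each leaf's end_line to the line before the next leaf starts."""
    leaves = []
    collect_leaves(outline, leaves)
    for current, following in zip(leaves, leaves[1:]):
        current["end_line"] = f"line{line_num(following['start_line']) - 1:04d}"
    if leaves:
        leaves[-1]["end_line"] = f"line{total_lines:04d}"
    return outline

outline = add_end_lines(outline, len(numbered))
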
  5. Finally, still using code, I extract the chunks into a json array that will be uploaded to the vector store for embedding.
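
For the extraction step, the chunks are just line-range slices of the numbered text. A rough sketch, reusing the helpers from the previous snippet (the chunk fields and output file name are placeholders):

def extract_chunks(outline, numbered_lines):
    """Slice the numbered text into one chunk per leaf segment."""
    leaves = []
    collect_leaves(outline, leaves)
    chunks = []
    for leaf in leaves:
        start = line_num(leaf["start_line"]) - 1
        end = line_num(leaf["end_line"])
        # Drop the lineNNNN prefix that was only needed for the outline pass.
        body = "".join(
            line.split(" ", 1)[1] if " " in line else line
            for line in numbered_lines[start:end]
        )
        chunks.append({"title": leaf["title"], "text": body})
    return chunks

with open("chunks.json", "w", encoding="utf-8") as f:
    json.dump(extract_chunks(outline, numbered), f, indent=2)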

Still going through the final output to make sure everything is working to plan, but so far, so good.

The only problem is that, so far, this only works with documents of 100 pages or less. That is due to model limits: the model refuses to give me a hierarchical output on files much larger than that.

As you can see, our methodologies are almost identical. I am using ABBYY instead of pdfplumber because that’s what I have – but in production my documents will automatically be exported to text files.

I am NOT using the spaCy library. Basically, I'm getting what I'm looking for, so far, without it. I do have some chunks that exceed my x-token limit, so I'll need to figure out what to do there.
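
One simple way to at least flag the oversized ones (a sketch using tiktoken against the chunks from the extraction step; the encoding name and limit are assumptions, and how to split them is still an open question):

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # assumed encoding
MAX_TOKENS = 600  # placeholder for the "x tokens" limit

oversized = [
    chunk for chunk in chunks
    if len(encoding.encode(chunk["text"])) > MAX_TOKENS
]
for chunk in oversized:
    print(f"Over limit: {chunk['title']}")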

Not sure what I’m going to do about footnotes.

Right now, I am running all standalone code just to make sure it works. Once I’m satisfied, I’ll include it in my RAG infrastructure.

Not bad for a couple of weeks' work.

FYI

Here is the input pdf: https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/13_Article_11_1.pdf

This is the final json file: s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/article11-out.json

These are the output chunks (by title): https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/output.txt
