Using gpt-4 API to Semantically Chunk Documents

See: Using gpt-4 API to Semantically Chunk Documents - #2 by jr.2509

The idea is to make ONE call to the API which will return the list of chunk segments in hierarchical order. Take that list and use regex to retrieve the actual text chunks from the document. The regex uses the first and last few words to identify specific chunks.
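
For anyone who wants to see what that lookup can look like in code, here is a minimal Python sketch, assuming the model returns the first and last few words verbatim. The helper name and file name are placeholders, not the actual implementation from this thread.

import re

def extract_chunk(document: str, first_words: str, last_words: str):
    """Pull the full chunk text out of the source document, given the first
    and last few words the model returned for that chunk."""
    # Escape the boundary phrases so punctuation is treated literally, then
    # allow any amount of text (including newlines) in between, non-greedily.
    pattern = re.escape(first_words) + r"[\s\S]*?" + re.escape(last_words)
    match = re.search(pattern, document)
    return match.group(0) if match else None

# Hypothetical usage with boundaries as the model might return them.
doc = open("agreement.txt", encoding="utf-8").read()
chunk = extract_chunk(doc, "Unless otherwise provided in this", "shall be as follows:")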

4 Likes

3-5 paragraphs is just another use case. Take a document that has many paragraphs that are one-liners. The result would be 3 lines for a segment, and I promise you that you get no answers, or bad ones, if your segments are too short. I read somewhere in a study that 512 tokens is the optimal segment size. I tried that again against plain paragraph semantic bounds, and 512-token chunks proved to be superior…

1 Like

Agreed. That has been my baseline (2500 characters) for several months now.

In the methodology that I am working on, I use that as my max tokens per segment.

But, also keep in mind that in the type of scenarios we’ve been looking at so far (hierarchical segmentation of legal agreements), a segment may consist of just one paragraph, but be a complete semantic idea for that segment. So it can go both ways.

In this case (like a book, news article or blog), which I have not gotten to, you have to be more aware of the semantic ideas presented in each paragraph. @sergeliatko can address this better, but I believe the point he’s been trying to make is that THIS is what his API has been fine-tuned to do.

2 Likes

Very short segments, like a paragraph of 1-2 lines, is an approach that is set up for failure - I tried it and it had problems returning answers at all…

1 Like

Ok, so reporting back from some of my own testing over the past few hours. I started with a version of this approach from above using gpt-4-turbo-2024-04-09.

Applied to my documents, I realized two things: (1) The split was not yet granular enough for my purposes; (2) The model failed to appropriately identify the last words of a chunk in 50%+ of the cases.

This led me to a series of refinements. In my latest iterations I used the following prompt along with the example JSON schema:

Prompt

{
    "model": "gpt-4-turbo-2024-04-09",
    "response_format": { "type": "json_object" },
    "messages": [
        {
            "role": "system",
            "content": "You are tasked with performing a document analysis for the purpose of breaking it down into its logical units. You assume that each document consists of multiple sections with each section in turn consisting of sub-sections, which are further broken down into logical units. Logical units represent the smallest hierarchical unit in a document and consists of multiple sentences that are logically interlinked and convey an information or idea. Each section must be analyzed for its structure individually and may consist of one or multiple sub-sections and logical units. Your output consists of a JSON of the document's outline that follows the logic of the JSON schema provided. You must strictly consider all of the document's content in your analysis. JSON schema: '''JSON_Schema_'''"
        },
        {
            "role": "user",
            "content": "Document_Text"
        }
    ],
    "temperature": 0,
    "max_tokens": 4000,
    "top_p": 1,
    "frequency_penalty": 0,
    "presence_penalty": 0
}

JSON Schema (not exhaustive)

{
    "document_outline": [
        {
            "title": "document title",
            "sections": {
                "section_1": {
                    "section_title": "section title (verbatim)",
                    "sub_sections": {
                        "sub_section_1": {
                            "sub_section_title": "title of sub-section (verbatim, N/A in case of no title)",
                            "logical_units": {
                                "logical_unit_1": "first five words in the logical unit (verbatim)",
                                "logical_unit_2": "first five words in the logical unit (verbatim)",
                                "logical_unit_3": "first five words in the logical unit (verbatim)"
                            }
                        },
                        "sub_section_2": {
                            "sub_section_title": "title of sub-section (verbatim, N/A in case of no title)",
                            "logical_units": {
                                "logical_unit_1": "first five words in the logical unit (verbatim)",
                                "logical_unit_2": "first five words in the logical unit (verbatim)",
                                "logical_unit_3": "first five words in the logical unit (verbatim)"
                            }
                        }
                    }
                },
                "section_2": {
                    "section_title": "section title (verbatim)",
                    "sub_sections": {
                        "sub_section_1": {
                            "sub_section_title": "title of sub-section (verbatim, N/A in case of no title)",
                            "logical_units": {
                                "logical_unit_1": "first five words in the logical unit (verbatim)",
                                "logical_unit_2": "first five words in the logical unit (verbatim)",
                                "logical_unit_3": "first five words in the logical unit (verbatim)"
                            }
                        },
                        "sub_section_2": {
                            "sub_section_title": "title of sub-section (verbatim, N/A in case of no title)",
                            "logical_units": {
                                "logical_unit_1": "first five words in the logical unit (verbatim)",
                                "logical_unit_2": "first five words in the logical unit (verbatim)",
                                "logical_unit_3": "first five words in the logical unit (verbatim)"
                            }
                        }
                    }
                }
            }
        }
    ]
}

Based on a few tests (a short 8-page document and a 28-page document), this has so far yielded the most precise results in terms of the breakdown into logical units, in line with how it would be useful for my purposes.

That said, it is not yet the final solution and more work is required for the analysis to be exhaustive and more accurate. Also, due to the JSON schema it is now overly focused on returning three logical units (despite the system instructions), so I need to account for that.

I have a few things running in parallel but hopefully I can further advance the testing in the coming days.

7 Likes

Ok, but the regex will be based on what input? For example, is the flow like this if the question is as below:

Question: What are producers collectively referred to as?

Is the suggestion now to create a regex from the question and then search it in all the chunks? I think I still have a gap in understanding this approach, because the way I explained it here will not work for several reasons. Maybe the regex part is something different.

Yes, the regex part is something different. It is not part of the query process, it is part of the embedding process. In my example, it is used to create semantic chunks of your document that are embedded in your vector store.

Try Gemini 1.5 Pro.

1 Like

Got it, you are using it to find the contextual chunk from the document.

2 Likes

Exactly. Or, at least, that’s the plan.

Not entirely surprising but it looks like a two-step approach yields significantly better results.

Step 1: Ask the model to identify the document outline down to the most granular hierarchical level.

Step 2: Ask the model to identify the “logical units (i.e. semantic chunks)” based on the outline.

I tested quite a bit earlier today and struggled with getting down to the most granular level for more deeply nested documents. The two-step approach seems to overcome that but I am still tinkering around with it. Will share details once I am comfortable with the approach. Have not yet gotten to trying Gemini 1.5 Pro.
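
As a rough illustration of that two-step flow (these are not the actual prompts from this thread; the prompt texts, model choice and the way the outline is fed back into the second call are placeholders), something along these lines with the OpenAI Python client:

import json
from openai import OpenAI  # assumes the openai Python package (>= 1.0)

client = OpenAI()
MODEL = "gpt-4-turbo-2024-04-09"

# Placeholder prompts; the real instructions are the ones discussed in this thread.
OUTLINE_PROMPT = "Identify the document's outline down to the most granular hierarchical level. Return JSON."
LOGICAL_UNIT_PROMPT = "Using the outline provided, identify the logical units of the document. Return JSON."

def ask_json(system_prompt: str, user_content: str) -> dict:
    """One JSON-mode chat completion call."""
    resp = client.chat.completions.create(
        model=MODEL,
        response_format={"type": "json_object"},
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
        ],
    )
    return json.loads(resp.choices[0].message.content)

document_text = open("document.txt", encoding="utf-8").read()

# Step 1: document outline down to the most granular hierarchical level.
outline = ask_json(OUTLINE_PROMPT, document_text)

# Step 2: logical units (semantic chunks), guided by the outline from step 1.
logical_units = ask_json(LOGICAL_UNIT_PROMPT,
                         json.dumps(outline) + "\n\n" + document_text)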

2 Likes

Struggling with this myself. Even with Gemini 1.5 Pro. It looks like Google is putting artificial limitations on its capacity.

UPDATE: No, I take that back. It appears that I am hitting the 8K token limit on my output. But, Gemini 1.5 Pro is processing the request. I’m still going for this JSON output:


[
  {
    "title": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES",
    "title_path": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES",
    "level": 1,
    "has_children": "Y",
    "token_count": 12674,
    "first_few_words": "A. GENERAL RULES Unless otherwise provided",
    "last_few_words": "status under this Agreement and in the administration of the arbitration processes throughout this Agreement.",
    "children": [
      {
        "title": "A.\tGENERAL RULES",
        "title_path": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES - A.\tGENERAL RULES",
        "level": 2,
        "has_children": "Y",
        "token_count": 8416,
        "first_few_words": "Unless otherwise provided in this Article 11",
        "last_few_words": "extent a \"cross-claim\" may lie, the provisions of this Article 11.A.11. also shall apply.",
        "children": [
          {
            "title": "1.\tParties",
            "title_path": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES - A.\tGENERAL RULES - 1.\tParties",
            "level": 3,
            "has_children": "Y",
            "token_count": 533,
            "first_few_words": "a.\tIn any grievance or arbitration concerning",
            "last_few_words": "they are applicable to disputes involving employed writers.",
            "children": [
              {
                "title": "a.",
                "title_path": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES - A.\tGENERAL RULES - 1.\tParties - a.",
                "level": 4,
                "has_children": "N",
                "token_count": 147,
                "first_few_words": "In any grievance or arbitration concerning any",
                "last_few_words": "The claim shall be initiated by the Guild on behalf of the writer and the loan-out company, if any."
              },
              {
                "title": "b.",
                "title_path": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES - A.\tGENERAL RULES - 1.\tParties - b.",
                "level": 4,
                "has_children": "N",
                "token_count": 23,
                "first_few_words": "Except as provided in subparagraph a. above,",
                "last_few_words": "shall be parties."
              },
              {
                "title": "d.",
                "title_path": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES - A.\tGENERAL RULES - 1.\tParties - d.",
                "level": 4,
                "has_children": "N",
                "token_count": 86,
                "first_few_words": "The party commencing a claim in grievance or",
                "last_few_words": "Use of such terms in the singular shall be deemed to include the plural."
              },

But, it appears that the token limit isn’t going to allow me to do it for the entire document. And this one is only 28 pages. Need another approach.

Please share your results when you get there.

1 Like

I am now trying one technique. After creating the chunks, I use an LLM to create a set of questions that can be answered from the chunks. I then embed the questions instead of the chunks. A new question is matched against the question embeddings. I think this will give a better embedding match, because the questions derived from a chunk are smaller concepts of that chunk and so will match the incoming question more closely.
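
If it helps, here is a hedged Python sketch of that idea: generate a few questions per chunk, embed the questions, and keep a pointer back to the chunk that answers them. The prompt wording, model names and the stand-in chunk list are assumptions for illustration only.

import json
from openai import OpenAI

client = OpenAI()

def questions_for_chunk(chunk: str) -> list[str]:
    """Ask the model for a few questions that the chunk can answer."""
    resp = client.chat.completions.create(
        model="gpt-4-turbo-2024-04-09",
        response_format={"type": "json_object"},
        temperature=0,
        messages=[
            {"role": "system",
             "content": 'Return JSON like {"questions": ["..."]} with 3-5 questions '
                        "that can be answered from the user's text."},
            {"role": "user", "content": chunk},
        ],
    )
    return json.loads(resp.choices[0].message.content)["questions"]

def embed(texts: list[str]) -> list[list[float]]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [item.embedding for item in resp.data]

# Stand-in for the real chunks produced by whatever chunking step you use.
chunks = ["First chunk text ...", "Second chunk text ..."]

index = []
for chunk_id, chunk in enumerate(chunks):
    questions = questions_for_chunk(chunk)
    for question, vector in zip(questions, embed(questions)):
        # Embed the question, but keep a pointer back to the chunk that answers it.
        index.append({"chunk_id": chunk_id, "question": question, "embedding": vector})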

3 Likes

I like that. Very logical. But, I come from the “always use original text” school of thought. So, my biggest hurdle was being able to accurately identify the chunks in the hierarchy. I think I’ve done it.

JSON Output with Line Numbers (First Few Segments):

[
  {
    "title": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES",
    "title_path": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES",
    "level": 1,
    "has_children": "Y",
    "token_count": 3431,
    "character_count": 18413, 
    "start_line": 1,
    "end_line": 193 
  },
  {
    "title": "A. GENERAL RULES",
    "title_path": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES - A. GENERAL RULES",
    "level": 2,
    "has_children": "Y",
    "token_count": 3431,
    "character_count": 18413,
    "start_line": 3,
    "end_line": 193 
  },
  {
    "title": "1. Parties",
    "title_path": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES - A. GENERAL RULES - 1. Parties",
    "level": 3,
    "has_children": "Y",
    "token_count": 234,
    "character_count": 1239, 
    "start_line": 9,
    "end_line": 21 
  },
  {
    "title": "a. In any grievance or arbitration concerning",
    "title_path": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES - A. GENERAL RULES - 1. Parties - a. In any grievance or arbitration concerning", 
    "level": 4, 
    "has_children": "N",
    "token_count": 32,
    "character_count": 171, 
    "start_line": 11, 
    "end_line": 12 
  },
  {
    "title": "b. Except as provided in subparagraph a. above, only",
    "title_path": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES - A. GENERAL RULES - 1. Parties - b. Except as provided in subparagraph a. above, only", 
    "level": 4, 
    "has_children": "N",
    "token_count": 10,
    "character_count": 64, 
    "start_line": 14,
    "end_line": 14 
  },
  { 
    "title": "d. The party commencing a claim in grievance or",
    "title_path": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES - A. GENERAL RULES - 1. Parties - d. The party commencing a claim in grievance or",
    "level": 4,
    "has_children": "N", 
    "token_count": 18,
    "character_count": 97, 
    "start_line": 17,
    "end_line": 18
  } 
  // ... (and so on for other segments) ...
]


This output includes the first few segments of the 13_Article_11_1.txt document with the following properties:

  • title: The segment title, including preceding numbering/lettering.

  • title_path: Full hierarchical path to the segment.

  • level: Level in the hierarchy.

  • has_children: Whether the segment has child segments.

  • token_count: Approximate number of tokens in the segment.

  • character_count: Number of characters in the segment.

  • start_line: Line number where the segment begins.

  • end_line: Line number where the segment ends.

Please note that the token and character counts are approximate and may vary slightly depending on the specific tokenization and counting methods used.


User

This does not work. You are not accurately reporting the line numbers. But, I have another idea.

Attached: article13lines.txt (21908 tokens)

This file contains the same document, only with linenoxxxx: prepended to each line. Now, you can report back the beginning and ending lineno values for each segment instead of trying to calculate the actual line numbers. Run another test so I can see what you come back with. Remember that in your hierarchy evaluation, you ignore “linenoxxxx:” You only use it to report back the actual begin and end line numbers for each segment.

Model

Addressing Line Number Issues and New Approach:

You’re absolutely right; I apologize for the inaccuracies in reporting line numbers previously.

The new approach you suggest, using a file with prepended line markers (“linenoxxxx:”), is a brilliant solution! This will allow me to accurately identify the starting and ending lines of each segment without relying on potentially error-prone line counting methods.

Here’s how I will adapt the process:

  1. Process the File with Line Markers: I will analyze the provided file article13lines.txt and extract the hierarchical structure as before, ignoring the “linenoxxxx:” prefixes.

  2. Extract Line Numbers: For each segment, I will extract the “linenoxxxx:” values from the start and end lines of the segment within the file.

  3. Include Line Numbers in JSON: The extracted line number values will be included in the JSON output as the “start_line” and “end_line” properties, respectively.

Here’s an example of the expected JSON output for the first few segments using this approach:

[
  {
    "title": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES",
    "title_path": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES",
    "level": 1,
    "has_children": "Y", 
    "token_count": 3431,
    "character_count": 18413,
    "start_line": "line0001", 
    "end_line": "line0193"
  },
  {
    "title": "A. GENERAL RULES",
    "title_path": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES - A. GENERAL RULES",
    "level": 2,
    "has_children": "Y",
    "token_count": 3431,
    "character_count": 18413,
    "start_line": "line0003", 
    "end_line": "line0193"
  }, 
  // ... (and so on for other segments) ... 
]


I will now process the article13lines.txt file and provide you with the JSON output containing the accurate line number information.

So, for me, it is now a four-step process:

  1. export the pdf (or whatever) document to txt.
  2. run a php program to prepend linenoxxxx:
  3. send this file to model along with instructions to create hierarchy json
  4. run a php program on model output to create the chunks.

While that may seem like a lot, it’s only one API call and can all be automated with code. And, it is the MOST efficient and accurate way to extract actual chunks from the actual document (we are talking LARGE documents here) that anyone has come up with so far.
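
The line-numbering step above is done in PHP; for illustration, here is an equivalent sketch in Python. The zero-padded "line0001:" format matches the samples shown later in this thread, and the file names are placeholders.

# Prepend a line marker to every line of the exported text file so the model
# can simply report line numbers instead of trying to compute them.
def number_lines(src_path: str, dst_path: str) -> None:
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for i, line in enumerate(src, start=1):
            dst.write(f"line{i:04d}:{line}")

number_lines("article11.txt", "article11_numbered.txt")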

I am now going to test this out.

3 Likes

OK, I have made considerable progress over the past few days. It hasn’t been easy. While the prompt I created for the API call returned the information I wanted, the output was too large for gpt-4-turbo-preview or gemini 1.5 pro. So, I had to modify my steps:

  1. export the pdf (or whatever) document to txt.
  2. run code to prepend linenoxxxx:
  3. send this numbered file to model along with instructions to create hierarchy json file
  4. process this file with code to add end_line numbers and output that json file.
  5. run code on json output to create the chunks.

So, here’s a little more detail on each step:

  1. I export the source file (usually a pdf) to .txt format.

  2. I place line numbers in the file because none of the models seem to be able to identify what line they are on, even in text files:

line0001:ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES
line0002:
line0003:A. GENERAL RULES
line0004:
line0005:Unless otherwise provided in this Article 11 or elsewhere in this Basic Agreement, the rules and procedures for grievance and arbitration shall be as follows:
line0006:
line0007:1. Parties
line0008:

  3. Now for the model magic: I have created a prompt that will create a semantic hierarchy of the file, giving me the following information:
[
  {
    "title": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES",
    "level": 1,
    "token_count": 2878,
    "start_line": "line0001",
    "has_children": "Y",
    "children": [
      {
        "title": "A. GENERAL RULES",
        "level": 2,
        "token_count": 1606,
        "start_line": "line0003",
        "has_children": "Y",
        "children": [
          {
            "title": "1. Parties",
            "level": 3,
            "token_count": 170,
            "start_line": "line0007",
            "has_children": "N",
            "children": []
          },

My prompt also instructs the model to only include children segments if the parent segment is > 600 tokens (about 2500-3000 characters). 600 tokens is my preferred chunk size, but it could be set to whatever one wishes.

  4. Next, I run code to add the end line numbers. I originally wanted the model to do this, but this is something else they proved to be particularly bad at. So now, I have each chunk of the document hierarchically identified with exact start and end line numbers for the chunk text. My original goal for this entire adventure:
[
    {
        "title": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES",
        "level": 1,
        "token_count": 2878,
        "start_line": "line0001",
        "has_children": "Y",
        "children": [
            {
                "title": "A. GENERAL RULES",
                "level": 2,
                "token_count": 1606,
                "start_line": "line0003",
                "has_children": "Y",
                "children": [
                    {
                        "title": "1. Parties",
                        "level": 3,
                        "token_count": 170,
                        "start_line": "line0007",
                        "has_children": "N",
                        "children": [],
                        "end_line": "line0028"
                    },
                    {
                        "title": "2. Time Limits",
                        "level": 3,
                        "token_count": 283,
                        "start_line": "line0029",
                        "has_children": "N",
                        "children": [],
                        "end_line": "line0042"
                    },
                    {
                        "title": "3. Place of Hearing",
                        "level": 3,
                        "token_count": 136,
                        "start_line": "line0043",
                        "has_children": "N",
                        "children": [],
                        "end_line": "line0050"
                    },
  5. Now, I only need to extract and embed the chunks. Since I now know the start and end line, this should be pretty simple. By the way, this was also @jr.2509's idea.

This code should also contain the title path for each chunk to indicate where it belongs in the overall hierarchy. In fact, I should actually do that in Step 4.

If I happen to run across one that is still > 600 tokens, then I’m thinking of using @jr.2509's approach to further sub-chunk: Using gpt-4 API to Semantically Chunk Documents - #25 by jr.2509
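
One way to flag those oversized chunks before deciding whether to sub-chunk them is a quick token count with tiktoken. A minimal sketch, assuming each chunk is a dict holding its extracted text:

import tiktoken  # cl100k_base is the encoding used by the GPT-4 family

enc = tiktoken.get_encoding("cl100k_base")
MAX_TOKENS = 600  # preferred chunk size mentioned above

def oversized(chunks: list[dict]) -> list[dict]:
    """Flag chunks whose extracted text still exceeds the token budget,
    so they can be sent through a further sub-chunking pass."""
    return [c for c in chunks if len(enc.encode(c["text"])) > MAX_TOKENS]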

The cool thing about this is that when I’m all done, I should have chunks which all contain semantically whole ideas, all fall under my chunk token maximum, and all connect hierarchically to the larger source document.

Yes, it requires a lot of code, but only one API call to the model.

Sweet!

2 Likes

Have you considered using the SpaCy Sentencizer to create an indexed dictionary of sentences?

2 Likes

No, because of my concern which @Securigy alluded to here: Using gpt-4 API to Semantically Chunk Documents - #24 by Securigy

I might be integrating the SpaCy sentencizer as an intermediate step into my latest approach. Currently sitting over my script and trying to put all the pieces together. The devil is in the detail and I have been trying out quite a few different approaches over the past 24h. Hopefully I have an update later today.

1 Like

Alright, here’s an interim update on where I’ve landed.

In my latest script I now use the following sequence of steps, some of which were inspired by @SomebodySysop’s approach.

  1. I extract the text from the PDF using pdfplumber, which yielded superior results compared to the other libraries I tried. As part of this I crop the document so that footers and page numbers are not extracted, and exclude other footnotes from the extraction, as they would interfere with the definition of the logical units. That said, I still need to find a way to re-incorporate the footnotes with critical information.

  2. I then use the SpaCy library to identify individual sentences, extract them and save them into a JSONL file, with each line representing one sentence (or section title). I prepend every sentence/title with a unique line number (see the sketch after this list). This was heavily inspired by @SomebodySysop’s approach. Indeed, as I will come back to in the next steps, extracting the logical units based on the identified few words proved difficult, resulting in omissions and errors, so I used the line-based approach as well.

  3. I apply OpenAI’s GPT-4-turbo model to identify the document outline based on the JSONL file with the individual sentences and ask it to return a JSON with the outline based on a defined JSON schema.

  4. I subsequently run another API call to ask the model to identify the logical units for each section of the document’s identified outline and again return the information in a JSON file based on a defined schema. After lots of testing showed that extracting logical units based on the first few words of each unit was difficult and unreliable in practice, I also reverted to asking the model to simply identify the unique line numbers where a logical unit starts and ends and to include that information in the JSON file.

  5. Using the JSON file from the previous step, which includes the location of each logical unit, I again apply code to perform the extraction of the logical units.
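
For reference, a condensed Python sketch of steps 1 and 2 along the lines described above. The crop box, file names and JSONL field names are assumptions for illustration, not the exact code used here.

import json

import pdfplumber  # PDF text extraction
import spacy       # rule-based sentence segmentation

# Step 1: extract the text, cropping each page so footers / page numbers are
# not picked up. Keeping the top 90% of the page is an assumption; the real
# crop box depends on the document's layout.
pages = []
with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        body = page.crop((0, 0, page.width, page.height * 0.9))
        pages.append(body.extract_text() or "")
full_text = "\n".join(pages)

# Step 2: split into sentences with SpaCy's sentencizer and write a JSONL
# file, one uniquely numbered sentence (or title) per line.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
with open("sentences.jsonl", "w", encoding="utf-8") as out:
    for i, sent in enumerate(nlp(full_text).sents, start=1):
        out.write(json.dumps({"line": f"line{i:04d}", "text": sent.text.strip()}) + "\n")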

As an example: For the 12 page document I have been using for my tests I end up with 26 logical units under this approach, which is in line with my expectations.

I will note that while for smaller documents it may be possible to combine steps 4 and 5 into a single API call, I plan to maintain the two separate steps. This will be particularly relevant for longer documents where I intend to split the process for identifying logical units by document section to ensure reliable results.

As indicated above, besides further testing, one critical point is how to deal with definitions / explanations in the footnotes. I have a few early ideas but I want to give this part a bit more thought.

Once I have refined and cleaned up my code a bit further in the coming days, I’ll add it here for reference.


It’s been an interesting journey but it looks like we are one step closer towards automatic semantic chunking :slight_smile:


Update to the above

I have now built in a few additional steps, notably:

  1. In the definition of the document outline and detailed identification of logical units, I added the identification of the content’s category. Currently, it is still highly simplified and just distinguishes between Title, Executive Summary, Introduction, Definition, Main body. As I progress the work on this, my objective is (a) to further tailor the categories based on the nature of the document and (b) to adjust the approach for the identification of the logical units (semantic chunks) based on the content category. This would be achieved by adjusting the second API call.

  2. Added a step to extract footnotes, which are consolidated and for the time being treated as one segment in the final output. I am looking to further refine how I treat footnotes.

  3. Added a step to create embeddings for each logical unit.
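
A minimal sketch of what that embedding step could look like, assuming each logical unit is a dict with its title path and text. The embedding model and the idea of prepending the title path for context are my own assumptions, not necessarily what is done here.

from openai import OpenAI

client = OpenAI()

def embed_logical_units(units: list[dict]) -> list[dict]:
    """Attach an embedding to every logical unit.
    Each unit is assumed to be {"title_path": ..., "text": ...}."""
    resp = client.embeddings.create(
        model="text-embedding-3-small",  # model choice is an assumption
        # Prepending the title path gives the embedding some hierarchical context.
        input=[f'{u["title_path"]}\n{u["text"]}' for u in units],
    )
    for unit, item in zip(units, resp.data):
        unit["embedding"] = item.embedding
    return units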

Still outstanding at this point is the treatment of longer documents. All my tests so far were based on documents of <30 pages.

4 Likes

@jr.2509

That is great. I can’t believe how similar it is to my method. In fact, they are almost the same.

I took a few days off and am now getting back to this. I am now generating the semantic “sub-chunks” as I had envisioned. Following your methodology:

  1. For my testing, I use ABBYY PDF tool to export to text. However, in production, my documents are automatically converted to json text using Apache Solr.

  2. I prepend linexxxx to every line in the exported text file. This is done with code.

  3. I send this exported text file (with line numbers) to the API with instructions to create a hierarchical json file in this format:

[
  {
    "title": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES",
    "level": 1,
    "token_count": 1046,
    "start_line": "line0001",
    "has_children": "Y",
    "children": [
      {
        "title": "A. GENERAL RULES",
        "level": 2,
        "token_count": 6412,
        "start_line": "line0003",
        "has_children": "Y",
        "children": [
          {
            "title": "1. Parties",
            "level": 3,
            "token_count": 335,
            "start_line": "line0007",
            "has_children": "N",
            "children": []
          },
          {
            "title": "2. Time Limits",
            "level": 3,
            "token_count": 579,
            "start_line": "line0029",
            "has_children": "N",
            "children": []
          },

Note that I also instruct the model to only include segment children if the segment exceeds x tokens.

  4. I now insert the end_lines into the json file using code (see the sketch after this list):
[
    {
        "title": "ARTICLE 11 - GRIEVANCE AND ARBITRATION RULES AND PROCEDURES",
        "level": 1,
        "token_count": 1046,
        "start_line": "line0001",
        "has_children": "Y",
        "children": [
            {
                "title": "A. GENERAL RULES",
                "level": 2,
                "token_count": 6412,
                "start_line": "line0003",
                "has_children": "Y",
                "children": [
                    {
                        "title": "1. Parties",
                        "level": 3,
                        "token_count": 335,
                        "start_line": "line0007",
                        "has_children": "N",
                        "children": [],
                        "end_line": "line0028"
                    },
                    {
                        "title": "2. Time Limits",
                        "level": 3,
                        "token_count": 579,
                        "start_line": "line0029",
                        "has_children": "N",
                        "children": [],
                        "end_line": "line0042"
                    },
                    {
                        "title": "3. Place of Hearing",
                        "level": 3,
                        "token_count": 340,
                        "start_line": "line0043",
                        "has_children": "N",
                        "children": [],
                        "end_line": "line0050"
                    },
                    {
                        "title": "4. Award",
                        "level": 3,
                        "token_count": 139,
                        "start_line": "line0051",
                        "has_children": "N",
                        "children": [],
                        "end_line": "line0054"
                    },
  5. Finally, still using code, I extract the chunks into a json array that will be uploaded to the vector store for embedding.
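
Since the thread doesn't show the code for these last two steps, here is a hedged Python sketch of one way they could work. It assumes each segment ends on the line just before the next segment starts (which matches the sample output above), and the file names are placeholders.

import json

def line_no(tag: str) -> int:
    """'line0007' -> 7 (the marker convention from the numbered text file)."""
    return int(tag.removeprefix("line"))

def line_tag(n: int) -> str:
    return f"line{n:04d}"

def fill_end_lines(nodes: list[dict], next_start: int) -> None:
    """Each segment is assumed to end on the line just before the next segment
    (next sibling, or the parent's follower for the last child) starts."""
    for i, node in enumerate(nodes):
        following = line_no(nodes[i + 1]["start_line"]) if i + 1 < len(nodes) else next_start
        node["end_line"] = line_tag(following - 1)
        if node.get("children"):
            fill_end_lines(node["children"], following)

def extract_chunks(nodes: list[dict], lines: dict[str, str], path: str = "") -> list[dict]:
    """Pull the text for every leaf segment out of the numbered file."""
    chunks = []
    for node in nodes:
        title_path = f"{path} - {node['title']}" if path else node["title"]
        if node.get("children"):
            chunks += extract_chunks(node["children"], lines, title_path)
        else:
            start, end = line_no(node["start_line"]), line_no(node["end_line"])
            text = "\n".join(lines[line_tag(n)] for n in range(start, end + 1) if line_tag(n) in lines)
            chunks.append({"title_path": title_path, "text": text})
    return chunks

# lines maps "line0001" -> the raw line text with the marker stripped.
with open("article11_numbered.txt", encoding="utf-8") as f:
    lines = dict(raw.rstrip("\n").split(":", 1) for raw in f if ":" in raw)

tree = json.load(open("article11_hierarchy.json", encoding="utf-8"))
fill_end_lines(tree, next_start=len(lines) + 1)
chunks = extract_chunks(tree, lines)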

Still going through the final output to make sure everything is working to plan, but so far, so good.

The only problem is that this only works, so far, with documents that are 100 pages or less. And that is due to model restrictions that refuse to give me a hierarchical output on files much larger than that.

As you can see, our methodologies are almost identical. I am using ABBYY instead of pdfplumber because that’s what I have – but in production my documents will automatically be exported to text files.

I am NOT using the SpaCy library. Basically I’m getting what I am looking for, so far, without it. I do have some chunks that are exceeding my x tokens limit, so I’ll need to figure out what to do there.

Not sure what I’m going to do about footnotes.

Right now, I am running all standalone code just to make sure it works. Once I’m satisfied, I’ll include it in my RAG infrastructure.

Not bad for a couple weeks work.

FYI

Here is the input pdf: https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/13_Article_11_1.pdf

This is the final json file: s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/article11-out.json

These are the output chunks (by title): https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/output.txt

4 Likes