Using gpt-4 API to Semantically Chunk Documents

I am creating this topic based upon the thread of conversations which developed from this post: RAG is not really a solution - #43 by SomebodySysop

While a number of us have different approaches to the first tier of “Semantic Chunking”, like @sergeliatko 's API: RAG is not really a solution - #50 by sergeliatko, we were all still looking for a way to have the LLM do that last bit of “sub-chunking”, where, instead of simply chopping the chunk up by size, we have the model look at it and determine the best way to divide it based upon semantic relevance.

@jr.2509 , @sergeliatko , @joyasree78 have all been working on approaches, but as I’ve started looking at this, I’m seeing what some of the real issues are.

So here is my latest idea to create a methodology to have the gpt-4 API assist with semantic chunking of documents. The concept is simple: I send it a text document (like a contract or other agreement) with instructions to create a hierarchical outline. Once I get the outline, I then send it back to the model with instructions to “chunk” the document according to each segment of the outline. Basically, it will return the segment of text represented by each outline element.

The execution, however, is a bit more complicated. First, we have the 4K token context output limit to deal with, so no segment can exceed that. Secondly, this is going to require multiple calls where the model must examine the full text and return the next segment from the outline.

My question is: Can I upload the source text document once, and then have gpt-4 reference it with each successive API call? If that is possible, this might just work. If I have to upload the source text file with each successive API call, that will be prohibitively expensive. Suggestions?

2 Likes

There’s a very very nascent idea that I have been toying with in my mind over the past few days. What if we could just get the model to return the boundaries of the semantic chunk, i.e. the first few and last few words that would make the chunk uniquely identifiable.

With that information you could then likely just apply a regular script to extract the actual text of the chunks. If that were possible, then a single call, or at least a reduced number of API calls, might be enough and would thus save time and cost.
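If that works, the extraction script could be plain string/regex matching. A minimal sketch in Python (the function name, word counts and whitespace handling are assumptions, not a tested implementation):

import re

def extract_chunk(document: str, first_words: str, last_words: str):
    """Return the text between (and including) the given boundary phrases, or None if not found."""
    def loose(words: str) -> str:
        # Tolerate arbitrary whitespace/line breaks between the boundary words
        return r"\s+".join(re.escape(w) for w in words.split())

    pattern = re.compile(loose(first_words) + r".*?" + loose(last_words), re.DOTALL)
    match = pattern.search(document)
    return match.group(0) if match else None

# e.g. extract_chunk(full_text, "The Union is recognized", 'not considered "performers."')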

1 Like

I had the exact same idea. It was my last-ditch effort after the model said it couldn’t give me exact character positions (first idea) and couldn’t give me line numbers (second idea).

This is what I just got from Gemini 1.5 Pro:

Providing First and Last Words of Listed Segments:

Yes, that’s another excellent idea! Given a hierarchical list of segments, I can analyze the document and provide the first and last few words of the text within each listed segment. This would give you a preview of the segment’s content and help you identify the specific sections you need to extract or process further.

Here’s how we can implement this:

1. You Provide the Hierarchical List:

  • You would give me a structured list of the segments you’re interested in. This list could be in various formats, such as:

    • Outline-style: Using Roman numerals, numbers, and letters to represent the hierarchy (e.g., “I. A. 1.”).

    • Title-based: Listing the titles of the sections and subsections (e.g., “Recognition”, “Scope”).

    • Combined: Combining both the outline structure and titles for clarity.

2. I Analyze the Document:

  • I would use the provided hierarchical list to identify the corresponding segments within the document.

  • For each segment, I would extract the first and last few words (e.g., the first 5 and last 5 words) of the segment’s text.

3. I Return the Results:

  • I would provide you with a list containing the extracted first and last words for each segment you specified. This list could be in a format like this:
Segment: I. A. 1. Recognition
First Words: The Union is recognized...
Last Words: ...not considered "performers." 

Segment: I. A. 2. Scope 
First Words: When Producer has its base...
Last Words: ...apply to background actors.


Advantages of this Approach:

  • Efficiency: This method can be more efficient than extracting entire segments, especially for large documents, as it only requires processing a small portion of each segment’s text.

  • Preview and Identification: The extracted first and last words give you a quick preview of the segment’s content, helping you determine if it’s the section you’re looking for.

  • Flexibility: You can easily specify which segments you’re interested in, allowing for targeted extraction or processing.

Potential Challenges:

  • Segment Boundary Ambiguity: As mentioned before, depending on the document’s structure, there might be ambiguities in identifying precise segment boundaries. This could affect the accuracy of the extracted first and last words.

  • Contextual Understanding: Extracting only a few words might not provide enough context to fully understand the segment’s content or purpose.

We can discuss and refine the details of this approach, such as the number of words to extract or the format of the output, to best suit your needs and the specific characteristics of your document.

I’ll give it a try with gpt-4 tomorrow. Great idea!

1 Like

Nice! This looks promising and I will also try to find some time over the weekend to run some preliminary tests. Excited to see where we land with this little thread.

This is where I failed about a year and a half ago: the document size is unknown and there is no guarantee it will fit into the attention window for a model to generate a valid outline (especially with parent-child relationships)…

In my approach, the outline is the output of the semantic chunking process. And it is generated by simple code, no AI involved.

As I said in the thread you cited, you first need to analyse the way humans understand the text and build your process accordingly, splitting the “comprehension” into a bunch of simple subtasks that can be performed in parallel on randomly cut pieces of text, plus classic code to join the pieces together when needed.

1 Like

The pipeline we built at LAWXER has:

  • proprietary OpenAI API client supporting async and parallel requests in batches
  • long tasks processing server with async API to handle document processing/results reposting
  • a bunch (12) fine-tuned models to perform “comprehension” module subtasks
  • app logic to orchestrate the models and intermittent results processing
  • OCR service
  • regex-based raw text splitter to chop unformatted strings of characters (with random line breaks) into limited-length strings ending on paragraph edges.
  • … Some minor text tools…

Close to 50k lines of code were written to make the thing run. The models are trained mostly on legal documents, but there is no issue retraining on general texts or a specific domain. However, results are still shaky with tables where data pieces are small and some cells are merged.

As for API calls, there are about 15 steps, each having a batch of requests called in parallel.
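For illustration only (this is not the LAWXER code, and the length limit is an arbitrary assumption), a regex-based splitter of the kind listed above could look roughly like this:

import re

def split_on_paragraph_edges(raw: str, max_chars: int = 4000) -> list:
    """Cut raw text into pieces no longer than max_chars, breaking only on paragraph edges."""
    # One or more blank lines are treated as paragraph boundaries
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", raw) if p.strip()]
    pieces, current = [], ""
    for p in paragraphs:
        candidate = (current + "\n\n" + p) if current else p
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                pieces.append(current)
            current = p  # an over-long single paragraph stays in one piece
    if current:
        pieces.append(current)
    return pieces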

1 Like

OK, this is what I have so far:

Here is a document I am testing with:

https://labor.booksai.org/sites/labor.booksai.org/files/002_SAGAFTRA%20General%20Provisions_Part1_legal_0.pdf – note the 1. 2. 3. Etc. sections

This document is 64,707 tokens, 298,785 characters and 139 pages.

I export the source document to a .txt file and send it to the model.

Prompt:

Please give me an outline of the hierarchical structure of this document.

Return a json array of the outline with each outline segment formatted like this:

title (actual title - include preceding letter, number or roman numeral if applicable)
title path (the hierarchical title path = level 1 title - level 2 title - etc… including this title)
token count (total tokens for this segment)
first few words (first few words that begin the segment - this should be at least 5)
last few words (the last few words of the segment - this should be at least 5)
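As an illustration only, the call itself can be a single chat completion with the exported text as the user message; a minimal sketch with the OpenAI Python SDK (the model name and message layout below are assumptions, not necessarily what was actually run):

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

document_text = open("agreement.txt", encoding="utf-8").read()
outline_prompt = (
    "Please give me an outline of the hierarchical structure of this document. "
    "Return a json array with each segment's title, title path, token count, "
    "first few words, and last few words."
)

response = client.chat.completions.create(
    model="gpt-4-turbo",  # placeholder; needs a context window large enough for the document
    messages=[
        {"role": "system", "content": outline_prompt},
        {"role": "user", "content": document_text},
    ],
)
outline = json.loads(response.choices[0].message.content)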

I asked it to just analyze the first few lines:

[
  {
    "title": "Preamble",
    "title_path": "Preamble",
    "token_count": 23,
    "first_few_words": "PRODUCER – SAG-AFTRA CODIFIED BASIC AGREEMENT",
    "last_few_words": "referred to as \"Producers.\""
  },
  {
    "title": "Witness Eth",
    "title_path": "Witness Eth",
    "token_count": 14,
    "first_few_words": "In consideration of the mutual agreements",
    "last_few_words": "agreed as follows:"
  },
  {
    "title": "General Provisions",
    "title_path": "General Provisions",
    "token_count": 15432, 
    "first_few_words": "1. RECOGNITION AND SCOPE OF AGREEMENT",
    "last_few_words": "standards at all studios with respect to the employment of performers;" 
  },
  {
    "title": "Recognition and Scope of Agreement",
    "title_path": "General Provisions - Recognition and Scope of Agreement",
    "token_count": 438, 
    "first_few_words": "A. Recognition The Union is recognized",
    "last_few_words": "provisions of Schedule X, Part II shall apply to background actors employed in the New York Zone."
  },
  {
    "title": "Recognition",
    "title_path": "General Provisions - Recognition and Scope of Agreement - Recognition",
    "token_count": 287, 
    "first_few_words": "The Union is recognized by Producer as",
    "last_few_words": "Background actors are not considered \"performers.\""
  },
  {
    "title": "Scope",
    "title_path": "General Provisions - Recognition and Scope of Agreement - Scope",
    "token_count": 151,
    "first_few_words": "(1) When Producer has its base of",
    "last_few_words": "of Schedule X, Part I shall apply to the employment of background actors in the Los Angeles, Las Vegas, Sacramento, San Diego, San Francisco and Hawaii Zones."
  },
  {
    "title": "Union Security",
    "title_path": "General Provisions - Union Security",
    "token_count": 785,
    "first_few_words": "A. Every performer hereafter employed by",
    "last_few_words": "to arbitration between the Union and the Producer under Section 9 of these General Provisions."
  },
  {
    "title": "Strikes",
    "title_path": "General Provisions - Strikes",
    "token_count": 1406, 
    "first_few_words": "A. No-Strike Clause The Union agrees",
    "last_few_words": "action."
  }, 
  {
    "title": "No-Strike Clause",
    "title_path": "General Provisions - Strikes - No-Strike Clause",
    "token_count": 31,
    "first_few_words": "The Union agrees that, during the",
    "last_few_words": "signatory hereto."
  },
  {
    "title": "Rights and Duties of Union Members and Producers",
    "title_path": "General Provisions - Strikes - Rights and Duties of Union Members and Producers",
    "token_count": 1375,
    "first_few_words": "(1) If, after the expiration or other",
    "last_few_words": "Producer and Union."
  }
]


Explanation:

  • The output is a JSON array containing objects, each representing a segment from the hierarchical outline.

  • Each object has the following properties:

    • “title”: The actual title of the segment.

    • “title_path”: The hierarchical path to the segment, including the titles of all parent sections.

    • “token_count”: The total number of tokens in the segment’s text.

    • “first_few_words”: The first few words of the segment’s text.

    • “last_few_words”: The last few words of the segment’s text.

Please note that the token count is an approximation and might vary slightly depending on the specific tokenization method used.
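(If exact counts matter, the model’s estimates can be checked locally with tiktoken; a quick sketch, assuming the gpt-4 tokenizer:)

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
segment_text = "The Union is recognized by Producer as ..."
print(len(enc.encode(segment_text)))  # exact token count for this segment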

This test demonstrates the feasibility of the approach and provides a structured way to access information about each segment, including its title, location in the hierarchy, size, and a preview of its content.

After a few adjustments, this is now what I am getting:

[
  {
    "title": "I. Preamble",
    "title_path": "I. Preamble",
    "token_count": 23,
    "first_few_words": "PRODUCER – SAG-AFTRA CODIFIED BASIC AGREEMENT",
    "last_few_words": "\"Producer\" and collectively referred to as \"Producers.\""
  },
  {
    "title": "II. Witness Eth",
    "title_path": "II. Witness Eth",
    "token_count": 14,
    "first_few_words": "In consideration of the mutual agreements",
    "last_few_words": "hereinafter contained, it is agreed as follows:"
  },
  {
    "title": "III. General Provisions",
    "title_path": "III. General Provisions",
    "token_count": 15432, 
    "first_few_words": "1. RECOGNITION AND SCOPE OF AGREEMENT",
    "last_few_words": "to coding; It shall study and review the appropriateness of the Section relating to per diem rates;"
  },
  {
    "title": "1. Recognition and Scope of Agreement",
    "title_path": "III. General Provisions - 1. Recognition and Scope of Agreement",
    "token_count": 438, 
    "first_few_words": "A. Recognition The Union is recognized",
    "last_few_words": "Part II shall apply to background actors employed in the New York Zone." 
  },
  {
    "title": "A. Recognition",
    "title_path": "III. General Provisions - 1. Recognition and Scope of Agreement - A. Recognition",
    "token_count": 287,
    "first_few_words": "The Union is recognized by Producer as", 
    "last_few_words": "and body doubles. Background actors are not considered \"performers.\"" 
  },
  {
    "title": "B. Scope",
    "title_path": "III. General Provisions - 1. Recognition and Scope of Agreement - B. Scope", 
    "token_count": 151,
    "first_few_words": "(1) When Producer has its base of",
    "last_few_words": "Las Vegas, Sacramento, San Diego, San Francisco and Hawaii Zones. Only the provisions" 
  },
  {
    "title": "2. Union Security",
    "title_path": "III. General Provisions - 2. Union Security",
    "token_count": 785,
    "first_few_words": "A. Every performer hereafter employed by", 
    "last_few_words": "breach of the provisions of this Section shall be subject to arbitration between the Union and the Producer"
  },
  {
    "title": "3. Strikes",
    "title_path": "III. General Provisions - 3. Strikes", 
    "token_count": 1406, 
    "first_few_words": "A. No-Strike Clause The Union agrees",
    "last_few_words": "proceedings in accordance with its by-laws against the participants in such unauthorized"
  },
  {
    "title": "A. No-Strike Clause",
    "title_path": "III. General Provisions - 3. Strikes - A. No-Strike Clause", 
    "token_count": 31,
    "first_few_words": "The Union agrees that, during the",
    "last_few_words": "strike affecting motion picture production against any Producer signatory hereto."
  },
  {
    "title": "B. Rights and Duties of Union Members and Producers", 
    "title_path": "III. General Provisions - 3. Strikes - B. Rights and Duties of Union Members and Producers",
    "token_count": 1375,
    "first_few_words": "(1) If, after the expiration or other",
    "last_few_words": "strike, unless this covenant be sooner terminated by the written consent of Producer and Union."
  }
]

Then, I realized I needed to add the levels as well:

Adding Hierarchy Levels to the JSON:

[
  {
    "title": "I. Preamble",
    "title_path": "I. Preamble",
    "level": 1, 
    "token_count": 23,
    "first_few_words": "PRODUCER – SAG-AFTRA CODIFIED BASIC AGREEMENT",
    "last_few_words": "\"Producer\" and collectively referred to as \"Producers.\""
  },
  {
    "title": "II. Witness Eth",
    "title_path": "II. Witness Eth",
    "level": 1,
    "token_count": 13,
    "first_few_words": "WITNESSETH: In consideration of the mutual",
    "last_few_words": "agreements hereinafter contained, it is agreed as follows:"
  },
  {
    "title": "III. General Provisions",
    "title_path": "III. General Provisions",
    "level": 1,
    "token_count": 15432,
    "first_few_words": "1. RECOGNITION AND SCOPE OF AGREEMENT",
    "last_few_words": "to coding; It shall study and review the appropriateness of the Section relating to per diem rates;"
  },
  {
    "title": "1. Recognition and Scope of Agreement",
    "title_path": "III. General Provisions - 1. Recognition and Scope of Agreement",
    "level": 2,
    "token_count": 438,
    "first_few_words": "A. Recognition The Union is recognized",
    "last_few_words": "Part II shall apply to background actors employed in the New York Zone."
  },
  {
    "title": "A. Recognition",
    "title_path": "III. General Provisions - 1. Recognition and Scope of Agreement - A. Recognition", 
    "level": 3,
    "token_count": 287,
    "first_few_words": "The Union is recognized by Producer as",
    "last_few_words": "and body doubles. Background actors are not considered \"performers.\""
  },
  // ... (and so on for other segments, with appropriate level values) ...
]


I’ve added a “level” property to each object, indicating the segment’s level in the hierarchy. The levels are assigned as follows:

  • Level 1: Main sections (Preamble, Witness Eth, General Provisions, Exhibits, Letter Agreement)

  • Level 2: Numbered sections within General Provisions (1., 2., 3., etc.)

  • Level 3: Lettered subsections within numbered sections (A., B., C., etc.)

  • (And so on for further nesting, if present)

This enhanced JSON output provides a more complete representation of the document’s structure, including the hierarchical relationships and level of each segment, along with the other helpful information.

So, there are still some organizational issues to work out, but not bad for the initial execution of the idea.

1 Like

I’m actually going further than I intended. My content is semantically chunked according to the document hierarchy as well, but my issue is – what happens if that last chunk exceeds X tokens? How can I semantically “sub-chunk” it? This was the actual problem I was trying to solve.

As you can see from my last test, the parent-child relationship and how to deal with it is an issue. But the context window problem has pretty much been solved by the advance of technology. 128K → 1M token context windows are a thing now.

Looking forward to trying out your API if you get permission to release it.

Solved the parent-child relationship issue this way:

Adding “Has Children” Indicator:

[
  {
    "title": "I. Preamble",
    "title_path": "I. Preamble",
    "level": 1,
    "has_children": "N",
    "token_count": 23,
    "first_few_words": "PRODUCER – SAG-AFTRA CODIFIED BASIC AGREEMENT",
    "last_few_words": "\"Producer\" and collectively referred to as \"Producers.\""
  },
  {
    "title": "II. Witness Eth",
    "title_path": "II. Witness Eth",
    "level": 1,
    "has_children": "N", 
    "token_count": 13,
    "first_few_words": "WITNESSETH: In consideration of the mutual",
    "last_few_words": "agreements hereinafter contained, it is agreed as follows:"
  },
  {
    "title": "III. General Provisions",
    "title_path": "III. General Provisions",
    "level": 1,
    "has_children": "Y", 
    "token_count": 15432,
    "first_few_words": "1. RECOGNITION AND SCOPE OF AGREEMENT",
    "last_few_words": "to coding; It shall study and review the appropriateness of the Section relating to per diem rates;"
  },
  {
    "title": "1. Recognition and Scope of Agreement",
    "title_path": "III. General Provisions - 1. Recognition and Scope of Agreement",
    "level": 2,
    "has_children": "Y",
    "token_count": 438,
    "first_few_words": "A. Recognition The Union is recognized",
    "last_few_words": "Part II shall apply to background actors employed in the New York Zone." 
  },
  // ... (and so on for other segments, with appropriate "has_children" values) ...
]


Explanation of Changes:

  • I’ve added a new property, “has_children”, to each object in the JSON array.

  • The value of “has_children” is either “Y” (Yes) or “N” (No) depending on whether the segment has any child segments (subsections or further levels) within the document’s hierarchy.

Logic for Determining “Has Children”:

  • The code would analyze the hierarchical structure of the document.

  • For each segment, it would check if there are any segments at a deeper level within the outline that fall under that segment.

  • If child segments exist, “has_children” is set to “Y”, otherwise “N”.
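In code, that check is simple once the flat outline with levels exists; a small sketch (assuming the JSON array shown above, listed in document order):

def add_has_children(segments: list) -> list:
    """Mark a segment "Y" if the next segment in document order sits one level deeper."""
    for i, seg in enumerate(segments):
        next_level = segments[i + 1]["level"] if i + 1 < len(segments) else 0
        seg["has_children"] = "Y" if next_level > seg["level"] else "N"
    return segments

In a pre-order outline, a segment’s children always follow it immediately at a deeper level, which is what makes this one-pass check sufficient.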

Benefits of this Addition:

  • Structural Clarity: The “has_children” indicator provides a clearer understanding of the document’s hierarchy and the relationships between segments.

  • Navigation and Processing: It can be helpful for navigation and processing tasks. For instance, you might use this information to determine whether to expand or collapse sections in a user interface or to decide whether to further process child segments.

This enhancement further improves the information provided in the JSON output, making it more useful for understanding and working with the document’s structure.

1 Like

The “ideal chunk”, if we can call it that, is more or less “atomic”: it contains one idea at a time, so that your RAG wins in precision when matching the vector of a usually short query (often a one-sentence question) against the vector of a chunk. If chunks are long, they tend to have multiple ideas in them, thus losing the much-needed precision.

Why do they have multiple ideas in them when they are longer than 3-5 paragraphs? Because humans lose their thought map at around 3-5 paragraphs, and their minds start wandering, bringing in a bunch of less related ideas.

So 3-5 paragraphs is largely under the token limit. From my experience I would add that chunks (especially in legal documents) are often closer to 1 paragraph than 3.

The approach I promote first gets the chunks, then analyzes their purpose, and only then starts building hierarchical relations, versus building relationships first and then splitting into chunks.

With my approach I never run into the bunch of problems resulting from a narrow token window limit. It also gains in speed because I get chunks very early, and then the whole thing runs as parallel processing.
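A sketch of what running per-chunk subtasks in parallel could look like with the async OpenAI client (purely illustrative; the model, prompt and batching are placeholders, not the LAWXER pipeline):

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def analyze_chunk(chunk: str) -> str:
    # Placeholder prompt; in practice each subtask would use its own fine-tuned model
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"State the single idea of this passage:\n\n{chunk}"}],
    )
    return resp.choices[0].message.content

async def analyze_all(chunks: list) -> list:
    # All chunks are processed concurrently once they exist
    return await asyncio.gather(*(analyze_chunk(c) for c in chunks))

# results = asyncio.run(analyze_all(chunks))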

2 Likes

Here is a sneak peek:

Output JSON: Chunked Document Json - 67 seconds of precessing from raw text · GitHub

Was obtained from raw text here: PPA Believe Raw text (copied as is from PDF) - initially sent to processing as a whole file · GitHub

In 67 seconds.

The outline was built in several milliseconds from the output json and looks like this: PPA Believe Outline (build by arrawy walk recursive on output JSON) · GitHub

The processing cost (what I pay) is around USD 1.20 for that document on the OpenAI API, and if I used OCR I would add another 60 cents for that.

Edit: I forgot to mention that the models were initially trained on French, so this is the first time I ran English text through them. Looks like they don’t care much :wink:

1 Like

That’s a good name: “ideal chunk”.

Clearly, I am doing the latter. And it’s a struggle.

Which is what I am trying to get to in my “ideal chunks”.

Your approach does sound more logical. But the proof, as they say, is always in the pudding. Need to try a typical test document and see what happens.

See my previous post with the stress test results. Only a couple of minor errors because of the language mismatch between training (French) and production (English). So for 60 contracts of training data in a foreign language, I’m happy with the result. Cost will go down as I get more training data.

Yes. If I can feed it my document and get this back:

I do believe that is something that I, and a few others, would be willing to pay for.

I am continuing on the development of my “last mile” sub-chunking idea, but very much looking forward to your API as a solution for full documents.

1 Like

It’s a solid solution for sure. Personally though I would chunk some of the information slightly differently.

Take the below as just one example: I would treat all the information as a single semantic chunk, as opposed to the individual line items, as otherwise the context and the relationship between the pieces of information might get lost. Perhaps you consider this as part of your next step, but I am hoping to work towards a solution where this happens largely in a single, integrated step.

{
                    "index": 3,
                    "title": "1.1 Personal Data",
                    "name": "",
                    "content": "",
                    "type": "container",
                    "path": "000:001:003",
                    "children": [
                        {
                            "index": 0,
                            "title": "",
                            "name": "Definition of Personal Data and Scope of Information Collected",
                            "content": "Personal Data means information that directly or indirectly relates to You as an identified or identifiable natural person. This may concern, depending on the contract, the Sites, the Products or Services, Your status and\/or the means of collection, all or part of the following Personal Data:",
                            "type": "container",
                            "path": "000:001:003:000",
                            "children": [
                                {
                                    "index": 0,
                                    "title": "",
                                    "name": "Categories of Personal Data: Individual's Name Details",
                                    "content": "- Name(s) and surname;",
                                    "type": "body",
                                    "path": "000:001:003:000:000",
                                    "children": []
                                },
                                {
                                    "index": 1,
                                    "title": "",
                                    "name": "Postal Address Requirements for Invoicing or Delivery",
                                    "content": "- Postal address (invoicing or delivery);",
                                    "type": "body",
                                    "path": "000:001:003:000:001",
                                    "children": []
                                },
                                {
                                    "index": 2,
                                    "title": "",
                                    "name": "Contact Phone Number Requirement",
                                    "content": "- Landline or mobile (personal or professional) phone number;",
                                    "type": "body",
                                    "path": "000:001:003:000:002",
                                    "children": []
                                },

...

For me, the value of AI comes in recognizing where information should be grouped as a logical unit. While I agree there should be constraints on the size of a chunk, in practice you will have quite a few variations. Sometimes there is value in grouping content from different hierarchical levels together to maintain information coherence and context.

In any case, I look forward to building on the testing by @SomebodySysop with my documents. Your outcomes look great already. I have some additional ideas and thoughts and will report back here once I have had some time to play this through.

1 Like

So, here is where I am so far. I’ve come up with an instruction prompt and an output that gives me a JSON file that, theoretically, will let me create my sub-chunks. You will note that the main difference between this and @sergeliatko 's output is that his includes the content already chunked. For my solution to do that, you’d need to be working with a context window of 1M tokens (assuming you are working with fairly large documents).

But, as I’ve stated before, I see my solution as more of a “last mile” effort to semantically sub-chunk the chunks that have initially been semantically chunked – somehow I need to figure out a better way to say that.

Task: Analyze the following document and generate a nested JSON representation of its hierarchical structure.

Document:

(Paste the full text of the document here)

Output Format:

The JSON output should be an array of objects, where each object represents a segment of the document’s structure and has the following properties:

  • title: The title of the segment, including any preceding numbering or lettering (e.g., “I. Preamble”, “A. Recognition”).
  • title_path: The full hierarchical path to the segment, with each level separated by " - " (e.g., “III. General Provisions - 1. Recognition and Scope of Agreement”), incorporating the titles of all parent segments.
  • level: The level of the segment in the hierarchy (1 for main sections, 2 for subsections, etc.).
  • has_children: “Y” if the segment has child segments, “N” otherwise.
  • token_count: The approximate number of tokens in the segment’s text.
  • first_few_words: The first few words of the segment’s text (at least 5 words).
  • last_few_words: The last few words of the segment’s text (at least 5 words).
  • children (optional): If the segment has child segments, this property should contain a nested array of child segment objects following the same format.

Additional Instructions:

  • Use a reliable tokenization method to determine token counts.
  • Ensure accurate identification of segment boundaries based on the document’s headings and structure.
  • Pay attention to compound words and special characters during tokenization.
  • Maintain the hierarchical relationships between segments by nesting child elements within their parent segments.

Example Output:

(Provide a small example of the expected JSON output, similar to the one in the previous response)

	[
	  {
		"title": "I. Preamble",
		"title_path": "I. Preamble", 
		"level": 1,
		"has_children": "N",
		"token_count": 23,
		"first_few_words": "PRODUCER – SAG-AFTRA CODIFIED BASIC AGREEMENT",
		"last_few_words": "\"Producer\" and collectively referred to as \"Producers.\""
	  },
	  {
		"title": "II. Witness Eth",
		"title_path": "II. Witness Eth", 
		"level": 1,
		"has_children": "N",
		"token_count": 13,
		"first_few_words": "WITNESSETH: In consideration of the mutual",
		"last_few_words": "agreements hereinafter contained, it is agreed as follows:"
	  },
	  {
		"title": "III. General Provisions",
		"title_path": "III. General Provisions",
		"level": 1,
		"has_children": "Y",
		"token_count": 15432,
		"first_few_words": "1. RECOGNITION AND SCOPE OF AGREEMENT",
		"last_few_words": "to coding; It shall study and review the appropriateness of the Section relating to per diem rates;",
		"children": [
		  {
			"title": "1. Recognition and Scope of Agreement",
			"title_path": "III. General Provisions - 1. Recognition and Scope of Agreement", 
			"level": 2,
			"has_children": "Y",
			"token_count": 438,
			"first_few_words": "A. Recognition The Union is recognized",
			"last_few_words": "Part II shall apply to background actors employed in the New York Zone.", 
			"children": [
			  {
				"title": "A. Recognition",
				"title_path": "III. General Provisions - 1. Recognition and Scope of Agreement - A. Recognition", 
				"level": 3,
				"has_children": "N",
				"token_count": 287,
				"first_few_words": "The Union is recognized by Producer as",
				"last_few_words": "and body doubles. Background actors are not considered \"performers.\""
			  },
			  {
				"title": "B. Scope", 
				"title_path": "III. General Provisions - 1. Recognition and Scope of Agreement - B. Scope", 
				"level": 3,
				"has_children": "N",
				"token_count": 151,
				"first_few_words": "(1) When Producer has its base of",
				"last_few_words": "Las Vegas, Sacramento, San Diego, San Francisco and Hawaii Zones. Only the provisions"
			  }
			]
		  }
		  // ... (and so on for other children) ... 
		] 
	  } 
	  // ...
	]

I’ve not fully tested this yet, and I imagine there might be further changes, but it continues along my current thought path of making this work with a single API call, followed by local code that will read the JSON and chunk the document.
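A minimal sketch of that local step, assuming the nested JSON format above (the word-boundary matching mirrors the earlier idea in this thread; function names are placeholders):

import json
import re

def loose(words: str) -> str:
    # Match the boundary words regardless of how line breaks fall in the exported text
    return r"\s+".join(re.escape(w) for w in words.split())

def chunk_from_outline(document: str, segments: list, chunks=None) -> list:
    """Walk the nested outline and extract each segment's text from the source document."""
    chunks = [] if chunks is None else chunks
    for seg in segments:
        pattern = re.compile(loose(seg["first_few_words"]) + r".*?" + loose(seg["last_few_words"]), re.DOTALL)
        match = pattern.search(document)
        chunks.append({
            "title_path": seg["title_path"],
            "level": seg["level"],
            "text": match.group(0) if match else None,  # None flags a boundary the model got wrong
        })
        chunk_from_outline(document, seg.get("children", []), chunks)
    return chunks

# outline = json.loads(open("outline.json").read())
# chunks = chunk_from_outline(open("document.txt").read(), outline)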

1 Like

Each container is a logical unit, e.g.: 000:001:003:000

Once the structure is done, it’s up to you to define the level of precision your app needs to operate at. To do so, you simply walk from the leaves toward the root.

Also, just to be clear for new readers here: the thing is not cutting at the line break or sentence level.

It isolates text containing a single, “atomic” idea (but is configured not to split composite sentences, even though the first version did).

So if you have a long paragraph that contains several ideas in it, the tool will break it down into several paragraphs.

The same applies to merging short paragraphs together if they are part of the same idea.

I understand the hierarchy part, which I am also doing, and I call it the Parent Retriever method. What I do not understand is the first and last few lines. How will the context of the question be understood with the first and last few lines only? The approach that is working better for me than other approaches is as below.

I take the PDF, convert it into markdown and then split by the headers (#, ##, ### …). I chunk each header and, if the chunk is more than 300 tokens, I chunk it further. For each chunk, I store the whole content of that section in a separate metadata table.

When matching the question, I match it with the chunk embedding, but after that I retrieve the section content of the chunk and give it to the LLM.
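A minimal sketch of that header-based splitting step (purely illustrative; the regex and the 300-token follow-up are assumptions about the workflow described above):

import re

def split_markdown_by_headers(md: str) -> list:
    """Split a markdown document into sections at #, ##, ### ... headers."""
    sections, current = [], {"header": "", "content": ""}
    for line in md.splitlines():
        if re.match(r"^#{1,6}\s", line):
            if current["header"] or current["content"].strip():
                sections.append(current)
            current = {"header": line.strip(), "content": ""}
        else:
            current["content"] += line + "\n"
    if current["header"] or current["content"].strip():
        sections.append(current)
    return sections

# Each section longer than ~300 tokens would then be chunked further,
# while the full section text is kept aside as metadata for retrieval.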

Here is how the tool would split the quote above:

I understand the hierarchy part, which I am also doing, and I call it the Parent Retriever method. 
What I do not understand is the first and last few lines. How will the context of the question be understood with the first and last few lines only? 
The approach that is working better for me than other approaches is as below.

- I take the PDF, convert it into markdown and then split by the headers (#, ##, ### …).
- I chunk each header and, if the chunk is more than 300 tokens, I chunk it further. 
- For each chunk, I store the whole content of that section in a separate metadata table.
1 Like

Once you have the structure as JSON, nothing prevents you from doing the following:

  • Embed each “leaf” as a separate vector,
  • Also walk from the leaf toward the root and embed each section as a separate vector (after using a print-children function to get the full content of the section)

As a result, you’ll end up with more vectors than chunks, of different lengths (levels of precision).

When matching the query vector to your stored vectors, longer queries will tend to match sections better, while short queries will more likely hit leaves of the hierarchical tree.

As an example, the question “Do they collect email addresses?” will have a closer match with the chunk “- email address” in the “collected data” section.

And an example of a “collected data” clause will likely be closer to the full section “collected personal data”.
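As a rough sketch of that multi-resolution embedding (assuming the JSON structure shown earlier, with "content", "children" and "path" fields; the embedding model is just an example):

from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list:
    return client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding

def print_children(node: dict) -> str:
    """Concatenate a node's own content with the content of all its descendants."""
    parts = [node.get("content", "")]
    for child in node.get("children", []):
        parts.append(print_children(child))
    return "\n".join(p for p in parts if p)

def embed_tree(node: dict, vectors: list) -> None:
    """Embed every leaf on its own, and every non-leaf section as its full subtree text."""
    children = node.get("children", [])
    if not children:
        vectors.append({"path": node["path"], "scope": "leaf", "vector": embed(node["content"])})
    else:
        vectors.append({"path": node["path"], "scope": "section", "vector": embed(print_children(node))})
        for child in children:
            embed_tree(child, vectors)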