Using gpt-4 API to Semantically Chunk Documents

If the goal is semantic chunking, using whatever tech is out there, then other approaches are possible, and they are a lot cheaper and sharper.
We use layout parsing models to recognise paragraph/subpara headers. OCR is part of the pipeline, since we get scanned docs too. The models are fine-tuned to recognise these headers. Simple code finds the text between two headers as the "value" of the paragraph/subpara. Metadata includes auto-numbered sections and lines, as well as in-doc numbering if found, plus page numbers.
We already had an in-house doc extraction IDP tool, so this was an offshoot. It works pretty well with almost any para-based document layout.
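Roughly, the step that finds the text between two headers amounts to the sketch below. To be clear, this is an illustrative sketch, not our actual code; the headers list of (line_index, header_text) pairs is assumed to be what the layout/OCR model returns.

def paragraphs_from_headers(lines, headers):
    # headers: list of (line_index, header_text) pairs in document order,
    # as recognised by the layout model; lines: all lines of the document.
    chunks = []
    for i, (start, title) in enumerate(headers):
        end = headers[i + 1][0] if i + 1 < len(headers) else len(lines)
        chunks.append({
            "header": title,
            "value": "\n".join(lines[start + 1:end]),  # text between the two headers
            "start_line": start + 1,
            "end_line": end,
        })
    return chunks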

1 Like

Hi there - part of the intention is to also identify semantic chunks within longer sections of a document, which in some cases don't have further sub-headers or other obvious demarcations. Using AI models to identify appropriate delineations can be quite useful.

The other benefit of using an AI model is the flexibility it provides compared to more rules-based approaches, especially when working with a high diversity of documents.

1 Like

Apologies this time for the silence on my end - it’s been a long week.

Over the weekend, I finally had some time to further test and refine the approach. Based on some more extended test results, I have further streamlined step 1, i.e. the creation of the outline. I am now working exclusively with start line numbers and it appears to be working very well. In particular, for some documents I noticed that there was a risk of overlaps in start and end line numbers across sections. The shift towards using only start line numbers to demarcate sections mitigates that. Testing also showed that the wording of the prompt for the initial outline creation has a significant impact, resulting in further refinements.
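For anyone following along, the post-processing this implies is simple: each section ends on the line just before the next section starts. A minimal sketch, assuming the outline has already been flattened into a document-ordered list of dicts with integer start_line values (my real outline uses a JSON schema, shared later in the thread):

def add_end_lines(outline, total_lines):
    # outline: flat, document-ordered list of {"title": ..., "start_line": ...}
    for i, section in enumerate(outline):
        if i + 1 < len(outline):
            section["end_line"] = outline[i + 1]["start_line"] - 1
        else:
            section["end_line"] = total_lines
    return outline

Because every end line is derived from the following start line, overlaps between sections are impossible by construction.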

Testing also made clear that I need to re-work the approach for footnotes. My initial solution did not hold up in practice. But I’m optimistic that I can figure out something eventually. For now they are just extracted along with the rest of the text.

I'm about to run the outline creation step over several hundred very different documents. That will hopefully give some more nuanced insights into the remaining issues, but like you said, overall we are already in pretty good shape.

The other positive side benefit has been that I can use this approach as the new backbone for a long document summarization tool, of which I had created a prototype before.

2 Likes

Agreed.

Further agreed. The only problem I've run into so far is the model context window when prompting it to create the hierarchy. That sort of limits the size of the documents I can work with. But if I do an initial hierarchical chunking resulting in smaller sub-documents (less than 100 pages), then it looks like we're in business.
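The pre-split itself is nothing fancy; a minimal sketch, assuming the document is already available as a list of page texts:

def split_into_subdocs(pages, max_pages=100):
    # Cut the page list into sub-documents of at most max_pages pages each,
    # so each sub-document fits comfortably in the context window.
    return [pages[i:i + max_pages] for i in range(0, len(pages), max_pages)]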

1 Like

I figured I’d chime in after @jr.2509 pointed me towards this thread, thanks again @jr.2509.

For context: I am working on an application that takes an (unstructured) resume in PDF and a docx template containing placeholders (e.g. {firstname} or {jobexperience.startyear}). The app then creates a copy of the template, but replaces all placeholders with the data from the PDF resume. End result: similar-looking resumes + structured data you can then play with (e.g. translate, etc.).

This is my workflow:

  1. Extract text from pdf
  2. Prompt Open AI

System prompt: You are an information retrieval machine that takes an unstructured resume, structures it according to the provided json schema, and outputs json only

User prompt: Generate a structured representation of a resume. The absence of information must be indicated with ‘#####’. Handle multiple data occurrences as arrays. The structure should include the following fields and adhere to the following json schema

Then followed up by the desired JSON schema. I include nested objects for e.g. certification start year, institute, etc. A sketch of how the full call fits together follows the schema. Example:

"jobexperience": [
        {
            "company": "[Company Name]",
            "place": "[Location]",
            "jobtitle": "[Job Title]",
            "description": "[full description of the job]",
            "toolsandtechniques": "[Tools, Techniques, Skills, Environment]",
            "startyear": "[Job Start Year]",
            "endyear": "[Job End Year]"
        }
    ]
}
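Wired together, the call looks roughly like the sketch below. Treat it as a sketch only: the model name and the json_object response format are assumptions on my side, and resume_text / json_schema stand in for the extracted PDF text and the full schema.

from openai import OpenAI

client = OpenAI()

def extract_resume(resume_text: str, json_schema: str) -> str:
    # resume_text: text extracted from the PDF in step 1
    # json_schema: the full schema string, of which the above is an excerpt
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # assumption; use whichever model you prefer
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": (
                "You are an information retrieval machine that takes an "
                "unstructured resume, structures it according to the provided "
                "json schema, and outputs json only"
            )},
            {"role": "user", "content": (
                "Generate a structured representation of a resume. The absence "
                "of information must be indicated with '#####'. Handle multiple "
                "data occurrences as arrays. The structure should include the "
                "following fields and adhere to the following json schema\n"
                + json_schema + "\n\nResume:\n" + resume_text
            )},
        ],
    )
    return response.choices[0].message.content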

For the entities that it does not always get 100% right, I include samples to apply few-shot prompting. I have two examples of what the input resume work experience might look like and how the structured representation should end up. This works great and might help you with defining 'logical units' @jr.2509! I should also implement one for languages, because on some resumes people specify how good they are at different languages, and the LLM then sometimes creates a new nested object (and my app expects just a key/value pair "languages" downstream). So that will be something like:

Example 1:
Languages: English, Dutch, Hindi

Structured representation:
{
    "languages": "English, Dutch, Hindi"
}

Example 2:
Languages
English: mother tongue
Japanese: A2
Chinese: Written proficiency

Structured representation:
{
    "languages": "English, Japanese, Chinese"
}

My current challenge is extremely large resumes (some about 6k tokens). With those, the LLM does reply with a working JSON object, but it shortens work experience, and sometimes the profile description too, in order to fit inside the token limit.

I am looking into splitting, but can't get recursive text splitting to work well with my current setup (Buildship normalizes line breaks, d'oh).

The next thing I want to try is (this was mentioned earlier) to include first and last sentences of work experience (as this is what contains the most tokens) and programmatically reconstruct the full work experience. It's funny because that's what I used to do a few months back in order to save cost, but abandoned the idea. Might go back to it. Probably with some few-shot prompt examples to make sure the first and last words are unique (sometimes people describe different job experiences the same way).

The linenoXXX approach is very interesting too, but I feel like it underutilizes the power of LLMs. Just my thoughts.

Will keep you guys posted. If you have any other ideas to bypass the token limit, let me know.

Grtz

1 Like

I agree, but @jr.2509 and I discovered, the hard way, that the models can be pretty unreliable at accurately recreating the first and last sentences – not to mention the regex issues when there are errant spaces, linefeeds, tabs, etc. Using the line numbers has proven to be the most accurate and consistent way to identify blocks of text – at least so far.
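For reference, the line-number preprocessing boils down to something like this (simplified sketch; the exact lineNNNN identifier format is just an example):

def number_lines(text: str) -> str:
    # Prepend a stable identifier to every line before sending it to the model.
    lines = text.splitlines()
    return "\n".join(f"line{i:04d} {line}" for i, line in enumerate(lines, start=1))

def extract_block(text: str, start_line: int, end_line: int) -> str:
    # Recover a block of the original text from the line numbers the model returns.
    lines = text.splitlines()
    return "\n".join(lines[start_line - 1:end_line])

No regex matching against the model's reproduced text is needed; the numbers alone identify the block.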

1 Like

Considering my use case above, would you recommend converting the resume string to a linenoXXX-converted list and having the LLM return the line numbers of the start and end of each work experience?

And did you guys test it with temperature 0 + few-shot prompting?

I’m a bit short of time tonight but started to think about your case and will do a bit more thinking in the next 1-2 days.

But a few points:

I rely on a zero-shot approach in my case and so far this has worked very well.

In your case, I see an added challenge. Assuming that the resumes in question all have slightly different structures, you also need to classify your text into a pre-defined category in line with your desired JSON structure (at least this is how I understand it based on your examples). So while I think you can try to replicate the basic steps that @SomebodySysop and I have outlined and that have worked well for us, you will likely also have to ask the model to label the extracted text parts in accordance with your categories, then add that information to the JSON and use it later as part of the post-processing. In principle this should be possible. To achieve good results for the category labeling, a one- or few-shot approach may be more reliable, but I would test it. As it's not rocket science, the model may do a decent job under a zero-shot approach as well.

Btw, feel free to correct me in case I am misunderstanding anything about your specific case.

I always use 0 temperature. As for the prompt, I basically explain everything I want done, and give one example. Right now, this works for one type of document – basically legal agreements. However, I want to use this process for other types of document structures (Bible, Talmud, Tanakh) as well as freeform text (like chapters from a book). So, I'm not sure if I should have a different prompt for each type of document, or one prompt with multiple examples. Got to test that out.

In my opinion, that would be the least error-prone way to do it. You can see early in this thread that this was an issue for everyone. But, as usual, test to see what works best for you.

1 Like

Here’s a dime-a-dozen idea:

Walk through the context of a document, with multiple calls.

self-documenting system message:

// task

Identify logical split points in document sections, so that they can be effectively chunked for a document search function.

  • You have received a new section from within a document, starting from the last point where it was split, with line numbers prepended.
  • Hinting is provided that gives an ideal chunk size, but you can split before that if there is a clear new section topic or different facts.
  • Hinting explanation: of an ideal 500 token target length, [hint start] marks the 300 token point, and [hint end] marks the 700 token point. 1000 total tokens are provided.
  • Literature and fiction may have no obvious sections, but still must be split logically.
  • An auto-generated summary of the document start is provided that may give an overview of the document purpose.

// response in json

split_at_line_numbers: array of numbers // can be up to 4 points to split
split_at_text: array of strings // the last five words of a section within the line numbers to split at
section_titles: array of strings // short description of text before first split, plus additional entries if there is more than one split_at_line_numbers value, describing text between additional values
error_reason: string // optional; only to be used in case of extreme inability to comprehend document input, or no input

// role

AI is a backend processor, with no user to interact with, and one job.

Operating on short context may provide higher attention and reasoning.
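A sketch of how the numbered window and the hint markers might be prepared before each call; tiktoken for the token counting is an assumption, and the marker strings simply follow the hinting explanation above:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def build_window(lines, first_line_no, max_tokens=1000, hint_start=300, hint_end=700):
    # Emit up to max_tokens of numbered lines, inserting [hint start] at the
    # ~300-token point and [hint end] at the ~700-token point.
    out, used = [], 0
    for offset, line in enumerate(lines):
        numbered = f"{first_line_no + offset}: {line}"
        n = len(enc.encode(numbered))
        if used + n > max_tokens:
            break
        if used < hint_start <= used + n:
            out.append("[hint start]")
        if used < hint_end <= used + n:
            out.append("[hint end]")
        out.append(numbered)
        used += n
    return "\n".join(out)

The next call then starts from the last split point the model returned.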

1 Like

So, this is the json file I end up with in order to do my chunking:

https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/article11-out.json

Chunking so far works perfectly using the generated start and end line numbers.

However, I do have situations where the token count exceeds the limit I have set (in this case 600).

This means I need to run ANOTHER prompt to semantically chunk it down even further (if I still don't want to do the numeric chunking).

           {
                "title": "E. ARBITRATION OF DISPUTES CONCERNING CREDIT PROVISIONS",
                "level": 2,
                "token_count": **1355**,
                "start_line": "line0267",
                "has_children": "N",
                "children": [],
                "end_line": "line0326"
            },

https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/13_Article_11_1.pdf#page=14

                    {
                        "title": "5. Selection of an Arbitrator; Place of Hearing",
                        "level": 3,
                        "token_count": **767**,
                        "start_line": "line0375",
                        "has_children": "N",
                        "children": [],
                        "end_line": "line0444"
                    },

https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/13_Article_11_1.pdf#page=18

                    {
                        "title": "6. Timeline; Citation of Expedited Arbitration Awards",
                        "level": 3,
                        "token_count": **714**,
                        "start_line": "line0445",
                        "has_children": "N",
                        "children": [],
                        "end_line": "line0458"
                    },

https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/13_Article_11_1.pdf#page=20

I am thinking, at this point, I could send the model just the chunk with the line numbers, and ask it to return a JSON of just the suggested semantic sub-chunks of the submitted chunk.
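Something along these lines is what I have in mind so far (rough sketch, not tested yet; chunk_text would be just the oversized section with its original line identifiers):

def subchunk_messages(chunk_text):
    # chunk_text: the numbered lines of one oversized section only
    return [
        {"role": "system", "content": (
            "You will receive one section of a document, with a unique line "
            "identifier prepended to every line. The section is too long to "
            "use as a single chunk. Identify the semantically distinct ideas "
            "or concepts within it and return JSON only: a list of objects "
            "with 'title' and 'start_line', where each sub-chunk covers a "
            "single idea or concept."
        )},
        {"role": "user", "content": f"Section text: {chunk_text}"},
    ]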

I don't want to re-do what I've already done, which works. I just want a way to address this particular issue, which may or may not arise.

Any prompt suggestions? I have my own ideas, but interested in how others would approach this.

So you want the chunks to be max 600 tokens? I think LLMs are still notoriously bad at math. Not sure what version of GPT you are using (I think 4 is a bit better), but are you sure the token sizes are actually correct? I am not convinced prompting an LLM is the best approach to count token size. I think the LLM might understand a max number of lines better, since that will probably occur more often in its training data.

Something that comes to mind is to calculate the average token size per line of the document, then use the difference between start/end lines to calculate an approximation of the number of tokens (or use a general estimate). Then use a max number of lines in the prompt.
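A minimal sketch of what I mean, with tiktoken as one possible way to do the counting:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def max_lines_for_budget(lines, token_budget=600):
    # Average tokens per line across the document, then convert the token
    # budget into a "max number of lines" figure to use in the prompt.
    total_tokens = sum(len(enc.encode(line)) for line in lines)
    avg_per_line = max(total_tokens / max(len(lines), 1), 1)
    return int(token_budget // avg_per_line)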

Is making an assistant with a token counter function an option (not sure if possible)?

They are. But, they are much better at counting tokens, as they must keep track of total input and output tokens processed during each call.

However, my goal is NOT to have the model determine the chunks by token size. My goal is to have the model determine chunks semantically that DO NOT EXCEED a particular token size. Not the same thing.

Right. Can you instruct the LLM to split up the chunk into a nested object with multiple chunks once the chunk is bigger than a certain threshold? E.g. 'if the chunk size exceeds 500 tokens, make a nested object with sub-chunks of max 500 tokens that together make up the chunk'.

An OpenAI developer in this forum issued a maxim a few months ago:

Thou shalt not use an LLM when traditional code can be used to accomplish the task.

There is no point in expending tokens to chunk a document by x tokens when this can be done easily with one line of code.
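For instance, a fixed-size split is essentially this (tiktoken shown just to make the point; any tokenizer works):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def fixed_chunks(document_text, size=600):
    tokens = enc.encode(document_text)
    # the "one line": fixed 600-token chunks, decoded back to text
    return [enc.decode(tokens[i:i + size]) for i in range(0, len(tokens), size)]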

What I want to accomplish, and what is the subject of this entire thread, is to chunk text by its semantic meaning, its individual concepts and/or ideas.

What we have done so far is a hierarchical breakdown of a document by its semantic structure. Now, what I am asking is: if, after this breakdown, I still end up with a chunk of text which exceeds x tokens, how can I prompt the model to break it down even further based upon what it is saying, not based upon its size?

This is something that cannot be accomplished with code alone.

For the second step of dealing with the further breakdown - are you looking to do this in a second API call or try to combine with the first?

I also assume your documents do not have any further sub-sections in those longer chunks, right?

I would like to have it done in the first call, but it is possible to have a long section of text that, according to the first prompt, would not have any children. This is the case where I think we'd need a second prompt. Hierarchically, it does not have any further subsections, but it still needs to be broken down by ideas/concepts.

Actually, in the cases that I cited here: Using gpt-4 API to Semantically Chunk Documents - #52 by SomebodySysop, they all have sub-sections. Not sure why they weren’t broken down by the first prompt, but the 2nd prompt would then act as an insurance policy to make sure all sections are broken down to the token limits.

Again, I see this second prompt as more of a "semantic" breakdown than a "hierarchical" one, as well as a final "catch-all" for items that aren't broken down sufficiently by the first prompt. It is only triggered when a chunk exceeds the token limit.

Langchain has a semantic text splitter that might be worth looking into.
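For example, something along these lines (based on my reading of the LangChain docs, so treat the exact imports and signatures as assumptions to verify); it splits on embedding-similarity breakpoints rather than prompting a chat model:

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

def semantic_split(document_text):
    # document_text: the full text to split
    splitter = SemanticChunker(OpenAIEmbeddings())
    return splitter.create_documents([document_text])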

1 Like

See in the following my latest prompt plus the JSON schema I am using as part of it. Given your methodology is slightly different, you will have to make some adjustments, I believe. That said, with this approach I have gotten fairly nuanced breakdowns in my documents.

Can’t guarantee it will work in your case but for what it’s worth…

What I’ll add is that my results improved the less I asked the model to do. For example, once I dropped the requirement for also including the end line number, the model ended up having more capacity to focus on the core task of identifying the semantic sections. For me that was part of the rationale to focus purely on the start line number.


Prompt

messages = [
                    {"role": "system", "content": "Your task is to identify the main sections and sub-sections in the provided document text and return the information in the prescribed JSON format. You must identify sections down to the lowest hierarchical level. Where in place, you use the table of contents (or similar) as a reference point for identifying relevant sections and sub-sections. Your response must strictly include all sections and sub-sections. You do however not include the table of contents itself."},
                    {"role": "user", "content": f'JSON Schema: {json_outline}, Document text: {sentences_data}'}
                ]

Example JSON schema

{
    "hierarchy_level": 0,
    "title": "Document title (verbatim)",
    "children": [
        {
            "hierarchy_level": 1,
            "title": "Section title (verbatim, include preceding letter, number or roman numeral if applicable)",
            "start_line":"number of the line where the section starts based on the unique line identifier in the JSONL file with the document sentences",
            "children": [
                {
                    "hierarchy_level": 2, 
                    "title": "Sub-section title (verbatim, include preceding letter, number or roman numeral if applicable, N/A if no title)",
                    "start_line":"number of the line where the sub-section starts based on the unique line identifier in the JSONL file with the document sentences",
                    "children": [
                    {
                        "hierarchy_level": 3, 
                        "title": "Sub-section title (verbatim, include preceding letter, number or roman numeral if applicable, N/A if no title)",
                        "start_line":"number of the line where the sub-section starts based on the unique line identifier in the JSONL file with the document sentences",
                        "children": []
                    }
                    ]
                }
            ]
        }
    ]
}
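In case it's useful, the post-processing on my end is essentially a recursive walk of that nested structure to get a flat, document-ordered list of sections; each section's end line then falls out of the next section's start line, as discussed earlier in the thread. A sketch, with field names matching the schema above:

def flatten_outline(node, flat=None):
    # Walk the nested outline depth-first and collect sections in document order.
    if flat is None:
        flat = []
    if "start_line" in node:
        flat.append({
            "hierarchy_level": node["hierarchy_level"],
            "title": node["title"],
            "start_line": node["start_line"],
        })
    for child in node.get("children", []):
        flatten_outline(child, flat)
    return flat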
2 Likes