Using gpt-4 API to Semantically Chunk Documents

The subject change I’m talking about is illustrated in the example below, which shows how the model can split your previous message into elements:

Common chunks:

  • Look, again I think in general I tend to agree with you.

  • I look at this from a very practical point of view - under consideration of the types of documents I am dealing with, which are not legal contracts.

  • Often you have the case that a title has an important informational value and serves as context to interpret the information in the section body.
    – For example, you may have situations where a similar topic is discussed at different points in a document body.
    – However, the context within which it is discussed may differ, and you may only be able to discern the difference with an understanding of the section titles where the information is located.

  • So this is for me where I see the value of having an understanding of the layout and its role in the logical structure and interpretation of information.

Parent chunks would be (titles are shortened, just to give the idea):

Quote

  • agreement to the point
  • difference in use case
  • case of titles
  • value of layout

Case of titles

  • special case of titles
  • example
  • inability to discern the difference

You had no titles in the message, but the change of subject allows it to be broken up into elements (not just one sentence as in your text), then grouped into blocks, and potentially, if needed, to define the purpose/title/outline/metadata, etc.

1 Like

That’s why the chunk should contain its title, and often the title of the parent; it must also have its outline and some other elements to be retrievable.

But you can identify a title even without formatting; here is an example:

… explore the top 15 extraordinary real-world applications of AI that are driving change and revolutionizing industries this year. Healthcare AI has made significant strides in healthcare this year by improving diagnostics, enabling personalized medicine…

“Healthcare” is a title hidden in the raw text. And you can spot it regardless of the formatting, even in a single line; so if your text also has line breaks, spotting titles is even easier. But in the engine we build, we presume the line breaks are broken, so we trained a model to fix them as part of identifying the purpose of text blocks (title/text/meta/etc.)
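If you want to play with the idea without a trained model, here is a rough stand-in using a generic chat completion. The prompt wording and the labels are only illustrative assumptions, not what our engine actually does:

```python
# Rough stand-in for a purpose-trained block classifier: rebuild broken line
# breaks and label each block as TITLE, TEXT or META with a chat completion.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def label_blocks(raw_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "system",
             "content": ("The line breaks in the following text may be broken. "
                         "Rebuild the logical blocks and label each one as "
                         "TITLE, TEXT or META, one block per line, in the form "
                         "'LABEL: block'.")},
            {"role": "user", "content": raw_text},
        ],
    )
    return response.choices[0].message.content
```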

2 Likes

I think @jr.2509 has done a good job of making the case for my methodology. While it does not always result in a single “atomic idea” per chunk, it does present full semantic ideas – as intended by the document author – within each chunk.

Yes. They are generally represented by the titles which will contain the titles of all parent segments. You can see these are identified in the below examples:

chunk hierarchy json file:
https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/article11-out.json

chunk output:
https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/output.txt

I can optionally include in each chunk a summary of its parent document – in my testing this has the effect of “connecting” chunks of the same document in my near-text searches.
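To make that concrete, here is a minimal Python sketch (the field names are illustrative, not my actual schema) of how a chunk’s embeddable text can carry its own title, its parent titles, its outline position and, optionally, the parent document summary:

```python
def build_embeddable_chunk(chunk_text: str,
                           chunk_title: str,
                           parent_titles: list[str],
                           outline: str = "",
                           parent_summary: str | None = None) -> str:
    """Assemble the text that gets embedded: context first, then the chunk body."""
    header = " > ".join(parent_titles + [chunk_title])  # e.g. "Agreement > Termination > Notice"
    parts = [header]
    if outline:
        parts.append(f"Outline: {outline}")             # e.g. "2.3.1"
    if parent_summary:
        parts.append(f"Document summary: {parent_summary}")
    parts.append(chunk_text)
    return "\n\n".join(parts)
```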

This is a flowchart of my methodology which, in my opinion, renders the positive effects of both “atomic idea” and “layout-aware” concepts with respect to embedding generation.

Note that I will now semantically sub-chunk any chunks that exceed my given chunk token size limit.

I have yet to test this out in production, but I’m working on it!

4 Likes

That’s just the coolest sentence I’ve read all day. What happens if there’s a contradictory statement, or the idea itself contradicts the whole? Like in the case of a lie in a document or a conversation? A piece of deliberately contradictory information.

I think your observation of “reliable data patterns” in data is farsighted.

I also think it’s difficult to discover the atomic idea of a document, but that the more thought out the document is, and the more structured its layout, the easier it is to identify the atomic idea.

In Contracts, there’s usually a clear hierarchy. Long-winded but perfectly understandable sections. Websites are structured with HTML. Financial Statements follow a pre-defined form. Not to say that there aren’t anomalies, just that, by and large, at a glance, these documents have a clear structure that is, and has been, standardized for decades and/or millennia.

Then maybe you have something like an Epic Poem, Shakespeare, or a television show. They have reliable structures, but to understand them you have to read more carefully. There might be obvious semantic clues, like incremental repetition. “When young dawn with her rose red fingers rose once more” is a frequently repeated phrase in the Odyssey, for example, that can usually be found at the beginning of a chapter.

Then you have conversation, like this:

"… but her words, every body’s words, were soon lost under the incessant flow of Miss Bates, who came in talking, and had not finished her speech under many minutes after her being admitted into the circle at the fire. As the door opened she was heard,—
“So very obliging of you!—No rain at all. Nothing to signify. I do not care for myself. Quite thick shoes. And Jane declares—Well! (as soon as she was within the door), well! This is brilliant indeed! This is admirable! Excellently contrived, upon my word. Nothing wanting. Could not have imagined it. So well lighted up! Jane, Jane, look! Did you ever see any thing? Oh! Mr. Weston, you must really have Aladdin’s lamp. Good Mrs. Stokes would not know her own room again. I saw her as I came in; she was standing in the entrance. ‘Oh! Mrs. Stokes,’ said I—but, I had not the time for more.” She was now met by Mrs. Weston. “Very well, I thank you, ma’am. I hope you are quite well. Very happy to hear it. So afraid you might have a headache! seeing you pass by so often, and knowing how much trouble you must have. Delighted to hear it indeed—Ah! dear Mrs. Elton, so obliged to you for the carriage; excellent time; Jane and I quite ready. Did not keep the horses a moment. Most comfortable carriage. Oh! and I am sure our thanks are due to you, Mrs. Weston, on that score. Mrs. Elton had most kindly sent Jane a note, or we should have been. But two such offers in one day! Never were such neighbours. I said to my mother, ‘Upon my word, ma’am.’ Thank you, my mother is remarkably well. Gone to Mr. Woodhouse’s. I made her take her shawl,— Mrs. Dixon’s wedding present you know; Mr. Dixon’s choice. There were three others, Jane says, which they hesitated about some time. Colonel Campell rather preferred an olive.—My dear Jane, are you sure you did not wet your feet? My dear Jane, are you sure you did not wet your feet? It was but a drop or two, but I am so afraid: but Mr. Frank Churchill was so extremely—and there was a mat to step upon. I shall never forget his extreme politeness. Oh! Mr. Frank Churchill, I must tell you my mother’s spectacles have never been in fault since; the rivet never came out again. My mother often talks of your good-nature: does not she Jane?..” — “Emma,” Volume 3, Chapter II, Jane Austen

This goes on like this for another page.

Where is the central idea in that? It just depends how you look at the passage. On one level, the atomic idea here is “Miss Bates gets sat at the fireplace during the Westons’ ball.” :sweat_smile: But on another level, it’s a brilliant recording of human conversation, very true-to-how-people-actually-talk, which is hard to do as a writer. (Try reading it aloud. It’s brilliant. I think this is one of the best written passages in all of Western Literature.) On yet another level, there are some interesting clues to the overall plot which are so casually dropped in the middle of all that delightful nonsense that it’s easy to miss… unless you pay attention to what she actually says… and know the whole story. This latter gets at the Purpose of the writer: “What did Jane Austen intend by including this section?” The central idea changes with each perspective.

So I think it’s an excellent idea to keep underlying data patterns separate from layout, but could you use document layout to give an intelligence layer a clue on how to analyze the document? Could there be a step that adds the “layout” as metadata?

For example, the layer reads a document—you can usually determine from the first few pages what type of document it is, and whether it has some predefined structure. Say you’re working with a single law firm, and they have a type of standardized form they always use. In this case, in recognizing the pre-defined structure, the model can identify the layout and spend less time looking for Purpose.

But what if it comes across conversation, like the above? Perhaps a long, wandering and rambling deposition where, accidentally on purpose, the person lets slip some juicy tidbit that seems irrelevant at first glance. Perhaps in this case, the model identifies that this is indeed “long-rambling conversation” and pays more attention to underlying meanings where Purpose might be harder to identify, or even intentionally obfuscated by the speaker.

As you say, Order is the first thing you look at. But what if something input later is the only thing that makes a conversation input earlier make sense? Since the conversation was added to your database before that next piece of information, would the model think to go back and check something that seemed completely nonsensical the first time it looked it over?

1 Like

I take the concept of “atomic idea” to be akin to the way atoms are the building blocks of matter: atomic ideas are the building blocks of documents. Essentially, the atomic idea is the core idea of a sentence, paragraph, or in our case, chunk.

If you look at the hierarchy flowchart I posted, here is where the “semantic chunk” section is deployed – to break that long-winded text down to its individual core – “atomic”, if you will – ideas expressed in each sentence / paragraph.

1 Like

I certainly agree that there is an atomic idea in every semantic chunk. The idea is the “thing that holds the chunk together.”

But I also think that there is an atomic idea for an entire document. That idea is the glue that holds the semantic structure of the whole thing together.

To my mind, the purpose here—in reconstructing how the human mind achieves multi-dimensional understanding, and trying to emulate that with AI—is to relate the Central Unifying Idea—“The Atomic Idea”—of the entire document to that of each individual chunk. Without understanding that Idea, there is no way to re-assemble the concepts in the document. It is the reason the document was created: “the Purpose of the Document.”

Ayn Rand and Aristotle were big proponents of this “Central Idea.” I direct you to Aristotle’s theory on the Immovable Movers, Ayn Rand’s “The Fountainhead,” Plato’s “Theory of the Forms,” and I think Jung’s Theory of the Archetypes is also important for a more thorough discussion of how archetypal unifying concepts ‘hold things together.’

If you take that big, long-winded speech from Miss Bates down to the atomic ideas contained within each individual sentence, you miss the broader significance of including that paragraph in that section of the book.

There’s really no “central idea” contained in that short blurb. There is not enough context to know how it works in the world other than to know we’re at a party and someone is talking “incessantly.” And, conversationally, each sentence contradicts the last. It’s a rhetorical mess! There’s not much Purpose behind what Miss Bates is saying, but there is Purpose in why that information is there.

Maybe, for super-long documents or documents with complexities like literature, there could be some type of “Hypothesis” meta-field. A field that stores the model’s working theory about what the Document’s Atomic Idea is as it parses individual chunks. Perhaps certain hierarchical levels of smaller sections in the document also have Hypothesis Fields, where the model determines relatedness to the Central Idea.

As it reads the document and new premises are added, it updates its Hypothesis as needed. It continues to update the hypothesis until it feels it is conclusive.

This final meta-field, the Conclusion Field, I think, would have a “High degree of cosine relatedness” with smaller atomic ideas within the document, thus allowing it to recreate (understand) the original ideas contained within.
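If it helps to picture it, here is a very rough sketch of that idea, assuming one chat-completion call per chunk; the prompt and the update strategy are placeholders, not a tested design:

```python
# Sketch of a self-updating "Hypothesis" meta-field: a working theory of the
# document's central idea, revised after each chunk is read in order.
from openai import OpenAI

client = OpenAI()

def update_hypothesis(current_hypothesis: str, new_chunk: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "system",
             "content": ("You maintain a one-paragraph hypothesis about the central "
                         "idea of a document. Revise it only if the new passage adds "
                         "or changes something; otherwise return it unchanged.")},
            {"role": "user",
             "content": f"Current hypothesis:\n{current_hypothesis}\n\nNew passage:\n{new_chunk}"},
        ],
    )
    return response.choices[0].message.content

# Usage: fold over the chunks in document order.
# hypothesis = "No hypothesis yet."
# for chunk in chunks:
#     hypothesis = update_hypothesis(hypothesis, chunk)
```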

This whole idea of reiteration over a document wouldn’t be necessary for something with an easily identified layout—only for documents that potentially have multiple levels of meaning and require deeper understanding.

So, back to the conversation on layout, if you identify a document as a financial statement—there’s probably no need to iterate over it more than once. It’s a computational waste of time. But, if you identify a document as literature or conversation, something that might contain hidden or not-easily-identified meaning, the model will flag it for this process.

2 Likes

Thank you. Now I understand what you are saying, and I agree. I constantly warn my users that while these models are amazing at finding “needle in haystack” concepts, they aren’t very good at determining “The Big Picture”.

One of the datasets I deal with involves religious scriptures. I also warn my users not to press the models for their “hypothesis” on the meaning of these texts because they aren’t human and don’t have any concept of the real world. It is for this reason that I include the works of theologians to be used to form “conclusions”.

You present an interesting concept that should be considered. Right now, I’m trying to establish the atomic idea at the micro level. Next up: determine the atomic idea at the macro level.

1 Like

This is a progress report. I am finally starting to implement this methodology, one baby step at a time. I thought I would share some notes I have made on the process in case it might help others.

First off, this is the process flow: https://us1.discourse-cdn.com/openai1/optimized/4X/4/9/b/49be57bfa47aaa66a21ab627ff894565d373af6e_2_281x500.jpeg

It is the same overall process we’ve been working on since this post: Using gpt-4 API to Semantically Chunk Documents - #35 by SomebodySysop

So far, everything is implemented EXCEPT the actual physical chunking of the segments. And, so far, it’s working as expected.

I tested with a non-legal agreement document today - A rambling sermon by a well-known Mormon theologian: A Court Trial - 64-0412 - Sermon preached by William Branham

This source is an HTML file, not a PDF, so I had to do an html-to-text extraction - replacing paragraph tags with line feeds and removing all the other tags.
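For reference, that extraction step amounts to something like this regex sketch (a proper HTML parser would be more robust, but this shows the idea):

```python
# Replace paragraph/line-break tags with line feeds, then strip all other tags.
import re

def html_to_text(html: str) -> str:
    text = re.sub(r"(?i)</p\s*>|<br\s*/?>", "\n", html)  # paragraph and <br> tags -> line feeds
    text = re.sub(r"<[^>]+>", "", text)                   # remove all remaining tags
    return text
```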

After running it through my process, here is the json hierarchy output, ready to be chunked: https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/node-5979-json.txt

First off, my advice to anyone doing this is to take it one step at a time, and build in a failsafe at each step so that if it fails, you automatically revert to your default chunking method that you know will work.
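Something like this pattern works: the semantic chunker is just a callable, and anything unexpected drops you back to a simple splitter. The function names and the naive fallback below are placeholders, not my actual code:

```python
# Failsafe wrapper: any failure in the semantic step reverts to default chunking.
from typing import Callable, List

def default_chunks(text: str, size: int = 2000) -> List[str]:
    """Plain fixed-size fallback chunking that is known to work."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def chunk_with_failsafe(text: str,
                        semantic_chunker: Callable[[str], List[str]]) -> List[str]:
    try:
        chunks = semantic_chunker(text)
        if not chunks:  # treat empty output as a failure too
            raise ValueError("semantic chunker returned nothing")
        return chunks
    except Exception as err:
        print(f"Semantic chunking failed ({err}); using default chunking.")
        return default_chunks(text)
```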

Overall, this is working well. The most difficult part has been getting the prompt correct.

  • Biggest issues:
    • Titles. The document I tested with (A Court Trial - 64-0412 - Sermon preached by William Branham) does not have actual segment titles (like the legal agreements), just numbered sections, so the model would either return the first sentence following the segment number: https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/node-5979-json.txt or just the segment number: https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/node-5979-json+02.txt
      • Telling it to return the first x words didn’t work. Telling it to return a maximum of x tokens in the title worked better, but has mixed results.
    • Output token limit. Because the model was initially returning the entire first paragraph of each segment, it was exceeding its output token limit (GPT-4o limit = 4K tokens and Gemini 1.5 Pro limit = 8K tokens). When it does this, it simply returns as much of the JSON file as it can and then stops. I need to find a way to make it either return the entire JSON file or, if it can’t, return an error.
    • JSON formatting.
      • Getting the JSON formatting correct was a bit challenging as well: essentially trying to figure out when to json_encode and json_decode. I’ve added a JSON format test so processing will not continue if the JSON hierarchy file is not correctly formatted. This includes the case where a full JSON array is not returned (when the model stops generating it before completion).
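The format test itself is conceptually simple. Here is a sketch of the kind of check I mean (illustrative only, not my production code):

```python
# Refuse to continue if the model's hierarchy output is not valid JSON
# or was cut off before the array was complete.
import json

def parse_hierarchy(raw: str) -> list | None:
    """Return the hierarchy array, or None if it is malformed or incomplete."""
    try:
        hierarchy = json.loads(raw)
    except json.JSONDecodeError:
        return None                 # truncated or otherwise invalid JSON
    if not isinstance(hierarchy, list):
        return None                 # expected a full JSON array of segments
    return hierarchy
```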

I still need to test with a few different formats, but overall, I am getting the results I hoped for: each document broken down hierarchically and chunked based upon the semantic intent of the author as presented in its layout. Essentially, “atomic idea” as opposed to “sliding window”.

I still need to add that additional semantic sub-chunking subroutine to handle the cases where a hierarchical chunk exceeds the token limit, but that’s looking easier and easier as I progress.
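As a placeholder, the sub-chunking could be as simple as recursive splitting on paragraph boundaries until every piece fits. This structural sketch is only a stand-in for the model-driven semantic split I actually plan to use, and a simple characters-per-token estimate stands in for a real token count:

```python
# Recursively split an oversize chunk at paragraph boundaries (hard split if
# there are no paragraph breaks) until every piece is under the token limit.
def sub_chunk(text: str, max_tokens: int) -> list[str]:
    if len(text) // 5 <= max_tokens:               # crude chars-per-token estimate
        return [text]
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    if len(paragraphs) <= 1:                       # no paragraph break to split on
        midpoint = len(text) // 2
        return sub_chunk(text[:midpoint], max_tokens) + sub_chunk(text[midpoint:], max_tokens)
    midpoint = len(paragraphs) // 2
    left = "\n\n".join(paragraphs[:midpoint])
    right = "\n\n".join(paragraphs[midpoint:])
    return sub_chunk(left, max_tokens) + sub_chunk(right, max_tokens)
```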

So far, so good.

1 Like

Are you not prepending the line numbers anymore? If it just returns start and end lines, you shouldn’t run into any output token limitations, thus improving reliability (no more unfinished JSON objects).

The object returned by the OpenAI API includes a finish reason. You can use that for error handling (finish_reason will be ‘length’ if it stopped because of the token limit; see docs here: https://platform.openai.com/docs/guides/text-generation/chat-completions-api).
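In the Python client it looks roughly like this (the model name and prompt are placeholders):

```python
# Check the finish reason to detect output truncated by the token limit.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Return the hierarchy JSON for ..."}],
)

choice = response.choices[0]
if choice.finish_reason == "length":
    raise RuntimeError("Output truncated by the token limit; JSON is incomplete.")
hierarchy_json = choice.message.content
```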

Btw, I am curious what model you guys settled on. I just tested 4o vs 4-1106 (both 0 temp), and 4o performed better at classifying text, getting the right key/value pairs returned, etc. On the other hand, 4-1106 sometimes included things 4o did not include, but should have.

Now that I think of it, maybe I should run 4o not at 0 but 0.1 temp for best results.

:exploding_head:

Ohhh! You’re right! Dang. Hadn’t thought of that. The process I outlined would definitely only be applicable in cases when matters of belief are not being discussed.

Though, I think storing information that includes a self-updating hypothesis meta-field about “what the model thinks the text is about” is a different task from “what the model believes about the text it’s reading.” So it can still parse a spiritual text without commenting on its own beliefs or whatever.

But, when retrieving information on a spiritual topic, perhaps it would be best to retrieve information from a wide variety of similar scholars on the topic. So, in your next post, where you talk about a long rambling sermon from a Mormon (which is an excellent choice for long-and-rambling rhetoric with a variety of hidden meanings), you might also return opinions from other religious scholars in the field and bypass having the model “express beliefs” it doesn’t have.

Hey mate, in looking over that document, you’re right, it doesn’t have any titles. But, since it’s a sermon mocked up as a court trial, it does have rhythm. Since it’s a literal speech, the rhythm is probably related to iambic pentameter (which is the rhythm most like spoken English). The fact that it was literally spoken is important. This isn’t something rambling by Emerson, who was writing and never had to pause for breath.

The “title” of each section is contained as semantic clues, usually after the first sentence, which is presented in a rhetorical style.

So, perhaps instead of looking for the title in the first sentence, or a maximum of x tokens, can you somehow:

  1. Identify the layout of the Document, in this case a spoken sermon.
  2. From the first few paragraphs, have the model deduce the speaker’s style—i.e. He usually states his point in the second sentence, or x number of tokens using a fairly standard rhetorical rhythm.
  3. Use that rhythm as the rule to identify “titles” efficiently from the rest of the text.

Here, you identify the layout of the document to infer semantic structure. Meaning (purpose) is easier to identify in this document than in the Miss Bates speech by using rhythm. I’m willing to bet dollars-to-doughnuts that the model could identify this fellow’s rhetoric very quickly if it’s allowed to look for stylistic cues defined by the document.

Yes, Step 02 in my process flow.

Thanks. Didn’t realize that!

I am testing with GPT-4o, Gemini Pro 1.5 and Claude Opus. GPT and Claude return the most accurate JSON, while Gemini can handle the larger documents, though with more inaccuracies.

This is the source document: A Court Trial - 64-0412 - Sermon preached by William Branham, which is 25,847 tokens / 105,057 characters.

Here is the json hierarchy array:
https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/node-5979-json.txt

Note that this output is over 7K tokens. I still think GPT-4o is limited to 4K output: What is the token-limit of the new version GPT 4o? - #3 by _j

One more thing I noticed: NONE of the models seems to reliably calculate token totals. I guess that’s to be expected, since they also can’t count characters, words or lines.

So, I’ll probably need to add a subroutine to the “end line” step 02 to more accurately calculate the token count for each segment.
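One option for that subroutine is to count tokens in code with tiktoken rather than asking the model. A sketch, assuming segments are addressed by 1-based, inclusive start/end line numbers:

```python
# Count tokens per segment in code; cl100k_base matches GPT-4-class models.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def segment_token_count(lines: list[str], start_line: int, end_line: int) -> int:
    """Count tokens for a segment given its (1-based, inclusive) line range."""
    segment = "\n".join(lines[start_line - 1:end_line])
    return len(enc.encode(segment))
```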

1 Like

Here is the modified process flow:

So now:

  1. export the pdf (or whatever) document to txt.
  2. run code to prepend linenoxxxx:
  3. send this numbered file to model along with instructions to create hierarchy json file
  4. process this file with code to add end_line numbers and output that json file.
    new: also add token_count to the json file
  5. run code on json output to create the chunks.
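As a rough illustration of steps 2 and 5 (my implementation is not Python, so this sketch just shows the idea; the four-digit padding of the lineno prefix is an assumption):

```python
def number_lines(text: str) -> str:
    """Step 2: prepend lineno0001:, lineno0002:, ... to each line of the text file."""
    return "\n".join(f"lineno{i:04d}: {line}"
                     for i, line in enumerate(text.splitlines(), start=1))

def extract_chunk(text: str, start_line: int, end_line: int) -> str:
    """Step 5: pull a segment out of the original (un-numbered) text by line range."""
    lines = text.splitlines()
    return "\n".join(lines[start_line - 1:end_line])
```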

I am programmatically adding the token_count for each chunk. This will make the prompt more efficient and reduce token output on the API call. Also, because it’s being done programmatically, the token count will be accurate!

It is amazing to me that large language models seem to be able to do everything EXCEPT count characters, words, lines or even tokens from the text they “read” and “analyze”. Maybe GPT-5 will overcome these limitations.

Otherwise, Rumors of the Demise of Software Engineers are highly exaggerated. I’m talking to you Nvidia CEO and Devin.

1 Like

I can see why this json is so big! Those title fields are gigantic.

I asked ChatGPT to summarize these title fields into short-tail keyword phrases that maintained meaning:

Summarized Short-Tail Keyword Phrases:

  1. Original: “9 Look at those drunken Roman soldiers on that day, just before the crucifixion took place—sending Him out there and smacking Him on the face, and the cheeks, and things like…”
  • Short-Tail Keyword Phrase: “Roman Soldiers Crucifixion Scene”
  2. Original: “14 Now this afternoon I thought it would be no more than right, and after His appearing before us and go through the building and discern the thoughts of the heart… Now I’ve only used three or four little scriptures with you, which God knows that all scriptures join together.”
  • Short-Tail Keyword Phrase: “Afternoon Sermon on Scriptures”

The original titles were 101 tokens, the summarized titles were 15 tokens, using the tokenizer. I think if you could generally reduce those title lengths by 50%—which seems reasonable at a glance—you’d be well within the token limits.
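If you wanted to automate that compression step, something like this could work (the prompt wording is just a guess at what produced those phrases):

```python
# Compress a long segment title into a short-tail keyword phrase.
from openai import OpenAI

client = OpenAI()

def compress_title(long_title: str, max_words: int = 5) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "system",
             "content": (f"Summarize the following section title into a short-tail "
                         f"keyword phrase of at most {max_words} words, keeping its meaning.")},
            {"role": "user", "content": long_title},
        ],
    )
    return response.choices[0].message.content.strip()
```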

:man_shrugging: Apparently this has been a major limitation from the very beginning. The best workaround is passing a document to a more traditional word-processing app. The models aren’t great at doing certain things, but they’re good at telling other programs (that are good at those certain things) what to do.

And, as I think about it, that makes sense. Have you seen GPT word hallucinations in images? They look like script you might see in a dream. This makes sense, on a level. The human mind similarly does not comprehend a word as individual characters.

Tereh are fmaous sudties taht sowh if the fsirt and lsat lteetr in a wrod are crorect, and the lnegth of the wrod is auobt crorect, the hmaun mnid can uesnrdtand it. (ChatGPT wrote that! Wow. :smiling_face_with_three_hearts:)

2 Likes

Great idea! A good way to reduce token usage. Thanks!

1 Like

In my system the token limits are user defined. I actually have embedding configurations which are triggered depending upon the document classification.

You are absolutely correct on that one.

Yep. Right again.

What I decided to do was calculate the tokens in my code. Since I have the start and end line numbers for each segment, I simply calculate the number of characters (minus the linenoXXXX prefixes) and convert that number to “tokens”. Since I’m not using the model for this, my “tokens” come from a simple calculation I use across the board: chars / 5.

Using this method, the token counts in my hierarchical outlines are now always consistent.
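In other words, something along these lines (the exact linenoXXXX prefix pattern is assumed):

```python
# Character-based token estimate: strip the lineno prefixes, count characters,
# divide by 5.
import re

def estimate_tokens(numbered_lines: list[str]) -> int:
    chars = sum(len(re.sub(r"^lineno\d+:?\s*", "", line)) for line in numbered_lines)
    return chars // 5
```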

2 Likes

I looked at this: Semantic Chunking | 🦜️🔗 LangChain

But, I don’t see any advantage over what we are doing here. The advantage to what we are doing is that we have full control, more flexibility, and don’t have to rely on Langchain or python. Well, at least, that’s the advantage for me!

4 Likes

So, I’ve been thinking about how to share this. I still haven’t finished coding the chunking, but that’s the easiest part of this project.

At first, I thought about sharing the code, but it’s written in PHP and highly integrated into my CMS and will be as cumbersome and unwieldy to implement as a Langchain solution.

Then I thought about @sergeliatko’s idea about sharing the API. That would take a little work on my end, but the result would be an API where you tell it where to grab the PDF and set the chunk token limit, and it returns a json array of the hierarchically and semantically chunked content (as we have discussed in this thread), ready to upload to your favorite vector store.
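Purely as a strawman, the call might look something like this; the endpoint, parameter names and response shape below are invented for discussion, nothing is actually deployed:

```python
# Hypothetical client call to the proposed chunking API; nothing here exists yet.
import requests

payload = {
    "document_url": "https://example.com/some-document.pdf",  # where to grab the PDF
    "chunk_token_limit": 512,                                 # user-defined chunk size
}
resp = requests.post("https://api.example.com/semantic-chunk", json=payload, timeout=300)
chunks = resp.json()  # expected: JSON array of hierarchically/semantically chunked content
```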

What do you guys think of that? Think it would be useful to anyone other than us? If so, how would you see it working in a way that would make it very simple but highly useful?

Just an idea I’m throwing out.

2 Likes