Using gpt-4 API to Semantically Chunk Documents

I think the best (and easiest) course would be to eliminate it since the point of the text being crossed out is that it is being replaced. Including it in the embeddings would have the effect of giving the impression the crossed out text is valid.

Will let you know if I come up with an idea. It’s a good catch, and I realize I might also have documents with this issue in the future, so it is relevant to think about.

My approach to solving this type of problem was to create a separate model for each step within the workflow, so that each model knows exactly what task is expected of it, and then to chain the models to complete the workflow. Fine-tuning for a simpler task is also much easier.
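
A minimal sketch of the chaining idea, assuming each step is a plain callable. The three step functions below are hypothetical stand-ins; in a real workflow each one would wrap a call to its own (possibly fine-tuned) model.

```python
from typing import Any, Callable, List

Step = Callable[[Any], Any]


def run_chain(steps: List[Step], data: Any) -> Any:
    """Feed the output of each step into the next one."""
    for step in steps:
        data = step(data)
    return data


# Hypothetical stand-ins for the per-step models.
def clean(text: str) -> str:
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


def split_sentences(text: str) -> List[str]:
    # Naive split on full stops, dropping empty pieces.
    return [s.strip() for s in text.split(".") if s.strip()]


def label(sentences: List[str]) -> List[dict]:
    # Attach a running id to every sentence.
    return [{"id": i, "text": s} for i, s in enumerate(sentences)]


result = run_chain([clean, split_sentences, label], "First  idea.  Second idea.")
```

Because every step has one narrow job, a step can be swapped out or re-tuned without touching the others.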

I solved that by recursive detection of the hierarchy from the leaves to the trunk, running in parallel on each level. First you establish the relations between the smallest blocks (using the purpose/summary of each block as a single item), which produces bigger blocks. Then you identify the purpose/summary of those bigger blocks and detect the relations between them. You keep doing this until you reach the trunk/root.
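
The leaves-to-root loop can be sketched roughly as follows. This is only an illustration under assumptions: `group` and `summarize` are hypothetical callables that would be model calls in practice, and the `Node` type is invented for the example. The per-level summaries are independent of each other, which is what allows them to run in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    summary: str
    children: List["Node"] = field(default_factory=list)


Grouper = Callable[[List[Node]], List[List[Node]]]
Summarizer = Callable[[List[Node]], str]


def build_tree(leaves: List[Node], group: Grouper, summarize: Summarizer) -> Node:
    """Merge blocks level by level until a single root remains."""
    if not leaves:
        raise ValueError("need at least one leaf block")
    level = list(leaves)
    while len(level) > 1:
        groups = group(level)  # decide which neighbouring blocks belong together
        if len(groups) >= len(level):
            groups = [level]  # nothing merged: force a root so the loop terminates
        with ThreadPoolExecutor() as pool:  # summaries of one level run concurrently
            summaries = list(pool.map(summarize, groups))
        level = [Node(s, list(g)) for s, g in zip(summaries, groups)]
    return level[0]


# Toy stand-ins for the model calls (assumptions, not the real logic).
def pair_adjacent(level: List[Node]) -> List[List[Node]]:
    return [level[i:i + 2] for i in range(0, len(level), 2)]


def join_summaries(nodes: List[Node]) -> str:
    return " + ".join(n.summary for n in nodes)


root = build_tree([Node(s) for s in "abcde"], pair_adjacent, join_summaries)
```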

Alright, I spent a lot of my time this weekend testing semantic search approaches that build on our logic. A few lessons learned from my work so far, acknowledging that still more work is required.

The semantic search was done in the context of my work on comparative analysis between two regulatory documents, which I will just refer to as document 1 and document 2 for the remainder of this text.

I initially created a document outline with our approach for both documents. Through that approach I was able to successfully capture the individual articles/paragraphs with specific regulatory requirements in both documents.

My goal then was to identify for each article/paragraph in document 1, the relevant content from document 2 so I could compare and contrast the regulatory requirements. The tricky part is that document 2, while at the top level addressing the same topic, has a very different logical flow, with relevant content being more spread out and sitting in different places.

I tested variations of the following four approaches for semantic search, using cosine similarity as the distance metric, to see what yielded the most accurate matches from document 2 for a given document 1 article/paragraph:

  1. Comparing vector embeddings of document 1 and document 2 paragraphs using the actual paragraph text

  2. Comparing vector embeddings of document 1 and document 2 paragraphs using a summary of the paragraph text

  3. Extracting key requirements / topics covered in a paragraph in the form of a simple comma-separated list, converting them into a vector embedding and using this as the basis for comparison

  4. Comparing vector embeddings of document 1 paragraphs using the actual paragraph text vis-a-vis vector embeddings of individual semantic units of the paragraphs in document 2
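
All four variants share the same ranking step and differ only in what text is embedded. A minimal sketch of that step, using toy vectors in place of real embeddings; the function names and paragraph ids are made up for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_matches(
    query_vec: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    k: int = 10,
) -> List[Tuple[str, float]]:
    """Return the k candidate ids with the highest similarity to the query."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for embeddings of document 2 paragraphs.
doc2 = {"2.1": [1.0, 0.0, 0.0], "2.2": [0.9, 0.1, 0.0], "2.3": [0.0, 1.0, 0.0]}
hits = top_matches([1.0, 0.0, 0.0], doc2, k=2)
```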

Going into the exercise, I was reasonably confident that (1) and (2) (or a combination of both) would yield solid results. While the results were not poor, analysis showed that a relevant paragraph from document 2 was often omitted from the identified matches, even for a larger number of top returned matches (e.g. 10). I attributed that to the fact that, while in principle a paragraph is a semantic unit to me, there are cases where the content covered in a paragraph may be more heterogeneous, even among paragraphs of a similar size. It was in those cases that the results were flawed.

This is what led me to test approaches (3) and (4). Option (3) already helped to improve performance and resulted in fewer omissions. Option (4), however, has so far achieved the most accurate results, with the previously omitted content now being included. Of course, once you start matching at such a granular level, you always risk losing some of the context. So in my intended approach, I will consider not only the identified semantic unit but also the full paragraph it is a part of. Hence, when the model later needs to perform an analysis on the content, it has sufficient context available.
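
One way to keep the paragraph attached to each matched unit is sketched below. The `SemanticUnit` type, its field names and the result layout are assumptions made for the example; the vectors are toy values.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class SemanticUnit:
    unit_id: str
    paragraph_id: str  # the paragraph this unit was cut from
    vector: Sequence[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def match_paragraphs_via_units(
    query_vec: Sequence[float],
    units: List[SemanticUnit],
    paragraphs: Dict[str, str],
    k: int = 10,
) -> List[dict]:
    """Rank units, then return each parent paragraph once, with its best unit."""
    scored = sorted(
        ((cosine(query_vec, u.vector), u) for u in units),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    results, seen = [], set()
    for score, unit in scored:
        if unit.paragraph_id in seen:
            continue  # a better-scoring unit of this paragraph is already listed
        seen.add(unit.paragraph_id)
        results.append({
            "paragraph_id": unit.paragraph_id,
            "matched_unit": unit.unit_id,
            "score": score,
            "paragraph_text": paragraphs[unit.paragraph_id],  # full context for later analysis
        })
    return results


units = [
    SemanticUnit("2.3-a", "2.3", [1.0, 0.0]),
    SemanticUnit("2.3-b", "2.3", [0.8, 0.2]),
    SemanticUnit("2.7-a", "2.7", [0.6, 0.4]),
    SemanticUnit("2.9-a", "2.9", [0.0, 1.0]),
]
paragraphs = {"2.3": "Full text of 2.3", "2.7": "Full text of 2.7", "2.9": "Full text of 2.9"}
matches = match_paragraphs_via_units([1.0, 0.0], units, paragraphs, k=3)
```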

With these preliminary findings in mind, I am now leaning towards a refined approach whereby I will evaluate sections/paragraphs identified through the hierarchy outline process for further breakdown. However, instead of doing this on the basis of the size of the identified section/paragraph, I’ll do it based on content heterogeneity or semantic coherence (for lack of a better term). My initial idea is to create a simple classification approach (using either a fine-tuned model or embeddings-based classification) to evaluate whether a section contains more than one semantic idea. If yes, I will then apply further semantic chunking and use the created semantic chunks as a basis for performing the semantic search.
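
As a rough illustration of the embeddings-based variant, the sketch below swaps in a simple threshold heuristic instead of a trained classifier: it averages the pairwise cosine similarity of the sentence vectors in a section and flags the section when the average is low. The function names and the `0.6` threshold are assumptions that would need tuning on real data.

```python
import math
from itertools import combinations
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def coherence(sentence_vectors: List[Sequence[float]]) -> float:
    """Mean pairwise cosine similarity; 1.0 when there are fewer than two vectors."""
    pairs = list(combinations(sentence_vectors, 2))
    if not pairs:
        return 1.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def needs_further_chunking(
    sentence_vectors: List[Sequence[float]], threshold: float = 0.6
) -> bool:
    """Flag a section whose sentences point in clearly different directions."""
    return coherence(sentence_vectors) < threshold


# Toy vectors: one section about a single idea, one mixing two ideas.
homogeneous = [[1.0, 0.0], [0.9, 0.1]]
heterogeneous = [[1.0, 0.0], [0.0, 1.0]]
```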


That is why I introduced the “atomic idea” in the first place, with chunking based on the change of idea (even within a single paragraph) rather than on anything else.

The issue of losing context because the chunks are short arises because most solutions do not try to replicate the human approach to context retrieval, where the human spots items containing relevant content on multiple “detail levels” at the same time. By multiple “detail levels” I mean chunks representing only the text alongside chunks representing more abstract elements like outlines, summaries etc. Then, once we know where the context is taken from, we focus on the source and extract the surrounding context with details (no matter if we do it mentally or by source text lookup) to get everything we need…

In the LLM context this means the initial search is there simply to spot the areas (sources) that are likely to contain the answer, and the app still needs to go and grab the rest of the details it needs to produce the result. That last part is missing in most of the apps I have seen so far.


Amazing, amazing work. Thank you so much for sharing.

Could you please elaborate a bit on how this differs from this?:


How do you determine the topics? Do you provide a pre-determined list or do you let the model create it? I’ve been rolling around the idea of categorization/classification of individual chunks myself, but still haven’t worked out a good approach.

Good stuff!

Right now, I’m still super-focused on getting the best embeddings I can. However, as I move on to enhancing my “generation”, I’m looking to expand on this methodology: Advanced RAG 01: Small-to-Big Retrieval | by Sophia Yang, Ph.D. | Towards Data Science

Thanks to our hierarchical/semantic chunking approach, I can now always relate any individual chunk to the larger semantic chunk from which it was created. And, using the title path, I can also climb up the hierarchical chunk ladder if necessary. In essence, I can have the model do this:

So much of this just feels so much more clear now. Thank you for your contributions to this discussion.

I wonder how much of that might have been due to structural similarity - in some situations texts that are written similarly will match unreasonably well. Did you observe that?

Oh yes, I remember that “try”… What she is missing is that the big chunk still contains the “noise” information and can easily eat your attention window.

What I’m talking about is an approach where your vector search looks not only in the text itself (doesn’t matter small or big) but also in other information like:

  • what the structure of the chunk itself is (summary of the text + its structured outline)
  • what the main entities in the text are (salience-ordered list of entities present in the text / high-level, detailed title of the chunk generated by AI)
  • where this chunk belongs (the high-level title of the parent)

My approach therefore produces more chunks than there are in the text, because chunks are either “leaves” or “branches”. Leaves are atomic ideas; they have no outline, only a high-level title plus the high-level title of the parent section. Branches represent section structures; they carry no text from the source but an outline instead (the high-level titles of the children, indented to show the hierarchical relation two levels down).

Independently from all the data above, each chunk has info about whether it is “text” or “container” type and the exact path from the root.

Then the whole thing is converted to text representation (excluding the path), ordered as title, body, entities, parent title for text elements, and title, outline, entities, parent title for containers.
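
The ordering described above could be serialized along these lines. The `Chunk` type, its field names and the newline-joined layout are assumptions made for the example; only the field order and the exclusion of the path come from the description.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Chunk:
    kind: str  # "text" (leaf) or "container" (branch)
    title: str
    parent_title: str
    path: List[str]  # exact path from the root; kept as metadata, not embedded
    entities: List[str] = field(default_factory=list)
    body: str = ""  # source text, used by "text" chunks
    outline: str = ""  # indented child titles, used by "container" chunks


def embedding_text(chunk: Chunk) -> str:
    """Build the text that gets embedded; the path is deliberately left out."""
    if chunk.kind == "text":
        middle = chunk.body
    elif chunk.kind == "container":
        middle = chunk.outline
    else:
        raise ValueError(f"unknown chunk kind: {chunk.kind!r}")
    parts = [chunk.title, middle, ", ".join(chunk.entities), chunk.parent_title]
    return "\n".join(part for part in parts if part)


leaf = Chunk(
    kind="text",
    title="Notice period",
    parent_title="Termination",
    path=["root", "Termination", "Notice period"],
    entities=["tenant", "landlord"],
    body="The tenant must give 30 days notice.",
)
branch = Chunk(
    kind="container",
    title="Termination",
    parent_title="Lease agreement",
    path=["root", "Termination"],
    entities=["tenant"],
    outline="Notice period\n  Written form",
)
```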

So the resulting vector is naturally weighted toward the center of topics with a little surge toward the textual content of the element. Also the presence of entities and outlines diminishes the “noise” in the final vector used by RAG.

When running a vector search over these chunks, I get results where the top contains a mixture of text-type chunks (usually the right ones, but sometimes “aliens” that got there for no comprehensible reason) and the containers that contain the answer. My vector search uses not only the question but also a slight push toward the center of several synthetic text samples (word combinations that are likely to appear in or around the desired answer). The containers are almost 100% of the time sorted starting with the one I need to find, followed by the ones with additional info related to the query, then less related info (just in case you need it), and maybe one or two “aliens”.
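
The post does not say how the push toward the synthetic samples is computed. One plausible reading, shown purely as an assumption, is a linear blend of the query vector with the centroid of the sample vectors; `alpha` is a hypothetical weight controlling how strong the push is.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def centroid(vectors: List[Sequence[float]]) -> List[float]:
    """Component-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def nudged_query(
    query_vec: Sequence[float],
    sample_vecs: List[Sequence[float]],
    alpha: float = 0.2,
) -> List[float]:
    """Blend the query with the centroid of the synthetic samples, then renormalize."""
    q = normalize(query_vec)
    if not sample_vecs:
        return q
    c = normalize(centroid([normalize(v) for v in sample_vecs]))
    return normalize([(1 - alpha) * qi + alpha * ci for qi, ci in zip(q, c)])


# Toy vectors: the query points along x, the synthetic samples along y.
pushed = nudged_query([1.0, 0.0], [[0.0, 1.0], [0.0, 2.0]], alpha=0.2)
```

With a small `alpha` the question still dominates the search vector, and the samples only tilt it toward the region where the answer is expected.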

Since I started playing with this, I’ve basically never seen the wrong container in the top results. Also, almost all the time the containers come in order from the smallest to the root, where the smallest is the one that contains the answer. If the answer is spread across “branches”, I see similar patterns for all related branches (by branch I mean the path from the root to the target paragraph/idea, including siblings).

But as my app requires near-absolute precision (legal docs analysis), I added “security” layers before I use any of the chunks: models that confirm whether a found item is useful for producing the answer or adds info that improves the answer quality. The goal is to eliminate the noise, save resources on detailed context retrieval, and avoid adding unrelated info to the prompt.

Then I check whether the filtered results contain enough of the info I need to answer the query (here the outlines of grandparent containers help a lot, as the filter “sees” the detailed summary of what is around the found items to decide if more context is needed). I use the path of the element to get the grandparent path (remove the last 2 items in the path of the element, if I don’t already have the needed container in the results).
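
The path manipulation is small enough to show directly. This sketch assumes paths are stored as sequences of titles from the root down; the function names and the example path are made up.

```python
from typing import Iterable, List, Sequence, Set, Tuple

Path = Tuple[str, ...]


def grandparent_path(path: Sequence[str]) -> Path:
    """Drop the last two items of a root-to-element path (never above the root)."""
    return tuple(path[:-2]) if len(path) > 2 else tuple(path[:1])


def containers_to_fetch(result_paths: Iterable[Sequence[str]], already_have: Set[Path]) -> List[Path]:
    """Grandparent containers that are not yet among the search results."""
    needed: List[Path] = []
    for path in result_paths:
        gp = grandparent_path(path)
        if gp and gp not in already_have and gp not in needed:
            needed.append(gp)
    return needed


element = ("root", "Part 2", "Section 4", "Art. 12")
missing = containers_to_fetch([element], already_have=set())
```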

And only if the filter decides it needs more content do I go and get it, by pulling the text-type children of the container I need (a simple GraphQL query). But from what I see, it’s rare that I need more context, because I pull more results on the first query to begin with and most of the noise gives way to valuable items.

Hope that helps


Comparing vector embeddings of document 1 and document 2 paragraphs using the actual paragraph text

Under this approach I embed the full text of a paragraph (where a paragraph can be anywhere between 30 and 200 words) that is identified through the hierarchical outline process.

Comparing vector embeddings of document 1 paragraphs using the actual paragraph text vis-a-vis vector embeddings of individual semantic units of the paragraphs in document 2

The way I initially tested this involved embedding the full text of a paragraph from document 1 (as per above) while embedding smaller semantic units of a given paragraph in document 2. So I basically broke down the paragraphs into further semantic units, recognizing that some paragraphs, despite not having more than 100-200 words, address multiple different requirements. When just embedding the full text, semantic search often failed to identify these paragraphs as matches, despite including relevant content. When I opted for the embedding and search based on the smaller semantic units, they were selected as expected.

I am currently in the process of creating a basic fine-tuned model for the creation of the semantic units as I am not yet consistently happy with how the model semantically chunks a paragraph. Will update this post once I have done some basic testing later today.

Currently I am just using a standardized prompt and having the model do the classification. Again, acknowledging that a paragraph may contain multiple different requirements, I ask the model to return a comma-separated list of the nature of requirements / topics covered in the paragraph.

Interesting point. I have not yet specifically analyzed for that and on the surface I don’t see the pattern just yet. That said, I still have on my plate to take a deeper look at the high ranking matches that were not relevant and identify the common root causes that have led to the match in the hopes I can identify strategies for removing them. So far my focus was on ensuring the right chunks were present at all. Will let you know if I find anything interesting.

I will note that an obvious source of noise was things like definitions at the beginning of a document, and I have already removed these from the pool of embeddings used for the search.

There might be a solution to the strikethrough/strikeout text issue.

I have searched high and low and can’t find any PDF to text extractor which can even identify if the PDF has strikethrough/strikeout text, let alone remove it. However, when I tried using a model to do it, I at least was able to make some progress: Discord

Unable to share the ChatGPT GPT-4o chat, but this was its output (after correcting for not excluding the same strikeout dates that Gemini repeatedly missed):

This is the source PDF:

The problem here, of course, is that I would now have to upload the PDF to the model for it to simply identify if there are strikethrough characters.

The alternative is to use the model to do the text extraction in the first place. Something I never considered before, but I’m starting to warm up to the idea.

But my worry there is – how reliable will these models be at extracting the exact text, and not “hallucinating”?
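
One cheap sanity check for that worry, offered as an assumption rather than a tested workflow: a conventional extractor still returns the struck-out text, so a model extraction that only removes strikeouts should be obtainable from it by deletions alone. Any token the model adds or changes can then be flagged for review. The function name is made up; `difflib` is from the Python standard library.

```python
import difflib
from typing import List


def inserted_tokens(reference_text: str, model_text: str) -> List[str]:
    """Tokens in the model output that do not line up with the reference extraction."""
    ref = reference_text.split()
    out = model_text.split()
    matcher = difflib.SequenceMatcher(a=ref, b=out, autojunk=False)
    added: List[str] = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):  # deletions are expected, additions are not
            added.extend(out[j1:j2])
    return added


# The reference comes from a conventional extractor and still contains the struck-out date.
reference = "The fee is due on 1 March 2023 1 April 2024 each year."
clean_output = "The fee is due on 1 April 2024 each year."
suspect_output = "The fee is due on 1 April 2024 every year."
```

This only catches added or altered words; it cannot tell whether the model deleted the right ones.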

The most frustrating part of this is the fact that, with all the hoopla and hype about these models being “multi-modal”, they still struggle with the simple task of finding and eliminating strikethrough text from a standard PDF. And we still have to worry about them not inserting their own text in extractions.

I mean, seriously, what is the business use of having a model identify a handwritten picture of a duck?

If these things were reliable text extractors, THAT’s a real business use. Today.

I did briefly try one of Python’s PDF libraries (I can’t remember off the top of my head which one it was, and I seem to have already deleted the script), and in principle it was possible to identify and remove strikethrough text. However, when tested with your document, the results were rather discouraging. I have not since had time to look into alternatives.

I personally think that besides the cost component, it is likely not the most efficient approach. It is likely that the model will return the document content verbatim only up to a certain limit without any issues (well below the token limit) - so you would have to keep the input text limited. I tried something along those lines at some point last year and often observed the model slowing down or even failing at the request after a certain volume of text was reached.
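
Keeping each request small could look like the sketch below, which packs paragraphs into batches under a word budget. The function name and the `max_words` default are assumptions; the right limit would have to be found by testing the model in question.

```python
from typing import List


def batch_paragraphs(paragraphs: List[str], max_words: int = 800) -> List[List[str]]:
    """Group consecutive paragraphs so that each batch stays within max_words.

    A single paragraph longer than max_words is placed in a batch of its own.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    count = 0
    for paragraph in paragraphs:
        words = len(paragraph.split())
        if current and count + words > max_words:
            batches.append(current)  # close the batch before it overflows
            current, count = [], 0
        current.append(paragraph)
        count += words
    if current:
        batches.append(current)
    return batches


batches = batch_paragraphs(["one two three", "four five six", "seven eight nine"], max_words=6)
```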

All this to say, I still have on my list to look at this again in more detail. Will let you know if I end up finding some alternative options.

Very interesting topic. I have the same problem with open-source chunking solutions, so I have to build one of my own.

For this you could use self-RAG, which helps reduce hallucination. The ChatGPT model also has a hallucination rate ranging from 2% to 3.5%; by combining it with self-RAG I managed to correct some of the hallucinations.
Maybe you can try it in your chunking solution.

By the way, your Discord link does not work.

I got started by properly defining the task: separate the text into blocks at each change of subject or idea, so that each resulting block contains only one idea.

Then I built an app that takes the input text in big chunks (because of the token limit) and returns the same text with the blocks separated by a special string ( <-----> in my case), saving it in a separate text file.

Then I used a code comparison tool to spot the differences between the two files and adjusted the splits manually (it took about a week of full-time editing to get the first excellent dataset of about 150 pages of raw text converted into a JSONL file).
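
Part of that comparison can be automated before the manual pass. The sketch below assumes the only legitimate difference between the two files is the separator plus whitespace, and it writes records in a generic input/output layout; the helper names and the record keys are hypothetical and would need to match whatever format the fine-tuning service expects.

```python
import json
from typing import List

SEPARATOR = "<----->"


def split_blocks(marked_text: str) -> List[str]:
    """Split the splitter output into blocks, dropping empty ones."""
    return [block.strip() for block in marked_text.split(SEPARATOR) if block.strip()]


def same_text(source: str, marked_text: str) -> bool:
    """True if both texts match once separators and whitespace are ignored."""
    def squash(text: str) -> str:
        return "".join(text.split())
    return squash(source) == squash(marked_text.replace(SEPARATOR, ""))


def to_jsonl_record(source: str, marked_text: str) -> str:
    """Serialize one training example, refusing output that altered the text."""
    if not same_text(source, marked_text):
        raise ValueError("splitter changed the text; review this example manually")
    blocks = split_blocks(marked_text)
    record = {"input": source, "output": f"\n{SEPARATOR}\n".join(blocks)}
    return json.dumps(record, ensure_ascii=False)


source = "Rent is due monthly. Pets are not allowed."
marked = "Rent is due monthly.\n<----->\nPets are not allowed."
line = to_jsonl_record(source, marked)
```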

Then I fine-tuned on this file and started using that model as the splitter in my app to get about another 500 pages. Once I fine-tuned on the totality of the data, I had a nice tool that does the job. Read my previous posts; I already explained my approach to continuous fine-tuning.

It was a monstrous job, but the result is worth it.

Use ready-to-use APIs for that.

So far, I’ve not found a PDF to text extraction API (I use Textract a lot) or library which can identify and exclude strikethrough text. Well, that’s not true. Marker GitHub - VikParuchuri/marker: Convert PDF to markdown quickly with high accuracy will do it, but with the test document I gave it, it didn’t do it well.

So, my goal with using a model is simply to handle this particular use case – extracting text from PDF while excluding strikeout.

If you know of another way to do it, please let me know! As I said, I use Textract a lot and was on a call with AWS support for an hour just today – they can’t do it. They recommend using Claude Sonnet.

Have you checked the APIs that convert a PDF with line-through text into an editable format that supports line-through markup, like Word, and then read the file directly from that? From what I have seen on the subject, it might work.

Because of the myriad ways in which an input specification in PDF files can give rise to strikethroughs, it might be a very difficult programming exercise to detect all strikethroughs. That said there might be only a few subsets of methods through which strikethroughs are implemented in common use cases.
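
To illustrate just one of those subsets: when the strikethrough is drawn as a separate thin horizontal line, it can be detected geometrically. The sketch assumes word boxes and horizontal line segments have already been extracted with some PDF library; the `Word` and `HLine` types, the band and the coverage threshold are all hypothetical. Strikeouts stored as annotations or font attributes would not be caught.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class HLine:
    x0: float
    x1: float
    y: float


def is_struck(word: Word, line: HLine,
              band: Tuple[float, float] = (0.25, 0.75),
              min_cover: float = 0.5) -> bool:
    """True if the line runs through the middle band of the word box and
    covers at least min_cover of its width (an underline sits outside the band)."""
    height = word.y1 - word.y0
    width = word.x1 - word.x0
    if height <= 0 or width <= 0:
        return False
    relative_y = (line.y - word.y0) / height
    if not band[0] <= relative_y <= band[1]:
        return False
    overlap = min(word.x1, line.x1) - max(word.x0, line.x0)
    return overlap / width >= min_cover


def drop_struck_words(words: List[Word], lines: List[HLine]) -> List[Word]:
    return [w for w in words if not any(is_struck(w, l) for l in lines)]


words = [Word("due", 0, 0, 20, 10), Word("March", 25, 0, 55, 10), Word("April", 60, 0, 90, 10)]
lines = [HLine(x0=25, x1=55, y=5), HLine(x0=60, x1=90, y=9.5)]  # second line is an underline
kept = [w.text for w in drop_struck_words(words, lines)]
```

The band is symmetric around the middle of the box, so the check works whether the y axis grows upward or downward.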

However, logically it seems to me that it might be better for a function to “see” the strikethroughs and “omit” them from the resulting text, which argues for a vision-model-based approach to removing strikethroughs. For a large PDF, there’s nothing to prevent reading one page at a time to remove strikethroughs.
