Did you try the semantic chunking from langchain
I have not tried it yet, but the approach looks similar to yours.
Yeah - these tools are an option for sure, but ultimately they are quite costly and create undue vendor dependence, so my goal is to develop my own custom solution for the scope of documents I am dealing with.
In my thinking, I’ve already done this “pre-chunk” using Semantic Chunking: https://youtu.be/w_veb816Asg?si=yr4TLKFi_sGex4Pm
Now, I want to semantically chunk these semantically chunked pre-chunks (I hope that sentence makes sense!) into semantically complete pieces of text that are no larger than x tokens.
At this point, we are talking about sections, subsections, chapters, or sub-chapters.
In the case of The Bible, for example, we would be talking about breaking down the chapters by verse.
In the case of The Talmud, we are talking about breaking down the tractates by “dafs”.
In a legal contract or municipal code, we would be talking about breaking down sections or sub-sections semantically.
My thinking is that we could use the model to do this – we just need the right prompt which tells it what to do.
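To make this concrete, here is a minimal sketch of the kind of prompt-driven splitter I have in mind, using the OpenAI Python SDK; the model name, prompt wording and JSON output convention are placeholders rather than a settled recipe:

```python
# Minimal sketch: ask the model to split a pre-chunk into semantically complete
# pieces under a token budget. Model, prompt and output format are placeholders.
import json
from openai import OpenAI

client = OpenAI()

def sub_chunk(pre_chunk: str, max_tokens: int = 300) -> list[str]:
    prompt = (
        "Split the following text into semantically complete sections. "
        f"Each section must stay under roughly {max_tokens} tokens and must not "
        "cut an idea in half. Return a JSON array of strings and nothing else.\n\n"
        + pre_chunk
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)
```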
Yes, keep us updated. I’m sure you’ll come up with that prompt!
Yes, it all makes sense. I have a generic prompt that is generally working, i.e. it returns logical pieces of text. That said, it needs to be customized further for different use cases. As you rightly point out, the appropriate approach depends on the type of document.
I’ve got that from linguistics; the key is to thoroughly follow the human text comprehension workflow:
Glance → identify blocks → scan blocks → read blocks → identify subject → cut blocks on subject change → re-read blocks → identify parents/children → trace semantic connections between blocks → establish causality
Currently I have 7 trained models for most of the tasks (some are redundant for machine vs human) to get that workflow. Using asynchronous requests to the API speeds up the process considerably: a full parsing of a legal document of about 15 pages takes 2 minutes and costs about 1.5 USD. As a result you get a full hierarchical tree with nested chunks, each being 1-3 sentences (humans generally lose focus on an idea after 3-4 sentences). Personally, I consider these “almost perfect” chunks.
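For illustration, the concurrency part boils down to something like the sketch below; the model and prompt here are placeholders, the point is the asyncio.gather pattern that gives the speed-up:

```python
# Rough sketch of the async speed-up: all per-chunk requests are fired
# concurrently instead of one by one. Model and prompt are placeholders.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def process_chunk(chunk: str) -> str:
    response = await client.chat.completions.create(
        model="gpt-4o",  # placeholder
        messages=[{"role": "user", "content": f"Identify the subject of:\n\n{chunk}"}],
    )
    return response.choices[0].message.content

async def process_document(chunks: list[str]) -> list[str]:
    # Send all chunks concurrently and collect the results in order
    return await asyncio.gather(*(process_chunk(c) for c in chunks))

# results = asyncio.run(process_document(pre_chunks))
```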
Thanks for sharing. You’ve definitely got an interesting and sound approach there, and there are some good learnings and insights to take from it.
A big part of the challenge is that I am dealing with a highly diverse body of often very long documents with heterogeneous document architectures etc. The documents share commonalities but prior to applying the chunking mechanism I need to have a very sound approach for identifying and analyzing their architecture and then make a determination on how to best chunk them. In certain cases the architecture is defined through a table of contents etc., which facilitates this process. In others it requires a bottom-up approach (which I think might be similar to what you are doing as part of the semantic analysis).
I also think that besides the subject and hierarchical analysis, it will be useful to classify informational chunks by their functional role and purpose in the document.
So as part of what I am working towards I think I will need to define a universal taxonomy that distinguishes between different document/information types and then breaks down the common elements of the documents. This taxonomy would then serve as a common baseline for the architectural analysis.
As said, lots of details to think through. But hopefully once I get it right for one subset of documents, it will be easier to replicate it across other ones.
What I stated is the algorithm.
It’s your basic matched filter.
Matched filtering is used in many places; this would be the matched filter approach to semantic chunking.
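One way to read that in text terms (a loose sketch, not the exact algorithm): correlate a running “template” of the current chunk against each new sentence embedding and cut where the response drops. Here embed() stands in for any embedding model and the threshold is arbitrary:

```python
# Loose matched-filter style sketch over sentence embeddings: the "template"
# is the running mean of the current chunk; weak correlation with the next
# sentence signals a boundary. embed() and the threshold are stand-ins.
import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError("plug in your embedding model here")

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def chunk_boundaries(sentences: list[str], threshold: float = 0.75) -> list[int]:
    vectors = [embed(s) for s in sentences]
    boundaries, start = [], 0
    for i in range(1, len(vectors)):
        template = np.mean(vectors[start:i], axis=0)  # template of the current chunk
        if cosine(template, vectors[i]) < threshold:
            boundaries.append(i)
            start = i
    return boundaries
```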
I see. What I didn’t express well enough was that the approach I described above is just mere chunk preparation before importing into the vector DB. We are not even close to analysis. This is just the initial layer in the “comprehension” module. The goal here is to get semantic chunks with isolated ideas, so that their vectors do not get “diluted” by unrelated information and produce precise results (closer vector matches) in the “data mining” module which comes after “comprehension”.
Once you have your chunks of “source” data, you may start extracting “knowledge” out of them by asking precise questions and filling your knowledge DB with answers based on the data “hidden” in the source. Here, the sky is the limit, and your approach depends on your application design and goals.
Classification, if necessary, comes after this step.
The real analysis and conclusions are steps that come much later.
Classic case of miscommunication
What I meant was not the extraction of insights but rather the document structure analysis.
Let’s take a regulation as one example. The document in most cases has multiple different components. This may include a section at the beginning with definitions of key terms, followed by the articles with the detailed regulatory requirements, etc. Performing an initial analysis of the document’s components would help me in two ways:
(1) It helps to adjust the chunking approach. For example, for definitions I would strictly treat every term as a separate chunk, while for regulatory requirements I would apply a more nuanced semantic chunking approach.
(2) It serves as valuable metadata for embedding and subsequent retrieval, where it could be useful for improved filtering.
I can see that one could also perform such an analysis after the chunking but in my case I see a benefit in completing it prior to the actual chunking.
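As a toy illustration of point (2): each chunk would carry its functional role as metadata, and retrieval would filter on it before similarity scoring. The field names and the score() stub below are made up for the example:

```python
# Toy illustration: chunks carry their functional role as metadata and
# retrieval filters on it before similarity scoring. Field names and the
# score() stub are made up for the example.
chunks = [
    {"text": "'Shareholders' means the parties listed in the preamble ...", "section_type": "definition"},
    {"text": "The Shareholders undertake to finance the initial needs ...", "section_type": "obligation"},
]

def score(query: str, text: str) -> float:
    raise NotImplementedError("plug in embedding similarity here")

def retrieve(query: str, section_type: str, top_k: int = 3) -> list[dict]:
    # Filter on the functional role first, then rank by similarity
    candidates = [c for c in chunks if c["section_type"] == section_type]
    return sorted(candidates, key=lambda c: score(query, c["text"]), reverse=True)[:top_k]
```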
SHAREHOLDERS’ AGREEMENT (PACTE D’ASSOCIÉS)
Preamble and purpose of the shareholders’ agreement
Identification of the contracting parties:
- Identification of the company « HOLDING DMS » with representation details ...
- Identification and representation of the company SAS DEVELOPPEMENT ...
- Identification of the company SAEM LES SAISIES VILLAGES TOURISME ...
- Identification and representation of the company SPL DOMAINES SKIABLES DES SAISIES ...
- Identification and representation of the company « Family Office » ...
Definition of the contracting parties as “Associés” (Shareholders),
Identification and representation of the company SH LES SAISIES at the time of the agreement
Identification of the company SH LES SAISIES and its representative ...,
Preamble on the incorporation of the company SH LES SAISIES and its objectives:
Preliminary statement on the incorporation of the company SH LES SAISIES and the ownership structure of the subsidiary « LES CHALLIERS ».
Estimate of the total construction cost of the tourist complex:
Financing structure of the construction operation:
Interest of MGM EXPLOITATION in managing the tourist complex, subject to a management contract.
Estimate of the furniture investment budget for operating the tourist complex.
Formalization of the partnership for the creation and operation of the tourist complex « Les Challiers ».
Preamble and purpose of the Shareholders’ Agreement for the creation of the tourist complex « Les Challiers »:
ARTICLE 1 – STRUCTURING OF THE PARTNERSHIP
1.1 – Incorporation of a simplified joint-stock company (SAS) and allocation of the share capital:
1.2 - Incorporation and capital allocation of the real-estate company (SCI) « LES CHALLIERS »:
1.3 Financing arrangements for the construction of the tourist complex.
1.4 - Post-construction lease terms between LES CHALLIERS and SH LES SAISIES, with reference to the annex.
1.5 - Initial financing for fit-out and launch of operations of SH LES SAISIES.
ARTICLE 2 – FINANCIAL NEEDS
Commitment of the Shareholders to the initial financing of SH LES SAISIES and its subsidiary LES CHALLIERS through shareholder current-account contributions:
- Current-account contribution commitment by SPL DOMAINES SKIABLES DES SAISIES
- Financial participation of HOLDING DMS
- Financial participation of SAS DEVELOPPEMENT
- Current-account contribution by Family Office to the company’s capital
Specific current-account contribution by SPL DOMAINES SKIABLES DES SAISIES for the company SH LES SAISIES;
Procedure for capital calls addressed by the President to the Shareholders.
Repayment terms for shareholder current-account advances and conditions of non-solicitation before the 5th anniversary of the opening to the public.
Remuneration terms for current-account advances at the tax-deductible interest rate.
Obligation to sell one’s stake for failure to meet capital calls.
ARTICLE 3 – TRANSACTIONS ON SECURITIES
Prohibition of security interests over the Company’s Securities for the duration of the agreement.
Definition of the term “Securities” within the Agreement
(i) Definition and examples of securities and transferable securities issued by the Company,
(ii) Allocation or subscription rights for transferable securities or similar instruments
(iii) Inclusion of all transferable securities issued by the Company in the shareholders’ rights.
Approval clause for the transfer of securities to third parties, requiring an extraordinary collective decision of the shareholders.
Exhaustive definition of the term “Transfer” in the Agreement
Definition of the term “Third Party”, excluding signatories and controlled entities within the meaning of Article L 233-3 of the French Commercial Code.
Preferential subscription right of the Shareholders in the event of a capital increase.
Prohibition of non-compete commitments upon the Transfer of Securities.
3.1 - Inalienability of the securities
Inalienability clause on the securities for
Here is an example of how the chunk approach works in real life, just the first layer, the one I’m talking about. (The source contract is in French; the chunk titles above are given in English.)
What you see is the “tree” of a resulting document (JSON structure), built by a simple function that walks through the tree, grabs the chunk title (leaving out the content) and prepends tabs based on the chunk’s depth level in the document (the full tree is not shown here).
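A simplified version of that walker looks like this (the actual keys in our JSON differ; this is just to show the walk and the indentation):

```python
# Simplified outline builder: walk the chunk tree recursively, print each
# title, and indent by depth. The "title"/"children" keys are a simplified
# stand-in for the actual JSON schema.
def print_outline(node: dict, depth: int = 0) -> None:
    print("\t" * depth + node.get("title", ""))
    for child in node.get("children", []):
        print_outline(child, depth + 1)

# Tiny illustrative tree
print_outline({
    "title": "SHAREHOLDERS' AGREEMENT",
    "children": [
        {"title": "Preamble and purpose of the shareholders' agreement", "children": [
            {"title": "Identification of the contracting parties"},
        ]},
        {"title": "ARTICLE 1 - STRUCTURING OF THE PARTNERSHIP"},
    ],
})
```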
Is that what you need as your first item? (Here each term definition is a separate chunk).
The metadata can be obtained by queries run against the vectorized chunks.
Yup, this would be part of what I need. I will likely add another more abstract layer so as to enable improved comparability across different documents.
In any case, I will need to work through a proof of concept for what I have in mind as the application. Hopefully, by the end of the month I can report back with some more conclusions on where I’ve landed.
Also, thanks again for sharing abstracts of your work. It’s always interesting and helpful to understand how others are approaching these problems - and sparks new ideas. One of the key reasons that makes this Forum such a great place!
@jr.2509, @sergeliatko, @joyasree78
This concept of using the model itself to semantically sub-chunk (if I can use that expression) hierarchically chunked content is fascinating and I’d like to explore it even more. We all seem to be on the same page as to what we wish to accomplish, but have very different ideas on how to get there – which is a good thing.
But, I think we may be hijacking the original intent of this thread. Should I create a new topic to continue this particular discussion? If so, how should I title it?
I am all for it :) One recommendation for the title is “context distillation for LLMs”.
Now, that is impressive. This is similar to what we achieve, but so far it’s a manual slog: 2023 Theatrical and Television Memorandum of Agreement | labor.booksaAI.org - a booksAI.org project
In this particular case, most (if not all) of these do not need to be sub-chunked. But, if they did, that’s where I would want to use the model to create the chunks.
Wait, we both have backgrounds in French and Linguistics? C’est incroyable!!
I’ll talk to my associate to see if LAWXER would make the “comprehension” module available as a general public API. What we currently have is an internal async API that takes URLs of several files (pretty much anything, as we have conversions and OCR) and posts the structured JSON of the document folder (several documents under the same root) to a callback URL.
Originally I’m from Belarus; I used to be a tech interpreter for English, Russian and German. Then I moved to the USA in 2000 but didn’t like it enough to stay, so I moved to France in 2021, worked in hospitality (chef barman), then switched to development back in 2011… Now I’m owner/CTO at both TechSpokes Inc and LAWXER SAS. One is software for vacation rentals, the other is legal document analysis AI.
Which ones do you have as well?
Nothing prevents you from merging leaves with their parent branch, using pretty much any programming language, if you’re OK with being a bit less precise.
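Roughly like this, assuming a simple content/children layout for the tree (the real schema differs):

```python
# Sketch of merging leaves into their parent branch to get coarser chunks.
# The "content"/"children" keys are assumptions about the tree layout.
def merge_leaves(node: dict) -> dict:
    children = node.get("children", [])
    if children and all(not c.get("children") for c in children):
        # Every child is already a leaf: absorb their content into this branch
        parts = [node.get("content", "")] + [c.get("content", "") for c in children]
        node["content"] = "\n".join(p for p in parts if p)
        node["children"] = []
    else:
        node["children"] = [merge_leaves(c) for c in children]
    return node
```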
Yep, that’s a pretty good idea. No particular preference for the title, but I guess that’s because it’s already getting pretty late where I am.
So, I’m trying to understand. Right now, we bookmark a PDF document, split it by bookmark (which creates several document “chunks”), then run those chunks through an embedding process which further sub-chunks those documents that exceed x tokens.
Let’s take that same PDF – how would it work with your API to achieve the same results (i.e., a list of the hierarchical structure of the document)?
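For reference, the bookmark-split step I described looks roughly like the sketch below, assuming pypdf for the outline and tiktoken for the token count; the threshold is arbitrary and this is not our exact pipeline:

```python
# Rough sketch: split a PDF at its top-level bookmarks and flag any piece
# that exceeds a token budget for further sub-chunking. pypdf/tiktoken are
# assumed tooling choices, not the exact pipeline.
from pypdf import PdfReader
import tiktoken

def split_by_bookmarks(path: str, max_tokens: int = 2000) -> list[dict]:
    reader = PdfReader(path)
    enc = tiktoken.get_encoding("cl100k_base")

    # Top-level bookmarks only; nested lists inside reader.outline are sub-bookmarks
    tops = [o for o in reader.outline if not isinstance(o, list)]
    starts = [reader.get_destination_page_number(o) for o in tops]

    pieces = []
    for idx, (bookmark, start) in enumerate(zip(tops, starts)):
        end = starts[idx + 1] if idx + 1 < len(starts) else len(reader.pages)
        text = "\n".join((reader.pages[i].extract_text() or "") for i in range(start, end))
        pieces.append({
            "title": bookmark.title,
            "text": text,
            "needs_subchunking": len(enc.encode(text)) > max_tokens,
        })
    return pieces
```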
Thanks. Appreciated. I probably still want to experiment a little bit on my own first. It’s part of what makes the journey so fun.
Let’s instead take a bunch of photos of a contract and all its annexes, taken with a smartphone.
You upload them to your server,
Then you send us a request with the URLs ordered by page number, an order id (so that you know what we send back to you), and a callback URL parameter where you want the structured JSON posted when it is ready.
We reply HTTP 202 (accepted), or an error if something goes wrong with your request.
We convert the photos to raw text (a flat, ugly string of characters).
Then it goes through our pipeline that handles the whole process.
When the JSON structure is ready, we send it to the callback URL you provided, together with the order id.
If anything goes wrong, we post the error message with the order id to the same URL.
The JSON will be an element tree with nested children (multilevel, depending on the document) where each leaf is a chunk with its content, title, name (purpose) and path from the root. If multiple docs were sent in the same order, they will be first-level children of the root.
Then you do whatever you need with this.
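To give a feel for the exchange, something like the following; the endpoint and field names are illustrative only, not the real API:

```python
# Illustrative only: the actual endpoint and field names of the internal API differ.
import requests

payload = {
    "order_id": "order-42",                       # so the callback can be matched to the request
    "callback_url": "https://example.com/hook",   # where the structured JSON (or an error) is posted
    "files": [                                    # photo URLs, ordered by page number
        "https://example.com/scans/page-1.jpg",
        "https://example.com/scans/page-2.jpg",
    ],
}

response = requests.post("https://api.example.com/comprehension", json=payload)
# 202 means the order was accepted; the JSON tree arrives later at callback_url
print(response.status_code)
```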
The outline I shared is built with a simple recursive array walk run on the root element to grab the element name and prepend tabs to visually show the sublevels.
It takes about 2 minutes for 15 pages of text.