RAG is not really a solution


Yes, please. Always interested in increasing semantic similarity efficiencies.

Sent via pm. It’s terse, let me know if you need more details.

1 Like

Got it! Also sent a response. Thanks!

Yes please, I would also be interested. It would be great if you could PM me.

Sent. It’s terse, so let me know if you need more details.

1 Like

Sure thing, you’re very welcome.

Thanks a lot for your message. I received it and have responded as well.

Good to know we all experiment with more efficient techniques.

I am currently experimenting with using GPT models to identify relevant semantic chunks, or “logical units” as I call them. As you indicate, it is neither the fastest nor exactly the cheapest way to do it. Initial results are promising, although I would probably consider moving to a fine-tuned model for even more nuanced identification of relevant semantic chunks, which would further minimize the need for manual review and adjustment of the chunks.

My ultimate goal is to automate the process as much as possible including the generation of relevant metadata. Still a long way to go to achieve that.

2 Likes

Yeah, back-propagating through chained heterogeneous models is the holy grail.

Probably just need a larger NN differentiable architecture? Not sure.

1 Like

In general, absent a very specific reason not to, you should share things publicly so the entire community may benefit from it and be part of the conversation around it.

Moving conversations to side-channels is against the spirit of the community forum.

5 Likes

Please keep us posted on your progress. I’ve been thinking it through, and so far the hardest part seems to be coming up with a prompt to explain it. The models aren’t good at counting (as I’ve found out the hard way), so getting the model to understand that it needs to limit chunks to a certain number of characters is one thing. Getting it to return the original texts (and not summaries or re-wordings) is another. And then there’s coming up with a format that your code can read and use to actually create the chunks…

So, yes, inquiring minds would certainly like to know how this works out.
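
One hedged workaround for the counting and re-wording problems: don’t ask the model for the chunk text at all, only for character offsets, and slice the original string in code. The sketch below is a minimal illustration; `call_model()` is a placeholder for whatever chat-completion call you use, and the prompt wording and JSON shape are assumptions, not a tested recipe.

```python
import json

def build_prompt(max_chars: int) -> str:
    # The model returns only boundaries, never rewritten text.
    return (
        "Split the document below into semantically coherent chunks. "
        "Return ONLY a JSON list of objects of the form "
        '{"start": <int>, "end": <int>, "title": <str>}, '
        "where start and end are character offsets into the document. "
        f"Aim for chunks of at most {max_chars} characters, "
        "but never split mid-sentence."
    )

def call_model(prompt: str, document: str) -> str:
    """Placeholder for whatever chat-completion call you use."""
    raise NotImplementedError

def chunk_document(document: str, max_chars: int = 3000) -> list[dict]:
    raw = call_model(build_prompt(max_chars), document)
    boundaries = json.loads(raw)
    chunks = []
    for b in boundaries:
        # Slice the original text ourselves, so no summarizing or
        # re-wording by the model can sneak into the stored chunks.
        text = document[b["start"]:b["end"]]
        if text.strip():
            chunks.append({"title": b.get("title", ""), "text": text})
    return chunks
```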

1 Like

Your point is well taken. I’ll share more tonight.

3 Likes

Great approach. Personally, I think RAG is not the answer to everything, especially if the structure/storage/retrieval were not thought through enough.

I ended up using something like this (a rough sketch of the tree-building steps follows the list):

  1. Get raw text formatted by semantic chunks
  2. Extract main idea/purpose from each chunk
  3. Build hierarchical tree from the chunks
  4. Extract storable data from the tree
  5. Import tree, data and meta info into vector DB
  6. Pre-process query to identify solution algorithm and isolate subtasks
  7. Analyse subtasks to identify what type of data is needed as context for the task
  8. Find data from DB and perform each task separately
  9. Combine all results (often with filtered context) into an algorithm prompt
  10. Run the algorithm prompt with all the context
  11. (Often run filters on the final result)
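
Roughly, steps 1–4 can be sketched like this (a minimal illustration of the data structures only, assuming each chunk already carries a heading level; the main-idea extraction itself is left out):

```python
from dataclasses import dataclass, field

@dataclass
class ChunkNode:
    title: str   # main idea / purpose extracted from the chunk (step 2)
    text: str    # original semantic chunk (step 1)
    level: int   # heading depth used to build the hierarchy (step 3)
    children: list["ChunkNode"] = field(default_factory=list)

def build_tree(chunks: list[dict]) -> ChunkNode:
    """Fold a flat list of {'title', 'text', 'level'} chunks into a parent-child tree."""
    root = ChunkNode(title="ROOT", text="", level=0)
    stack = [root]
    for c in chunks:
        node = ChunkNode(c["title"], c["text"], c["level"])
        # Pop until the nearest ancestor with a smaller level is on top of the stack.
        while len(stack) > 1 and stack[-1].level >= node.level:
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
    return root

def to_records(node: ChunkNode, path: tuple[str, ...] = ()) -> list[dict]:
    """Flatten the tree into storable records with their hierarchical path (step 4)."""
    records = []
    for child in node.children:
        child_path = path + (child.title,)
        records.append({"path": " > ".join(child_path),
                        "title": child.title,
                        "text": child.text})
        records.extend(to_records(child, child_path))
    return records
```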
1 Like

See an example of how the machine builds an outline from semantic chunks extracted from the raw text (OCR) of a shareholders’ agreement. These are just the titles given by the machine to the chunks, organized parent-child from the raw text (what I get after these steps is a nested tree of JSON objects with the data I need):

SHAREHOLDERS’ AGREEMENT (PACTE D’ASSOCIÉS)
   Preamble and purpose of the shareholders’ agreement
     Identification of the contracting parties:
       - Identification of the company “HOLDING DMS” with representation details ...
       - Identification and representation of the company SAS DEVELOPPEMENT ...
       - Identification of the company SAEM LES SAISIES VILLAGES TOURISME ...
       - Identification and representation of the company SPL DOMAINES SKIABLES DES SAISIES ...
       - Identification and representation of the “Family Office” company ...
     Definition of the contracting parties as “Associés” (Partners),
     Identification and representation of the company SH LES SAISIES at the time of the agreement
     Identification of the company SH LES SAISIES and of its representative ...,
     Preamble on the incorporation of the company SH LES SAISIES and its objectives:
       Preliminary statement on the incorporation of the company SH LES SAISIES and the ownership structure of the subsidiary “LES CHALLIERS”.
     Estimate of the total construction cost of the tourist complex:
     Financing structure of the construction operation:
     Interest of MGM EXPLOITATION in managing the tourist complex, conditional on a management contract.
     Estimate of the furnishings investment budget for operating the tourist complex.
     Formalization of the partnership for the creation and operation of the “Les Challiers” tourist complex.
     Preamble and purpose of the Shareholders’ Agreement for the creation of the “Les Challiers” tourist complex:
   ARTICLE 1 – STRUCTURING OF THE PARTNERSHIP
     1.1 – Incorporation of a simplified joint-stock company and allocation of the share capital:
     1.2 - Incorporation and allocation of the capital of the real-estate company “LES CHALLIERS”:
     1.3 Financing terms for the construction of the tourist complex.
     1.4 - Post-construction lease terms between LES CHALLIERS and SH LES SAISIES, with reference to the annex.
     1.5 - Initial financing for fit-out and launch of operations of SH LES SAISIES.
   ARTICLE 2 – FINANCING NEEDS
     Commitment of the Partners to the initial financing of SH LES SAISIES and its subsidiary LES CHALLIERS through shareholder current-account contributions:
       - Current-account contribution commitment by SPL DOMAINES SKIABLES DES SAISIES
       - Financial participation of HOLDING DMS
       - Financial participation of SAS DEVELOPPEMENT
       - Current-account contribution by Family Office to the company’s capital
     Specific current-account contribution by SPL DOMAINES SKIABLES DES SAISIES for the company SH LES SAISIES;
     Procedure for capital calls addressed by the President to the Partners.
     Repayment terms for partners’ current-account advances and conditions of non-solicitation before the 5th anniversary of the opening to the public.
     Remuneration terms for current-account advances based on the tax-deductible interest rate.
     Obligation to sell shareholdings in case of failure to honor capital calls.
   ARTICLE 3 – TRANSACTIONS ON SECURITIES
     Prohibition of security interests over the Company’s Securities for the duration of the agreement.
     Definition of the term “Titres” (Securities) within the Agreement
       (i) Definition and examples of securities issued by the Company,
       (ii) Allocation or subscription rights for securities or similar instruments
       (iii) Inclusion of all securities issued by the Company in the partners’ rights.
     Approval clause for the transfer of securities to third parties, requiring an extraordinary collective decision of the partners.
     Exhaustive definition of the term “Cession” (Transfer) in the Agreement
     Definition of the term “Tiers” (Third Party), excluding signatories and entities controlled within the meaning of Article L 233-3 of the French Commercial Code.
     Preferential subscription right of the Partners in the event of a capital increase.
     Prohibition of non-compete commitments upon the Transfer of Securities.
     3.1 - Inalienability of the securities
       Clause of inalienability of the securities for

BTW, I forgot to note that the models used to build the tree from the raw text were trained as “general legal doc parsers” on only 60 contracts and had never seen this type of contract.

The class properties, their names (and, as a result, the order of the properties in the class) and the module configuration also play a VERY important role in embedding and retrieval in Weaviate… But I do confirm: Weaviate is the best so far.
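
To make that concrete, here is a minimal sketch of the kind of class definition I mean, using the v3 Python client; the class name, the property names and the text2vec-openai module settings are illustrative assumptions, not my actual schema.

```python
import weaviate

client = weaviate.Client("http://localhost:8080")

# Property order and names matter: vectorized text properties are concatenated
# (optionally prefixed with their property names) before being embedded.
legal_chunk_class = {
    "class": "LegalChunk",              # illustrative class name
    "vectorizer": "text2vec-openai",
    "moduleConfig": {
        "text2vec-openai": {"vectorizeClassName": False},
    },
    "properties": [
        {
            "name": "title",
            "dataType": ["text"],
            "moduleConfig": {
                "text2vec-openai": {"skip": False, "vectorizePropertyName": True},
            },
        },
        {
            "name": "body",
            "dataType": ["text"],
            "moduleConfig": {
                "text2vec-openai": {"skip": False, "vectorizePropertyName": False},
            },
        },
        {
            "name": "parentPath",       # hierarchy info kept as filterable metadata only
            "dataType": ["text"],
            "moduleConfig": {
                "text2vec-openai": {"skip": True},
            },
        },
    ],
}

client.schema.create_class(legal_chunk_class)
```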

2 Likes

There is an algorithm I came up with and mentioned in a post here a while back (can’t find it ATM). The idea is that you start with a semantic- or keyword-centroided chunk, then expand and offset it one sentence/paragraph, etc. at a time, and embed each expansion to match against the incoming query.

You then declare the matching chunk to be the one with the highest semantic similarity to the input.

So you are searching dynamically, to within your offset and radius parameters, and finding the best chunk on the fly.

This takes time, so you could do this after each query to optimize your chunking for future results, or if you have the time, you can have the answer wait until the optimal chunk is found.

This does require the notion of a dynamically indexed, continuous, floating chunk of data. So you need to index where you are in the corpus, and your starting and ending indices amongst the larger corpus, even though you may start with non-optimized crude chunks at the start of this process.

You could even use multiple models for this as well, to average through any embedding model specific biases.
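
A minimal sketch of that expand-and-compare loop over sentence windows, assuming an `embed()` placeholder for whatever embedding model(s) you use and a crude seed chunk to start from:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for your embedding model call."""
    raise NotImplementedError

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_floating_chunk(sentences: list[str], query: str,
                        seed_start: int, seed_end: int,
                        max_offset: int = 3, max_radius: int = 5) -> tuple[int, int, float]:
    """Slide and grow a sentence window around a crude seed chunk, and keep the
    window whose embedding is most similar to the incoming query."""
    q = embed(query)
    best = (seed_start, seed_end, -1.0)
    for offset in range(-max_offset, max_offset + 1):
        for radius in range(max_radius + 1):
            start = max(0, seed_start + offset - radius)
            end = min(len(sentences), seed_end + offset + radius)
            if start >= end:
                continue
            window = " ".join(sentences[start:end])
            score = cosine(embed(window), q)
            if score > best[2]:
                best = (start, end, score)
    return best  # start/end indices into the corpus plus the similarity score
```

The start/end indices returned are what you would persist, so a later query can begin from a better seed instead of the crude initial chunk.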

2 Likes

@anon22939549 and all,
So, here goes.

First, let’s start with the external data source. Say each chunk / small text file is somewhere from 1k to 3k in size, all in one directory, and such a small chunk / file may have about 400 words; turning each word into a token would give roughly 400 tokens. Then going from tokens to vectors is another process.

What if you use a reliable library to automatically summarize each chunk / small text file into a two-to-three-sentence summary of only 50 to 60 words? You then use a library like BERT to tokenize each summary and vectorize it as well. During the process, create a summary_text_to_its_corresponding_vector mapping or index. Also, I should mention, earlier you would also need a mapping from each tiny summary file to its actual chunk / small file.
– In sum, the summarization step increases efficiency.

Now, when a user enters a query, you tokenize and vectorize it, then use cosine similarity (or cosine distance) to find the nearest vector in the tiny-summaries vector database or storage, and use the earlier mapping or index to retrieve the content of the actual chunk / small file.

Now, you’ve completed the “Retrieval” process.
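
A minimal end-to-end sketch of the above, with `summarize()` and `embed()` left as placeholders for whichever summarization library and BERT-style encoder you pick (the function names and the in-memory index are assumptions, not any specific library’s API):

```python
import numpy as np

def summarize(text: str) -> str:
    """Placeholder: two-to-three-sentence summary (~50-60 words) of a chunk."""
    raise NotImplementedError

def embed(text: str) -> np.ndarray:
    """Placeholder: BERT-style embedding of a short text."""
    raise NotImplementedError

def build_index(chunk_files: dict[str, str]) -> list[dict]:
    """chunk_files maps file path -> full chunk text.
    Each record links summary vector -> summary -> original file (the mappings above)."""
    index = []
    for path, text in chunk_files.items():
        summary = summarize(text)
        index.append({"path": path, "summary": summary, "vector": embed(summary)})
    return index

def retrieve(query: str, index: list[dict], chunk_files: dict[str, str]) -> str:
    """Embed the query, find the nearest summary vector by cosine similarity,
    and return the ORIGINAL chunk text via the path mapping."""
    q = embed(query)
    def cos(v: np.ndarray) -> float:
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    best = max(index, key=lambda r: cos(r["vector"]))
    return chunk_files[best["path"]]
```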

Notes:

  1. Ensure each of the small chunks / small text files is a single semantic unit. Otherwise, if one chunk contains two or three different themes or topics, it will likely compromise retrieval quality.
  2. Verify that the quality of the summary extraction is pretty good.

A side note:
It seems some developers are already using this technique. I’m not surprised. Hope it helps.

1 Like

Interesting. Hope some researcher with resources can turn it into an algorithm. FYI, I’m a consultant.

Yes, a few hurdles to cross here.

The return of original texts I’ve managed successfully so far.

One of the difficult parts is that, due to the limits on output tokens, you still have to pre-chunk your documents before they can be processed by the model. This requires a bit of a separate strategy.

The other point is to properly manage the process of automatically having the model identify the document architecture and hierarchy, and then to preserve the mapping between the automatically identified semantic chunks and the document structure, so that this information can be incorporated during the vector embedding process.
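
One hedged way to preserve that mapping is to carry the detected hierarchy along as metadata on every chunk, so it survives into the embedding store; the field names below are illustrative, not a fixed schema.

```python
def prepare_for_embedding(doc_id: str, chunks: list[dict]) -> list[dict]:
    """Attach the detected document structure to each semantic chunk as metadata,
    so the chunk-to-hierarchy mapping survives into the vector store.

    Each input chunk is assumed to look like {"path": [section titles...], "text": "..."}.
    """
    records = []
    for i, chunk in enumerate(chunks):
        records.append({
            "doc_id": doc_id,
            "section_path": " > ".join(chunk["path"]),  # e.g. "ARTICLE 2 > capital calls"
            "order": i,                                  # position within the document
            "text": chunk["text"],                       # original chunk, unmodified
            # Embed section_path + text together, or keep section_path as a filterable field.
        })
    return records
```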

Anyway - I guess it would not be fun if it was all too easy!

Hopefully a few weeks down the road I have a few more updates to share on this.

1 Like

For layout parsing (managing the process of automatically having the model identify the document architecture and hierarchy), I convert the PDF into markdown using Azure Doc AI and then do a markdown split on the headers and sections. Adobe Extract API is another solution. Are there any other good tools for layout parsing? With unstructured, I could not make it work.
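
For the markdown-splitting part, here is a minimal dependency-free sketch (a regex over ATX headers; the exact grouping of sections is my assumption of what “splitting on the headers and sections” means):

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")

def split_markdown_by_headers(md: str) -> list[dict]:
    """Split markdown (e.g. produced by Azure Doc AI) into sections, keeping the
    header path (H1 > H2 > ...) alongside each section's body text."""
    sections, path, buffer = [], [], []

    def flush():
        body = "\n".join(buffer).strip()
        if body:
            sections.append({"path": " > ".join(path), "text": body})
        buffer.clear()

    for line in md.splitlines():
        m = HEADER_RE.match(line)
        if m:
            flush()                        # close the section under the previous path
            level, title = len(m.group(1)), m.group(2).strip()
            del path[level - 1:]           # drop headers at this depth or deeper
            path.append(title)
        else:
            buffer.append(line)
    flush()
    return sections
```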

1 Like