RAG is not really a solution

Yes, please. Always interested in increasing semantic similarity efficiencies.

Sent via pm. It’s terse, let me know if you need more details.

1 Like

Got it! Also sent a response. Thanks!

Yes please, I will also be interested. It would be great if you could PM me.

Sent. It's terse, so let me know if you need more details.

1 Like

Sure thing, you’re very welcome.

Thanks a lot for your message. I received it and have responded as well.

Good to know we all experiment with more efficient techniques.

I am currently experimenting with using GPT models to identify relevant semantic chunks, or “logical units” as I call them. As you indicate, it is neither the fastest nor exactly the cheapest way to do it. Initial results are promising, although I would probably consider moving to a fine-tuned model for even more nuanced identification of relevant semantic chunks, which would further reduce the need for manual review and adjustment of the chunks.

My ultimate goal is to automate the process as much as possible including the generation of relevant metadata. Still a long way to go to achieve that.
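
Roughly, what I'm experimenting with looks like this (a minimal sketch; the model name, prompt wording, and JSON shape below are placeholders, not my exact setup):

```python
# Hedged sketch of GPT-driven "logical unit" extraction. The model name, prompt
# wording, and JSON shape are assumptions; requires the openai>=1.x client and an
# OPENAI_API_KEY in the environment.
import json
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Split the text below into self-contained logical units.\n"
    'Return JSON of the form {"units": [{"title": "...", "topic": "...", '
    '"text": "<verbatim excerpt>"}]}.\n'
    "Copy each unit's text verbatim; do not summarize or reword.\n\nTEXT:\n"
)

def extract_logical_units(raw_text: str) -> list[dict]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",                      # placeholder model name
        response_format={"type": "json_object"},  # ask for parseable JSON output
        messages=[{"role": "user", "content": PROMPT + raw_text}],
    )
    return json.loads(response.choices[0].message.content)["units"]
```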

1 Like

Yeah, back-propagating through chained heterogeneous models is the holy grail.

Probably you'd just need a larger differentiable NN architecture? Not sure.

In general, absent a very specific reason not to, you should share things publicly so the entire community may benefit from it and be part of the conversation around it.

Moving conversations to side-channels is against the spirit of the community forum.

4 Likes

Please keep us posted on your progress. I’ve been thinking it through, and so far the hardest part seems to be coming up with a prompt to explain it. The models aren’t good at counting (I’ve found out the hard way), so trying to get it to understand it needs to limit to a certain number of characters is one thing. Then getting it to return the original texts (and not summaries or re-wordings) is another. And then coming up with a format that your code can read and use to actually create the chunks…
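
One workaround I've been considering for the counting problem is to stop trusting the model to count and validate its output in code instead; a minimal sketch (the list-of-strings chunk format and the 1,500-character budget are just assumptions):

```python
# Sketch: instead of trusting the model to count characters, validate its proposed
# chunks in code. The list-of-strings format and 1,500-character budget are assumptions.

MAX_CHARS = 1500

def validate_chunks(source: str, chunks: list[str]) -> list[str]:
    """Keep only verbatim excerpts of `source`, hard-splitting anything over the budget."""
    accepted: list[str] = []
    for chunk in chunks:
        if chunk not in source:        # reworded or summarized by the model -> reject
            continue
        while len(chunk) > MAX_CHARS:  # model ignored the length limit -> split in code
            accepted.append(chunk[:MAX_CHARS])
            chunk = chunk[MAX_CHARS:]
        accepted.append(chunk)
    return accepted
```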

So, yes, inquiring minds would certainly like to know how this works out.

1 Like

Your point is well taken. I'll share more tonight.

3 Likes

Great approach. Personally, I think RAG is not the answer to everything, especially if the structure/storage/retrieval has not been thought through well enough.

I ended up using something like this (see the code sketch after the list):

  1. Get raw text formatted by semantic chunks
  2. Extract the main idea/purpose from each chunk
  3. Build a hierarchical tree from the chunks
  4. Extract storable data from the tree
  5. Import the tree, data, and meta info into a vector DB
  6. Pre-process the query to identify the solution algorithm and isolate subtasks
  7. Analyse the subtasks to identify what type of data is needed as context for each task
  8. Find the data in the DB and perform each task separately
  9. Combine all results (often with filtered context) into an algorithm prompt
  10. Run the algorithm prompt with all the context
  11. (Often run filters on the final result)
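
Roughly, the flow wires together like this (a bare skeleton: every helper here is just a placeholder name standing in for a model call or vector-DB operation; only the data flow between the steps is shown):

```python
# Bare skeleton of the eleven steps above. Every helper is a stub standing in for a
# model call or a vector-DB operation; only the data flow between the steps is real.

def split_into_semantic_chunks(raw_text): ...      # step 1: LLM/parser output
def extract_main_idea(chunk): ...                  # step 2: main idea / purpose
def build_hierarchy(chunks, ideas): ...            # step 3: parent-child tree
def extract_storable_data(tree): ...               # step 4: flatten tree to records
def store(tree, records, meta): ...                # step 5: import into the vector DB
def plan_solution(query): ...                      # step 6: {"algorithm": ..., "subtasks": [...]}
def required_context(subtask): ...                 # step 7: what data the subtask needs
def retrieve_and_run(subtask, needs): ...          # step 8: DB search + run the subtask
def run_llm(prompt): ...                           # step 10: final model call
def output_filters(text): ...                      # step 11: post-filters on the result

def ingest(raw_text, db_meta):
    chunks = split_into_semantic_chunks(raw_text)
    ideas = [extract_main_idea(c) for c in chunks]
    tree = build_hierarchy(chunks, ideas)
    store(tree, extract_storable_data(tree), db_meta)

def answer(query):
    plan = plan_solution(query)
    partials = [retrieve_and_run(t, required_context(t)) for t in plan["subtasks"]]
    prompt = plan["algorithm"] + "\n\nContext:\n" + "\n".join(partials)   # step 9
    return output_filters(run_llm(prompt))
```
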
1 Like

See an example of how the machine builds an outline from semantic chunks extracted from the raw text (OCR) of a shareholders’ agreement. These are just the titles given by the machine to the chunks, organized parent-child from the raw text (what I get after these steps is a nested tree of JSON objects with the data I need):

PACTE D’ASSOCIÉS (Shareholders’ Agreement)
   Preamble and purpose of the shareholders’ agreement
     Identification of the contracting parties:
       - Identification of the company “HOLDING DMS” with details of its representation ...
       - Identification and representation of the company SAS DEVELOPPEMENT ...
       - Identification of the company SAEM LES SAISIES VILLAGES TOURISME ...
       - Identification and representation of the company SPL DOMAINES SKIABLES DES SAISIES ...
       - Identification and representation of the “Family Office” company ...
     Definition of the contracting parties as “Partners” (Associés),
     Identification and representation of the company SH LES SAISIES at the time of the agreement
     Identification of the company SH LES SAISIES and of its representative ...,
     Preamble on the formation of the company SH LES SAISIES and its objectives:
       Preliminary statement on the formation of the company SH LES SAISIES and the ownership structure of the subsidiary “LES CHALLIERS”.
     Estimate of the total construction cost of the tourist complex:
     Structure of the financing of the construction operation:
     Interest of MGM EXPLOITATION in managing the tourist complex, conditional on a management contract.
     Estimate of the furnishings investment budget for operating the tourist complex.
     Formalization of the partnership for the creation and operation of the “Les Challiers” tourist complex.
     Preamble and purpose of the Shareholders’ Agreement for the creation of the “Les Challiers” tourist complex:
   ARTICLE 1 – STRUCTURING OF THE PARTNERSHIP
     1.1 – Formation of a simplified joint-stock company (SAS) and allocation of the share capital:
     1.2 - Formation and allocation of the capital of the real-estate company (SCI) “LES CHALLIERS”:
     1.3 Terms of financing of the construction of the tourist complex.
     1.4 - Post-construction lease terms between LES CHALLIERS and SH LES SAISIES, with reference to the annex.
     1.5 - Initial financing for fitting out and launching the operations of SH LES SAISIES.
   ARTICLE 2 – FINANCIAL REQUIREMENTS
     Commitment of the Partners to the initial financing of SH LES SAISIES and its subsidiary LES CHALLIERS through shareholder current-account contributions:
       - Commitment to a current-account contribution by SPL DOMAINES SKIABLES DES SAISIES
       - Financial participation of HOLDING DMS
       - Financial participation of SAS DEVELOPPEMENT
       - Current-account contribution by Family Office to the company's capital
     Specific current-account contribution by SPL DOMAINES SKIABLES DES SAISIES for the company SH LES SAISIES;
     Procedure for capital calls addressed by the President to the Partners.
     Terms of repayment of partners' current-account advances and conditions of non-solicitation before the 5th anniversary of the opening to the public.
     Terms of remuneration of current-account advances at the tax-deductible interest rate.
     Obligation to sell one's holdings in case of failure to meet capital calls.
   ARTICLE 3 – TRANSACTIONS IN SECURITIES
     Prohibition of security interests over the Company's Securities for the duration of the agreement.
     Definition of the term “Securities” (Titres) for the purposes of the Agreement
       (i) Definition and examples of securities and transferable instruments issued by the Company,
       (ii) Allotment or subscription rights to transferable securities or similar instruments
       (iii) Inclusion of all transferable securities issued by the Company in the partners' rights.
     Approval clause for the transfer of securities to third parties, requiring an extraordinary collective decision of the partners.
     Exhaustive definition of the term “Transfer” (Cession) in the Agreement
     Definition of the term “Third Party”, excluding signatories and entities controlled within the meaning of Article L 233-3 of the French Commercial Code.
     Preferential subscription right of the Partners in the event of a capital increase.
     Prohibition of non-compete undertakings upon a Transfer of Securities.
     3.1 - Inalienability of the securities
       Inalienability clause for the securities for

BTW, I forgot to note that the models used to build the tree from raw text were trained as “general legal doc parsers” on only 60 contracts and had never seen this type of contract.

The class properties, their names (and, as a result, the order of properties in the class) and the module configuration also play a VERY important role in embedding and retrieval in Weaviate… But I do confirm: Weaviate is the best so far.
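
To illustrate the point about property names, their order, and module configuration, here is a hedged sketch of a class definition in the v3-style Weaviate Python client (the class and property names are made up, and the client API has since evolved, so check the current docs):

```python
# Hedged sketch (Weaviate Python client v3 style): a made-up class showing where
# property names, property order, and module configuration come into play.
# With a text2vec module, property values are concatenated in property order to
# form the text that gets embedded, which is why the ordering matters.
import weaviate

client = weaviate.Client("http://localhost:8080")  # placeholder URL

chunk_class = {
    "class": "ContractChunk",                      # made-up class name
    "vectorizer": "text2vec-openai",
    "moduleConfig": {"text2vec-openai": {"vectorizeClassName": False}},
    "properties": [
        # Most semantically important fields first, since they lead the vectorized text.
        {"name": "title", "dataType": ["text"]},
        {"name": "body", "dataType": ["text"]},
        # Structural info kept as metadata only, excluded from the vector.
        {"name": "parentPath", "dataType": ["text"],
         "moduleConfig": {"text2vec-openai": {"skip": True}}},
    ],
}

client.schema.create_class(chunk_class)
```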

1 Like

There is an algorithm I came up with and mentioned in a post here a while back (can’t find it ATM), but the idea is that you start with a semantic or keyword-centroided chunk, then expand and offset it one sentence/paragraph, etc., at a time, and embed this search expansion to match the incoming query.

You then declare the matching chunk to be the one with the highest semantic similarity to the input.

So you are searching dynamically, to within your offset and radius parameters, and finding the best chunk on the fly.

This takes time, so you could do this after each query to optimize your chunking for future results, or if you have the time, you can have the answer wait until the optimal chunk is found.

This does require the notion of a dynamically indexed, continuous, floating chunk of data. So you need to index where you are in the corpus, and your starting and ending indices amongst the larger corpus, even though you may start with non-optimized crude chunks at the start of this process.

You could even use multiple models for this as well, to average through any embedding model specific biases.
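
A minimal sketch of that expanding-window search, assuming sentence-level granularity and a placeholder embed() function (swap in one or several real embedding models):

```python
# Sketch of the expanding-window search. Start from a seed sentence (the crude
# centroid chunk), grow/offset the window within a radius, embed each candidate,
# and keep the window most similar to the query. embed() is a placeholder for
# whichever embedding model (or ensemble of models) you use.
import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError  # plug in your embedding model(s) here

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_window(sentences: list[str], seed: int, query: str, radius: int = 5):
    """Return (start, end, score): indices into the corpus of the best floating chunk."""
    q = embed(query)
    best = (seed, seed + 1, -1.0)
    for start in range(max(0, seed - radius), seed + 1):
        for end in range(seed + 1, min(len(sentences), seed + 1 + radius) + 1):
            score = cosine(embed(" ".join(sentences[start:end])), q)
            if score > best[2]:
                best = (start, end, score)
    return best
```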

2 Likes

@elmstedt and all,
So, here goes.

First, let’s start with the external data source. Say each chunk is a small text file of roughly 1k to 3k in size, all in one directory, and such a small chunk/file might contain about 400 words; turning each word into a token gives roughly 400 tokens. Then going from tokens to vectors is another process.

What if you use a reliable library that automatically summarizes each chunk/small text file into a two-to-three-sentence summary of only 50 to 60 words? Now you use a model like BERT to tokenize each summary and vectorize it as well. During this process, create a summary_text_to_its_corresponding_vector mapping or index. Also, I should mention, earlier you would also need a mapping from each tiny summary file to its actual chunk/small file.
– in sum, the summarization step would increase efficiency.

Now, when a user enters a query, you tokenize and vectorize it, then use cosine similarity (or cosine distance) to find the nearest vector in the tiny-summaries vector database or storage, and use the earlier mapping or index to retrieve the content of the actual chunk/small file.

Now, you’ve completed the “Retrieval” process.
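
A minimal sketch of the whole idea, assuming one chunk per .txt file and placeholder summarize()/embed() functions (e.g., a BERT-based summarizer plus any sentence embedder):

```python
# Sketch of the summary-index retrieval. summarize() and embed() are placeholders
# (e.g. a BERT-based summarizer and any sentence embedder); one chunk per .txt file.
from pathlib import Path
import numpy as np

def summarize(text: str) -> str:
    raise NotImplementedError  # 2-3 sentence summary of the chunk

def embed(text: str) -> np.ndarray:
    raise NotImplementedError  # tokenize + vectorize the summary or query

def build_index(chunk_dir: str) -> list[tuple[np.ndarray, Path]]:
    """Map each summary vector back to the file holding the original chunk."""
    index = []
    for path in Path(chunk_dir).glob("*.txt"):
        index.append((embed(summarize(path.read_text())), path))
    return index

def retrieve(query: str, index: list[tuple[np.ndarray, Path]]) -> str:
    """Nearest summary by cosine similarity, but return the full original chunk."""
    q = embed(query)
    cos = lambda v: float(v @ q / (np.linalg.norm(v) * np.linalg.norm(q)))
    _, best_path = max(index, key=lambda pair: cos(pair[0]))
    return best_path.read_text()
```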

Notes:

  1. Ensure each of the small chunks / small text files is a single semantic unit. Otherwise, if one chunk contains two or three different themes or topics, it will likely compromise retrieval quality.
  2. Verify that the quality of the summary extraction is good.

A side note:
It seems some developers are already using this technique. I’m not surprised. Hope it helps.

1 Like

Interesting. Hope some researcher with resources can turn it into an algorithm. FYI, I’m a consultant.

Yes, a few hurdles to cross here.

The return of original texts I’ve managed successfully so far.

Some of the difficult parts are that, due to the limits on output tokens, you still have to pre-chunk your documents in order for the model to process them. This requires a bit of a separate strategy.

The other point is to properly manage the process of having the model automatically identify the document architecture and hierarchy, and then to preserve the mapping between the automatically identified semantic chunks and the document structure, so that this information can be incorporated during the vector-embedding process.
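
One way to preserve that chunk-to-structure mapping (a sketch; the field names are just my guess at what the document architecture could include) is to carry the structural path with each chunk so it lands in the vector store as metadata:

```python
# Sketch: carry the document's structural position with every semantic chunk so the
# hierarchy survives as metadata at embedding time. Field names are illustrative only.
from dataclasses import dataclass

@dataclass
class ChunkRecord:
    text: str                  # the verbatim semantic chunk
    doc_id: str
    section_path: list[str]    # e.g. ["ARTICLE 2", "2.1", "Capital calls"]
    order: int                 # position within the parent section

def to_metadata(rec: ChunkRecord) -> dict:
    """Flatten the hierarchy into metadata most vector stores accept alongside the vector."""
    return {
        "doc_id": rec.doc_id,
        "section_path": " > ".join(rec.section_path),
        "order": rec.order,
    }
```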

Anyway - I guess it would not be fun if it was all too easy!

Hopefully a few weeks down the road I have a few more updates to share on this.

1 Like

For the layout parsing (managing the process of automatically having the model identify the document architecture and hierarchy), I convert the PDF into markdown using Azure Doc AI and then do a markdown split on the headers and sections. The Adobe Extract API is another solution. Are there any other good tools for layout parsing? With Unstructured, I could not make it work.
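
For the markdown-splitting step itself, one option is LangChain's MarkdownHeaderTextSplitter; a hedged sketch, assuming the Azure output has already been saved as markdown:

```python
# Hedged sketch: split Azure-exported markdown on its headers so each section becomes
# a chunk that keeps its heading path as metadata. Assumes the langchain-text-splitters
# package and that the markdown was saved to contract.md (illustrative file name).
from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [("#", "h1"), ("##", "h2"), ("###", "h3")]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

with open("contract.md") as f:
    sections = splitter.split_text(f.read())

for doc in sections:
    print(doc.metadata, len(doc.page_content))  # heading path + section length
```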

1 Like