I have several pieces of text extracted from legal documents; they are essentially examples of the items I am looking to extract from unseen documents. There are 3 labels (multilabel), so each item looks like ‘text: extracted text, label: A’.
I would like to use this information to form a better query. I have seen approaches that build embeddings to answer questions about the data, but my approach should instead use the gathered information to become better at finding these items in unseen text.
Is fine-tuning a way to go here? And is there a way to incorporate the text embeddings of my gathered data into my query, along the lines of “find me the text most similar to these embeddings”?
Based on how you have described your problem, it might not even need GPT tbh. The same way you are creating embeddings of the legal documents, you can create embeddings for the unseen text as well and then compare the two using cosine similarity. This will return similarity scores between the two pieces of embedded text, with a higher score indicating more similarity.
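As a rough sketch (assuming OpenAI's text-embedding-ada-002 and numpy; the example texts are placeholders), it could look like this:

```python
import numpy as np
import openai

def embed(texts):
    # Embed a list of strings with OpenAI's embedding endpoint
    resp = openai.Embedding.create(input=texts, model="text-embedding-ada-002")
    return np.array([d["embedding"] for d in resp["data"]])

def cosine_similarity(a, b):
    # Row-wise cosine similarity between two matrices of vectors
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Your annotated extractions and the sentences of the unseen document (placeholders)
labeled_items = ["Extracted clause for label A ...", "Extracted clause for label B ..."]
unseen_sentences = ["First sentence of the new document ...", "Second sentence ..."]

scores = cosine_similarity(embed(labeled_items), embed(unseen_sentences))
# scores[i, j] = similarity between annotated item i and unseen sentence j
```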
I totally support @udm17 in this case: embeddings would be the way to go if you want a good solution! But it might be a bit more complex.
One simple way you could also achieve that would be to provide some context via the system message. Describe in there how the categorization should be done, and then continue with your existing solution over the API (see the rough sketch below).
However, if you are able to, follow what @udm17 said, as that would be the best approach and would also make better use of the tokens you spend.
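A minimal sketch of what that could look like (assuming the ChatCompletion endpoint; the model name and label descriptions are placeholders you would fill in yourself):

```python
import openai

system_message = (
    "You are reviewing legal documents. Classify the passage as label A, B, C, or none. "
    "Label A means ..., label B means ..., label C means ... (describe your categories here)."
)

resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": "Passage: <sentence from the unseen document>"},
    ],
)
print(resp["choices"][0]["message"]["content"])
```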
Thank you for your replies. I embedded both the dataset (the previously extracted and annotated items) and the corpus (the unseen document). Then I went through all entries of the dataset (as queries) and semantically searched for the 5 closest hits in the unseen document.
I think this is similar to what you are suggesting? I still have the feeling that this approach is rather “raw” and that some tweaks could push it in a better direction. I tried gathering the suggested nearest results from the unseen document but did not find a good metric to post-process them.
Also, this approach goes through each embedding of the dataset to find the 5 (or x) most similar sentences in the unseen document. I would rather incorporate all embeddings of the dataset (per label) and create a single, more universally applicable query.
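One possible reading of that “unified query” idea, as a sketch (assuming the dataset embeddings are already available as numpy arrays grouped by label; the array shapes below are placeholders):

```python
import numpy as np

# One array of embeddings per label, e.g. built from the annotated items
dataset_embeddings = {
    "A": np.random.rand(100, 1536),  # placeholder arrays; use your real embeddings here
    "B": np.random.rand(120, 1536),
    "C": np.random.rand(60, 1536),
}

def label_centroids(embeddings_by_label):
    # Average all embeddings of one label into a single normalized query vector
    centroids = {}
    for label, vectors in embeddings_by_label.items():
        mean = vectors.mean(axis=0)
        centroids[label] = mean / np.linalg.norm(mean)
    return centroids

centroids = label_centroids(dataset_embeddings)
# Each centroid can now serve as one query per label against the unseen document,
# instead of running a separate search for every individual example.
```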
Good summary of your goal. Before I say anything more, can you tell me - are you a domain expert who wants to be the developer of this solution? Or, are you a developer? Just trying to suss out your incentives for this solution.
Okay - thanks. This helps me shape my comments a little better. @udm17 is spot on for your approach.
The only additional comment I have is to consider the data model for embracing the sources of the corpus texts. This is sometimes a gnarly part of AI solution transformations. Embeddings are very useful for pointing to the spot where the gold lies. By themselves, though, similarities are only half the challenge; you still need the related content, either by value or by reference.
When using Pinecone (for example), I shape the data model to include meta-fields that contain summaries of the answers, if not the complete answers themselves. Meta-fields are not limited to text; they can also hold values and links. I’ve found that a vector store with lots of metadata can lessen the impact of additional token use at query time, because it lets you vector into a collection of high-ranking candidates and then perform last-mile refinements with filtering. Pinecone even has integrated filtering in its API, which lets you gather the similarities and filter in one step.
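For illustration only, with the pinecone-client it looks roughly like this (index name, dimensions, metadata fields and the filter are made-up examples, and the exact calls depend on your client version):

```python
import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")
index = pinecone.Index("legal-items")  # assumes an index of dimension 1536 already exists

# Placeholder vectors; in practice these come from your embedding model
embedding_a = [0.1] * 1536
embedding_b = [0.2] * 1536
query_embedding = [0.15] * 1536

# Upsert vectors together with metadata: label, source, even the answer text itself
index.upsert(vectors=[
    ("item-001", embedding_a, {"label": "A", "source": "contract_12.pdf", "text": "..."}),
    ("item-002", embedding_b, {"label": "B", "source": "contract_07.pdf", "text": "..."}),
])

# Gather similarities and filter in one step: top matches that also carry label A
results = index.query(
    vector=query_embedding,
    top_k=5,
    include_metadata=True,
    filter={"label": {"$eq": "A"}},
)
for match in results["matches"]:
    print(match["id"], match["score"], match["metadata"]["source"])
```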
This is confusing to me - can you give a complete example? I was expecting there to be labels A, B, and C.
Thank you all for your feedback. The multilabel part might be a misunderstanding: sometimes a text piece can be both A and B, which I am currently embedding as label C.
Pinecone sounds very interesting; do you have resources on the meta-fields? I am not 100% sure how to use this for my case. Is Pinecone also able to reduce all the example embeddings into one higher abstraction of the pattern? I was looking into LangChain to maybe map-reduce my embeddings. Anyone with experience on that?
That’s not its purpose. Rather, it is there only to easily locate similarities. The LLM is used to create vectors; Pinecone stores those vectors (and optionally, metadata); the LLM is then used to determine vectors for new queries which are then matched up with your “trained” vector database. Highest scoring results are then presented to the user, or the metadata of high-scoring items is optionally used in a prompt to craft a natural language result for the user. This is where LangChain might be used to create useful conversational outcomes.
Super interesting. I read up on Pinecone and I think I now understand your point. Using a Pinecone index, I add all my extracted data after computing the embeddings. Now I have a Pinecone index that I can run similarity search on. From my perspective there is a high number of sentences in the unseen documents, and I would compare each sentence with the most similar sentences in the Pinecone index.
Do you think that is a valid approach? I would have to add a certain similarity threshold in order to say whether the compared sentence from the unseen document belongs to one of the three labels (or none at all).
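Something like this is what I have in mind (just a sketch; the 0.85 threshold is an invented number I would still have to calibrate on annotated examples):

```python
SIMILARITY_THRESHOLD = 0.85  # invented value, to be tuned on held-out annotated examples

def label_sentence(sentence_embedding, index):
    # Query the Pinecone index of annotated items and take the single best match
    result = index.query(vector=sentence_embedding, top_k=1, include_metadata=True)
    if not result["matches"]:
        return None
    best = result["matches"][0]
    if best["score"] < SIMILARITY_THRESHOLD:
        return None  # not close enough to any annotated example: no label
    return best["metadata"]["label"]  # "A", "B" or "C", stored as metadata at upsert time
```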
Thinking about it some more: do you think it makes more sense to load the unseen document into a Pinecone index and then go over my “dataset”, getting the most similar extractions out of the index?
Good question. It might be useful to increase the training corpus. But this is generally not the intent. Typically, we use a known and well-vetted corpus to provide the vectors that matter most for the solution. Unseen sentences and queries are intended to be the measure of relevance to the known corpus, not the corpus itself. It’s possible there are some solutions that would warrant this.
Again, thank you so much for your feedback. That is also an interesting point. So you would say it is not the goal to extend the “training” corpus (now the indexed vectors in Pinecone)?
Also, I currently get pretty high similarities (~0.80) even with sentences that are not contextually present in my vector space. Do you have any experience with thresholding or other techniques for suppressing such weak matches?
Better stated - we always want to improve our vector databases, but I tend to follow a rigid workflow for doing this. I wrote about the general process here.
Using Airtable at my company, we gather analytics about questions put to the embedding system and how well the results perform. When we see a list of hits with relatively low similarity scores, it’s an indication that the corpus may be lacking in some way. My approach makes this simple to identify, and my app makes it even easier to embellish the corpus.
I don’t, but I will say that any single high-similarity match is not always a good (or complete) answer. Sometimes it requires mashing up the top three hits and sending them to GPT with the intent of summarizing the results from a range of relevant texts in the corpus.
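As a rough sketch of that mash-up (assuming the hits come back from Pinecone with the original text stored in metadata; the model and prompt wording are only examples):

```python
import openai

def summarize_top_hits(question, matches):
    # Combine the texts of the top three hits into one prompt and let GPT summarize them
    context = "\n\n".join(m["metadata"]["text"] for m in matches[:3])
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer using only the provided excerpts."},
            {"role": "user", "content": f"Excerpts:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp["choices"][0]["message"]["content"]
```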
I see, thank you. I think I will step back a bit and iteratively try to move from one vector per label to several very good representations of the labels I am looking for, rather than having all ~280 free texts embedded in Pinecone. Still, I am always open to different perspectives, and the idea of having one unified embedding that represents all free-text embeddings of a label really won’t let go of me.