Embedding and searching across similar embeddings

I have been reading through the forum on embeddings: saving and retrieving vectors, then using those retrieved embeddings and their context to answer queries.

I have been trying to build a simple web app for our internal users who work with legal documents (property searches, so a lot of similar documents for different properties). Following is my understanding of how I could implement a small proof-of-concept web portal:

  1. Extract and collect data from those legal documents. I have a JSON file created from those legal documents, so my data is in well-defined sections already.
  2. Get those sections for each report converted into embeddings and save those in a vector database (thinking about going with the popular option of Pinecone).
  3. Convert the user query into an embedding, find the most relevant match(es), get their string representation, and use those plain-text bits as context for my query, then use a completion to return an accurate and nice-looking answer to my user (rough sketch of the whole flow below).
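
In code, I imagine the whole flow looking roughly like this. Just a sketch, not working code yet; the index name `property-reports` and the metadata fields are placeholders I made up:

```python
import openai
import pinecone

openai.api_key = "..."
pinecone.init(api_key="...", environment="us-east1-gcp")
index = pinecone.Index("property-reports")  # hypothetical index name

# Step 2: embed each section of a report and store it
def store_section(section_id: str, text: str, property_ref: str):
    emb = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    vector = emb["data"][0]["embedding"]
    index.upsert(vectors=[(section_id, vector, {"text": text, "property_ref": property_ref})])

# Step 3: embed the query, retrieve the closest sections, answer with a completion
def answer(question: str) -> str:
    emb = openai.Embedding.create(model="text-embedding-ada-002", input=question)
    matches = index.query(vector=emb["data"][0]["embedding"], top_k=3, include_metadata=True)
    context = "\n\n".join(m["metadata"]["text"] for m in matches["matches"])
    chat = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return chat["choices"][0]["message"]["content"]
```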

I am slightly confused about a couple of bits and just wanted suggestions to validate my understanding and correct it if I need to.

  1. When I save embeddings for each chunk of each report, do I save the text representation of each vector in Pinecone too, or do I use another storage solution like an RDBMS to save the vector-to-string relationship? Is vector storage also meant for saving plain text?
  2. When my user asks a question, I know which property (i.e. which report) they are currently asking about, but to get the correct embeddings for that property, do I make the address/reference number of that property part of each embedding, or save it as a separate field/column in the Pinecone index? For example, a user may query "list all planning charges". Do I add the property address to each query, e.g. "list all planning charges for house number 1, street 1…"? Or is that going to get me a lot of false matches, since the address will match in every single embedding for that report, so my nearest-vector lookup may even ignore the "planning charges" bit and instead return "financial charges for house number 1, street 1…"?
  3. Is there a way I can filter my results from vector storage based on property address if I save that as plain text on the index for each embedding?
  4. Sometimes the large reports have a lot of information in each section, which means I can't send all of the relevant embeddings' text to OpenAI in one prompt due to the token limit. Is the recommended approach to send one call per embedding's content as context and then combine the results of each call, or should I merge chunks to make fewer calls? (A rough sketch of what I mean by merging is below.)
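
To illustrate question 4, this is roughly what I mean by merging chunks to make fewer calls. A sketch using tiktoken to keep each prompt under a token budget (the budget number is arbitrary):

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
MAX_CONTEXT_TOKENS = 3000  # leave room for the question and the answer

def pack_chunks(chunks: list[str]) -> list[str]:
    """Merge retrieved chunk texts into as few prompts as the token budget allows."""
    prompts, current, used = [], [], 0
    for chunk in chunks:
        n = len(enc.encode(chunk))
        if used + n > MAX_CONTEXT_TOKENS and current:
            prompts.append("\n\n".join(current))
            current, used = [], 0
        current.append(chunk)
        used += n
    if current:
        prompts.append("\n\n".join(current))
    return prompts
```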

Apologies for the long post but couldn't stop adding all this detail. :joy:

Process:

  1. Extract the text data from your legal documents.

  2. Choose an embedding model, e.g. text-embedding-ada-002 or a previous version.

  • You need to determine how much text from the documents is converted into embeddings (a summary, paragraphs, the whole document?); think about how to find the most related document rather than the generated answers. You may need to classify documents before turning them into vectors. Yes, defining a property field will help, for instance if you use Weaviate.

  3. Save the document embeddings in a database like Pinecone, Weaviate, …

  • Weaviate has simple and fuzzy search, which means you can filter based on raw text. To filter nested objects appropriately, you need to specify the filter at the level of the nested array, rather than using a reference from the parent object (see the sketch after this list).

  4. When a user query comes in, generate its embedding and find the closest match(es) in your database.

  5. Retrieve the text for those matches to use as context in your response.

  6. Filter any necessary legal terms.

  7. Evaluate and tune your system using train/test splits; models like gpt-4 can help with the evaluation.

  • You need to choose the right embedding model based on vector size, window size, and iterations, and tune these for the best performance; at the end, split the data into train/test sets to properly evaluate the performance of the system.
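
For the Weaviate `where` filtering mentioned under step 3, a query could look roughly like this. Just a sketch with the v3 Python client; the `ReportSection` class and `propertyAddress` field are assumed names, and the embedding call is only one way to get a query vector:

```python
import openai
import weaviate

openai.api_key = "..."
client = weaviate.Client("http://localhost:8080")

# Embed the user question (any embedding model works; ada-002 shown)
query_vector = openai.Embedding.create(
    model="text-embedding-ada-002", input="list all planning charges"
)["data"][0]["embedding"]

# Vector search limited to one property via a `where` filter on raw text
result = (
    client.query
    .get("ReportSection", ["text", "propertyAddress"])
    .with_near_vector({"vector": query_vector})
    .with_where({
        "path": ["propertyAddress"],
        "operator": "Equal",
        "valueText": "house 1, street 1",
    })
    .with_limit(3)
    .do()
)
```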

Hi,
I'm doing something similar. How did you go with comparing the query embedding against the embedding database?

Sorry, I haven't tried it yet. Hopefully I will try over the weekend and see what sort of results I get.

I am doing something very similar by embedding legal documents. I think most of your problems will be solved by looking at the format in which Pinecone wants you to submit the data; it's JSON.

  1. You can store the chunk text as metadata in Pinecone.
  2. You can also store other data as metadata. There is an option in the Pinecone query function to include metadata in the search; I suspect this will do the trick.
  3. Also done by using Pinecone's metadata (see the sketch below).
  4. I did not quite understand the problem here. You might want to try summarizing using gpt-3.5, though that obviously compromises the integrity of the data, which can be critical in a legal setting. Personally, I chunk the data into semantically significant pieces (e.g. 'articles', for the constitution), which seems to work well enough.
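
For points 1–3, the upsert and query could look roughly like this with the Pinecone Python client. Only a sketch: the index name, IDs, and metadata fields are made up:

```python
import openai
import pinecone

openai.api_key = "..."
pinecone.init(api_key="...", environment="us-east1-gcp")
index = pinecone.Index("property-reports")  # hypothetical index name

chunk_text = "Planning charges: ..."  # one semantically significant chunk

# Points 1 and 2: the chunk text and the property reference both go in metadata
vector = openai.Embedding.create(
    model="text-embedding-ada-002", input=chunk_text
)["data"][0]["embedding"]
index.upsert(vectors=[
    ("report-123-planning-0", vector, {
        "text": chunk_text,
        "property_ref": "house 1, street 1",
        "section": "planning_charges",
    }),
])

# Point 3: restrict the search to one property with a metadata filter
query_vector = openai.Embedding.create(
    model="text-embedding-ada-002", input="list all planning charges"
)["data"][0]["embedding"]
results = index.query(
    vector=query_vector,
    top_k=5,
    include_metadata=True,
    filter={"property_ref": {"$eq": "house 1, street 1"}},
)
for match in results["matches"]:
    print(match["score"], match["metadata"]["text"])
```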

Hi.

Thanks for the tips. I have implemented a basic workflow which works (I generate the embeddings for a report, upload them to the cloud, and then load them into memory to find the nearest match for each question, so no vector database is involved at the moment), but my biggest issue is the accuracy of the nearest-matched embeddings for each question, as a lot of embeddings are quite similar to each other (working with UK property documents, so there are about 7–8 different sections relating to listed buildings alone).

Most questions and their answers are accurate if they aren't related to similar embeddings, but for those problematic ones my cosine similarity function usually picks a similar but incorrect embedding. I can't think of a better way to generate those embeddings, so I guess I will always have a problem with these similar sections and questions.
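
For reference, my in-memory matching is essentially this (a minimal numpy sketch; the variable names are just illustrative):

```python
import numpy as np

def top_k_matches(query_vec: np.ndarray, embeddings: np.ndarray, k: int = 3):
    """Return indices of the k rows of `embeddings` most cosine-similar to `query_vec`."""
    sims = embeddings @ query_vec / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_vec)
    )
    return np.argsort(sims)[::-1][:k]
```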


Hey, did you make any progress on this?

What if you save the JSON-related categorisation data as metadata instead of vectorising it, and then use GPT-4 to understand the request and formulate a query that first filters the results based on the metadata rather than the vectorised data?

Then, re-run the query on the now-filtered vectorised data, so that you can be more confident the output is accurate?
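
Something like this sketch is what I have in mind. The metadata schema (`property_ref`, `section`) and the prompt are made up, and it assumes GPT-4 actually returns valid JSON:

```python
import json
import openai
import pinecone

openai.api_key = "..."
pinecone.init(api_key="...", environment="us-east1-gcp")
index = pinecone.Index("property-reports")  # hypothetical index from earlier posts

question = "List all planning charges for house number 1, street 1"

# 1. Let GPT-4 turn the free-text request into a metadata filter
resp = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": 'Return JSON like {"property_ref": "...", "section": "..."} extracted from the question.'},
        {"role": "user", "content": question},
    ],
)
meta = json.loads(resp["choices"][0]["message"]["content"])

# 2. Vector search restricted to the rows that survive the metadata filter
query_vector = openai.Embedding.create(
    model="text-embedding-ada-002", input=question
)["data"][0]["embedding"]
results = index.query(
    vector=query_vector,
    top_k=5,
    include_metadata=True,
    filter={k: {"$eq": v} for k, v in meta.items()},
)
```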