Preprocessing Techniques for Generating Embedding Vectors from Legal Texts with text-embedding-3-large

I’m currently working on a project involving court decisions, and I’m planning to use the text-embedding-3-large model for embeddings. These decisions often contain complex text, and accurate results are crucial.

While most of my data consists of ordinary prose, some texts also contain experimental procedures and results (numbers). Is there any guidance, or are there any suggestions, on preprocessing the text chunks used to generate the embedding vectors?

I’m particularly interested in the following preprocessing techniques:

  1. Should I remove text chunks containing numerical values before feeding them into the embedding model?
  2. Should I delete line breaks indicating different paragraphs, or should I keep them intact?
  3. Does converting all characters to lowercase improve the quality of embedding vectors?

If you have any insights or resources on how these preprocessing techniques might affect the quality of text-embedding-3-large embeddings, sharing this information would greatly benefit my project and others working in similar domains. Given the importance of the potential outcomes, I kindly request your recommendations.
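
For context, the sketch below is roughly the light-touch preprocessing I’m experimenting with; the three flags correspond to the three questions above, and the defaults are only my current guesses, not recommendations.

```python
import re

def preprocess_chunk(text: str,
                     keep_numbers: bool = True,
                     keep_paragraph_breaks: bool = True,
                     lowercase: bool = False) -> str:
    """Light-touch cleanup of a decision chunk before embedding.

    The three flags mirror questions 1-3 above; the defaults are just
    my current guesses, not recommendations.
    """
    # Question 1: optionally drop sentences that are mostly numeric.
    if not keep_numbers:
        sentences = re.split(r"(?<=[.!?])\s+", text)
        sentences = [s for s in sentences
                     if len(re.findall(r"\d", s)) < 0.3 * max(len(s), 1)]
        text = " ".join(sentences)

    # Question 2: keep paragraph breaks, or collapse all whitespace.
    if keep_paragraph_breaks:
        text = re.sub(r"[ \t]+", " ", text)      # collapse spaces and tabs only
        text = re.sub(r"\n{3,}", "\n\n", text)   # cap consecutive blank lines
    else:
        text = re.sub(r"\s+", " ", text)         # flatten everything to one line

    # Question 3: optional lowercasing.
    if lowercase:
        text = text.lower()

    return text.strip()
```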

Note: The database contains 3.2 million decisions, and I have to convert all of them to vectors.
The current database is MongoDB, but I’m generating the vectors and saving them to a Milvus database.
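
The conversion pipeline itself is roughly the following sketch; the connection strings, the court_decisions collection, the field names, and the batch size are placeholders for illustration, and error handling and retries are omitted.

```python
from openai import OpenAI
from pymongo import MongoClient
from pymilvus import MilvusClient

openai_client = OpenAI()                              # reads OPENAI_API_KEY from the environment
mongo = MongoClient("mongodb://localhost:27017")      # placeholder connection string
milvus = MilvusClient(uri="http://localhost:19530")   # placeholder Milvus endpoint

BATCH = 64  # arbitrary batch size for the embeddings endpoint

def embed_batch(texts):
    """Embed a batch of chunks with text-embedding-3-large (3072 dimensions)."""
    resp = openai_client.embeddings.create(model="text-embedding-3-large", input=texts)
    return [d.embedding for d in resp.data]

def flush(batch):
    """Embed a batch of MongoDB documents and insert them into Milvus."""
    vectors = embed_batch([d["text"] for d in batch])
    milvus.insert(
        collection_name="court_decisions",            # placeholder collection with a 3072-dim vector field
        data=[{"id": str(d["_id"]), "vector": v, "text": d["text"]}
              for d, v in zip(batch, vectors)],
    )

batch = []
for doc in mongo["court"]["decisions"].find({}, {"_id": 1, "text": 1}):
    batch.append(doc)
    if len(batch) == BATCH:
        flush(batch)
        batch = []
if batch:   # flush the final partial batch
    flush(batch)
```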


I vectorized approximately 20,000 Court of Cassation decisions related to assault (dim: 3072). However, when I vectorize questions that might be submitted to the search engine with the same model and compare them against these 20,000 decisions, I only get moderately consistent results. Sometimes, even when I phrase the query with the very words (translated) that appear in the decisions, the results are still inconsistent. If you have any suggestions, I would appreciate it if you could point out whether the problem lies with Milvus or with the methods I am using.
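
For reference, my query side looks roughly like the sketch below; the endpoint, the assault_decisions collection name, and the COSINE metric are placeholders, and the first thing I’m double-checking is that the metric type at search time matches the one the index was built with.

```python
from openai import OpenAI
from pymilvus import MilvusClient

openai_client = OpenAI()
milvus = MilvusClient(uri="http://localhost:19530")   # placeholder Milvus endpoint

def search(question: str, top_k: int = 5):
    """Embed the question with the same model used for the decisions, then search Milvus."""
    emb = openai_client.embeddings.create(
        model="text-embedding-3-large",   # must match the model used at index time
        input=[question],
    ).data[0].embedding

    return milvus.search(
        collection_name="assault_decisions",      # placeholder collection name
        data=[emb],
        limit=top_k,
        search_params={"metric_type": "COSINE"},  # must match the index's metric type
        output_fields=["text"],
    )

hits = search("What is the penalty range for intentional injury?")  # example query
for hit in hits[0]:
    print(hit["distance"], hit["entity"]["text"][:80])
```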

Okay. You need more, and I mean WAY MORE than just a run-of-the-mill vector embed for a task like that. You need absolute accuracy. And with that size?

Eesh

You’re looking at a task that’s way more advanced.
Think… uhm
Chunked embeddings with DSPy AND PyG, using UMAP for graph optimization. You can’t afford chunks that land completely outside the LLM manifold’s “reach”, and legal text has plenty of outliers.
Then there’s the whole way back.

Didn’t get it?
That’s what I mean
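
To make the UMAP part concrete, here’s a rough sketch of how you could map your chunk embeddings and flag outliers; the file name, the UMAP parameters, and the use of LocalOutlierFactor are my assumptions, not a prescription.

```python
import numpy as np
import umap                                    # pip install umap-learn
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor

# embeddings: (n_chunks, 3072) array of text-embedding-3-large vectors
embeddings = np.load("chunk_embeddings.npy")   # placeholder file

# Project to 2D to eyeball the cluster structure of the corpus.
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, metric="cosine", random_state=42)
coords = reducer.fit_transform(embeddings)

# Flag chunks that sit far from everything else ("outside the reach").
lof = LocalOutlierFactor(n_neighbors=20, metric="cosine")
labels = lof.fit_predict(embeddings)           # -1 marks outliers
outlier_idx = np.where(labels == -1)[0]
print(f"{len(outlier_idx)} outlier chunks out of {len(embeddings)}")

# Quick visual check of where the outliers sit.
plt.scatter(coords[:, 0], coords[:, 1], s=2)
plt.scatter(coords[outlier_idx, 0], coords[outlier_idx, 1], s=6, c="red")
plt.savefig("chunk_map.png")
```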

Legally, there are no restrictions, at least not for the institution I work for, and personal data in the documents has already been sanitized. I don’t intend to go into that much depth and set up a neural network system running on my own server. Instead, I’m trying to leverage the ChatGPT APIs and the server resources I already have. My server is nothing special; it’s just an ordinary dedicated server.

You’re talking about RAG, aren’t you?
I am too.
I’m just saying you’ll want high accuracy in your retrieval results, as in the sketch below.
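
Something along these lines on top of the retrieval step, a bare-bones sketch where the gpt-4o model name, the prompt wording, and the answer_with_rag helper are all illustrative assumptions:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_with_rag(question: str, retrieved_chunks: list[str]) -> str:
    """Bare-bones RAG step: stuff the retrieved decision chunks into the prompt."""
    context = "\n\n---\n\n".join(retrieved_chunks)
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Answer using only the court decisions provided as context."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```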