Preprocessing Techniques for Generating Embedding Vectors from Legal Texts with text-embedding-3-large

I’m currently working on a project involving court decisions, and I’m planning to use the text-embedding-3-large model for embeddings. These decisions often contain complex text, and accurate results are crucial.

While most of my data consists of ordinary prose, some texts also contain experimental procedures and results (numbers). Is there any guidance, or are there any suggestions, on preprocessing the text chunks used to generate the embedding vectors?

I’m particularly interested in the following preprocessing techniques:

  1. Should I remove text chunks containing numerical values before feeding them into the embedding model?
  2. Should I delete line breaks indicating different paragraphs, or should I keep them intact?
  3. Does converting all characters to lowercase improve the quality of embedding vectors?

If you have any insights or resources on how these preprocessing techniques might affect the quality of text-embedding-3-large embeddings, sharing this information would greatly benefit my project and others working in similar domains. Given the importance of the potential outcomes, I kindly request your recommendations.
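
For context, the sketch below is roughly the light-touch preprocessing I’m experimenting with; the three flags correspond to the three questions above, and the defaults are only my current guesses, not recommendations.

```python
import re

def preprocess_chunk(text: str,
                     keep_numbers: bool = True,
                     keep_paragraph_breaks: bool = True,
                     lowercase: bool = False) -> str:
    """Light-touch cleanup of a decision chunk before embedding.

    The three flags mirror questions 1-3 above; the defaults are just
    my current guesses, not recommendations.
    """
    # Question 1: optionally drop sentences that are mostly numeric.
    if not keep_numbers:
        sentences = re.split(r"(?<=[.!?])\s+", text)
        sentences = [s for s in sentences
                     if len(re.findall(r"\d", s)) < 0.3 * max(len(s), 1)]
        text = " ".join(sentences)

    # Question 2: keep paragraph breaks, or collapse all whitespace.
    if keep_paragraph_breaks:
        text = re.sub(r"[ \t]+", " ", text)      # collapse spaces and tabs only
        text = re.sub(r"\n{3,}", "\n\n", text)   # cap consecutive blank lines
    else:
        text = re.sub(r"\s+", " ", text)         # flatten everything to one line

    # Question 3: optional lowercasing.
    if lowercase:
        text = text.lower()

    return text.strip()
```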

Note: The database contains 3.2 million decisions, and I have to convert all of them to vectors.
The current database is MongoDB, but I’m generating the vectors and saving them to a Milvus database.
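
The conversion pipeline itself is roughly the following sketch; the connection strings, the court_decisions collection, the field names, and the batch size are placeholders for illustration, and error handling and retries are omitted.

```python
from openai import OpenAI
from pymongo import MongoClient
from pymilvus import MilvusClient

openai_client = OpenAI()                              # reads OPENAI_API_KEY from the environment
mongo = MongoClient("mongodb://localhost:27017")      # placeholder connection string
milvus = MilvusClient(uri="http://localhost:19530")   # placeholder Milvus endpoint

BATCH = 64  # arbitrary batch size for the embeddings endpoint

def embed_batch(texts):
    """Embed a batch of chunks with text-embedding-3-large (3072 dimensions)."""
    resp = openai_client.embeddings.create(model="text-embedding-3-large", input=texts)
    return [d.embedding for d in resp.data]

def flush(batch):
    """Embed a batch of MongoDB documents and insert them into Milvus."""
    vectors = embed_batch([d["text"] for d in batch])
    milvus.insert(
        collection_name="court_decisions",            # placeholder collection with a 3072-dim vector field
        data=[{"id": str(d["_id"]), "vector": v, "text": d["text"]}
              for d, v in zip(batch, vectors)],
    )

batch = []
for doc in mongo["court"]["decisions"].find({}, {"_id": 1, "text": 1}):
    batch.append(doc)
    if len(batch) == BATCH:
        flush(batch)
        batch = []
if batch:   # flush the final partial batch
    flush(batch)
```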


I vectorized approximately 20,000 Court of Cassation decisions related to assault (dim: 3072). However, when I vectorize questions that might be submitted to the search engine with the same model and compare them against these 20,000 decisions, I only get moderately consistent results. Sometimes, even when I phrase the query with the very words (translated) that appear in the decisions, the results are still inconsistent. If you have any suggestions, I would appreciate it if you could point out whether the problem lies with Milvus or with the methods I am using.
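
For reference, my query side looks roughly like the sketch below; the endpoint, the assault_decisions collection name, and the COSINE metric are placeholders, and the first thing I’m double-checking is that the metric type at search time matches the one the index was built with.

```python
from openai import OpenAI
from pymilvus import MilvusClient

openai_client = OpenAI()
milvus = MilvusClient(uri="http://localhost:19530")   # placeholder Milvus endpoint

def search(question: str, top_k: int = 5):
    """Embed the question with the same model used for the decisions, then search Milvus."""
    emb = openai_client.embeddings.create(
        model="text-embedding-3-large",   # must match the model used at index time
        input=[question],
    ).data[0].embedding

    return milvus.search(
        collection_name="assault_decisions",      # placeholder collection name
        data=[emb],
        limit=top_k,
        search_params={"metric_type": "COSINE"},  # must match the index's metric type
        output_fields=["text"],
    )

hits = search("What is the penalty range for intentional injury?")  # example query
for hit in hits[0]:
    print(hit["distance"], hit["entity"]["text"][:80])
```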

Okay. You need more, and I mean WAY MORE than just a run-of-the-mill vector embed for a task like that. You need absolute accuracy. And with that size?

Eesh

You’re looking at a task that’s way more advanced.
Think… uhm
Chunked embeddings with DSPy AND PyG, using UMAP for graph optimization. You can’t afford chunks that land completely outside the LLM manifold’s “reach”, and legal text has plenty of outliers.
Then there’s the whole way back.

Didn’t get it?
That’s what I mean
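
To make the UMAP part concrete, here’s a rough sketch of how you could map your chunk embeddings and flag outliers; the file name, the UMAP parameters, and the use of LocalOutlierFactor are my assumptions, not a prescription.

```python
import numpy as np
import umap                                    # pip install umap-learn
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor

# embeddings: (n_chunks, 3072) array of text-embedding-3-large vectors
embeddings = np.load("chunk_embeddings.npy")   # placeholder file

# Project to 2D to eyeball the cluster structure of the corpus.
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, metric="cosine", random_state=42)
coords = reducer.fit_transform(embeddings)

# Flag chunks that sit far from everything else ("outside the reach").
lof = LocalOutlierFactor(n_neighbors=20, metric="cosine")
labels = lof.fit_predict(embeddings)           # -1 marks outliers
outlier_idx = np.where(labels == -1)[0]
print(f"{len(outlier_idx)} outlier chunks out of {len(embeddings)}")

# Quick visual check of where the outliers sit.
plt.scatter(coords[:, 0], coords[:, 1], s=2)
plt.scatter(coords[outlier_idx, 0], coords[outlier_idx, 1], s=6, c="red")
plt.savefig("chunk_map.png")
```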

Legally, there are no restrictions, at least not for the institution I work for, and personal data in the documents has already been sanitized. I don’t intend to go into that much depth and set up a neural network system running on my own server. Instead, I’m trying to leverage the ChatGPT APIs and the server resources I already have. My server is nothing special; it’s just an ordinary dedicated server.

You’re talking about RAG, aren’t you?
I am too.
I’m just saying you’ll want high accuracy in your retrieval results, as in the sketch below.
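
Something along these lines on top of the retrieval step, a bare-bones sketch where the gpt-4o model name, the prompt wording, and the answer_with_rag helper are all illustrative assumptions:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_with_rag(question: str, retrieved_chunks: list[str]) -> str:
    """Bare-bones RAG step: stuff the retrieved decision chunks into the prompt."""
    context = "\n\n---\n\n".join(retrieved_chunks)
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Answer using only the court decisions provided as context."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```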