Text Pre-processing for text-embedding-ada-002

Hello Community

Which pre-processing techniques have you used, and which work best for preserving contextual meaning?
Right now I’m:

-lowercasing all the text
-removing punctuation
-removing numbers
-removing multiple whitespace/removing newlines
-removing special characters
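
For reference, the steps above could be sketched roughly like this. The function name `preprocess`, the order of the steps, and the reading of "special characters" as anything that is not an ASCII letter or whitespace are all assumptions for illustration, not the exact code in use.

```python
import re
import string


def preprocess(text: str) -> str:
    """Hypothetical version of the cleaning steps listed above."""
    # Lowercase all the text.
    text = text.lower()
    # Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Remove numbers.
    text = re.sub(r"\d+", "", text)
    # Remove "special characters" (assumed here: anything left that is
    # not a-z or whitespace, which also drops accented letters).
    text = re.sub(r"[^a-z\s]", "", text)
    # Collapse newlines and repeated whitespace into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    return text
```

With this sketch, `preprocess("Hello,  World!\n42 apples cost $3.50")` returns `"hello world apples cost"`, which shows how much of the original signal (casing, amounts, sentence boundaries) is discarded before embedding.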

In my experience, removing stop words does not help the results, and lemmatizing words helps even less.

Now, this could probably vary by use case.
My specific use case is scoring documents, but some of them differ only slightly, so the embeddings need to capture as much contextual meaning as possible.

I have some thoughts.

  1. Do as little pre-processing as possible.
  2. Removing punctuation can drastically change the meaning of text, and the point of embeddings is to enable semantic search.
  3. Numbers are important. For semantic search the values of the numbers aren’t very important, but their presence is. I wouldn’t remove them. You might consider replacing them with some type of placeholder value, say every significant digit is replaced with a 5 or something. But I think that’s more effort than it’s worth and I’m not convinced it is worth anything.
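
If you did want to try the placeholder idea from point 3, a minimal sketch might look like the following. It assumes "every significant digit" can be approximated by replacing every digit character with 5, and it pairs that with the only other clean-up I would consider low-risk, collapsing whitespace. The names `mask_digits` and `light_clean` are made up for this example.

```python
import re


def mask_digits(text: str) -> str:
    """Replace each digit with 5 so numbers stay present but lose their value."""
    return re.sub(r"\d", "5", text)


def light_clean(text: str) -> str:
    """Minimal pre-processing: keep case, punctuation, and numbers;
    only collapse runs of whitespace (including newlines)."""
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    sample = "Invoice 1234 is due\n2023-01-15, total $99.50"
    print(mask_digits(light_clean(sample)))
    # Invoice 5555 is due 5555-55-55, total $55.55
```

Whether masking helps at all is an open question; comparing retrieval or scoring quality with and without it on your own documents is the only way to tell.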

If you do any pre-processing at all, I would focus on ensuring everything has correct spelling, grammar, and punctuation.

If verbatim text in the embeddings isn’t critically important to you, something else you might consider doing is to augment your embeddings with a bunch of synthetic data. Basically, take whatever you’re embedding and have gpt-3.5-turbo or gpt-4 rewrite it a bunch of times with different goals: more verbose, more concise, executive summary, etc. Then embed those as well. Keep in your database the original text, the rewritten text, and the embedding. Then, when doing a search of the vector DB, you will have essentially increased the footprint of the thing you’re embedding, making it easier to surface.
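
A rough sketch of that augmentation loop is below. To keep it self-contained, the model calls are passed in as plain functions: `rewrite(text, goal)` stands in for a chat-model call and `embed(text)` for an embedding call. Both names, the `Record` layout, and the goal strings are assumptions for illustration, not a real API.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical rewrite goals, taken from the examples in the post.
REWRITE_GOALS = ("more verbose", "more concise", "executive summary")


@dataclass
class Record:
    original: str           # the source text, kept for display
    text: str               # the text that was actually embedded
    goal: str               # "original" or the rewrite goal used
    embedding: List[float]


def augment(
    original: str,
    rewrite: Callable[[str, str], str],
    embed: Callable[[str], List[float]],
    goals: Sequence[str] = REWRITE_GOALS,
) -> List[Record]:
    """Embed the original plus one rewritten variant per goal."""
    records = [Record(original, original, "original", embed(original))]
    for goal in goals:
        variant = rewrite(original, goal)
        records.append(Record(original, variant, goal, embed(variant)))
    return records
```

Every record carries the original text, so a hit on any variant can be mapped back to the same source document at query time.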

Also, if you’re working a lot with embeddings, I recommend using HyDE (Hypothetical Document Embeddings).
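
For anyone unfamiliar with it, the idea behind HyDE is to have a language model write a hypothetical answer document for the query, embed that document instead of the raw query, and search with the resulting vector. Here is a minimal sketch, assuming the model calls are supplied as functions: `generate(query)` and `embed(text)` are placeholders, and the in-memory `corpus` of `(text, vector)` pairs stands in for a real vector DB.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hyde_search(
    query: str,
    generate: Callable[[str], str],
    embed: Callable[[str], List[float]],
    corpus: Sequence[Tuple[str, List[float]]],
    top_k: int = 3,
) -> List[Tuple[float, str]]:
    """Search with the embedding of a generated hypothetical document."""
    # 1. Ask the model to write a plausible answer to the query.
    hypothetical = generate(query)
    # 2. Embed the hypothetical document rather than the query itself.
    query_vec = embed(hypothetical)
    # 3. Rank stored documents by similarity to that vector.
    scored = [(cosine(query_vec, vec), text) for text, vec in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

The hypothetical document may contain factual mistakes; that is acceptable here, since it is only used to land in the right neighbourhood of the embedding space and is never shown to the user.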