I have some thoughts.
- Do as little pre-processing as possible.
- Removing punctuation can drastically change the meaning of text and the point of embeddings is to enable semantic search.
- Numbers are important. For semantic search the exact values of the numbers aren’t very important, but their presence is, so I wouldn’t remove them. You might consider replacing them with some kind of placeholder, say every digit is replaced with a 5 or something (see the sketch after this list). But I think that’s more effort than it’s worth, and I’m not convinced it buys you anything.
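If you did want to try the placeholder idea, it’s only a regex; a minimal sketch (the placeholder character is arbitrary):

```python
import re

def mask_digits(text: str, placeholder: str = "5") -> str:
    """Replace every digit with a placeholder so numbers keep their
    presence and rough shape without their exact values."""
    return re.sub(r"\d", placeholder, text)

print(mask_digits("Revenue grew 12.7% to $3,400,000 in 2023."))
# -> "Revenue grew 55.5% to $5,555,555 in 5555."
```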
If you do any pre-processing at all, I would focus on ensuring everything has correct spelling, grammar, and punctuation.
If verbatim text in the embeddings isn’t critically important to you, something else you might consider is augmenting your embeddings with a bunch of synthetic data. Basically, take whatever you’re embedding and have gpt-3.5-turbo or gpt-4 rewrite it a bunch of times with different goals: more verbose, more concise, executive summary, etc., then embed those as well. Store the original text, the rewritten text, and the embedding in your database. Then, when you search the vector DB, you’ve essentially increased the footprint of the thing you’re embedding, which makes it easier to surface.
Also, if you’re working a lot with embeddings, I recommend looking into HyDE (Hypothetical Document Embeddings): instead of embedding the raw query, you have an LLM write a hypothetical answer to it and embed that, which tends to land closer in the vector space to the documents you actually want.
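The core of it is only a few lines; a sketch, with the model names and the `vector_db.search` call being placeholders for whatever you’re using:

```python
from openai import OpenAI

client = OpenAI()

def hyde_query_embedding(query: str) -> list[float]:
    """HyDE: embed a hypothetical answer to the query rather than the query itself."""
    hypothetical = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user",
             "content": f"Write a short passage that answers this question:\n\n{query}"},
        ],
    ).choices[0].message.content

    return client.embeddings.create(
        model="text-embedding-3-small",
        input=hypothetical,
    ).data[0].embedding

# results = vector_db.search(hyde_query_embedding("How do I rotate API keys?"), top_k=5)
```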