I have some thoughts.
- Do as little pre-processing as possible.
- Removing punctuation can drastically change the meaning of text and the point of embeddings is to enable semantic search.
- Numbers are important. For semantic search the exact values of the numbers aren’t very important, but their presence is, so I wouldn’t remove them. You might consider replacing them with some kind of placeholder, say every digit is replaced with a 5 or something (see the sketch after this list). But I think that’s more effort than it’s worth, and I’m not convinced it buys you anything.
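If you did want to try the placeholder idea, it’s only a regex; a minimal sketch (the placeholder character is arbitrary):

```python
import re

def mask_digits(text: str, placeholder: str = "5") -> str:
    """Replace every digit with a placeholder so numbers keep their
    presence and rough shape without their exact values."""
    return re.sub(r"\d", placeholder, text)

print(mask_digits("Revenue grew 12.7% to $3,400,000 in 2023."))
# -> "Revenue grew 55.5% to $5,555,555 in 5555."
```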
If you do any pre-processing at all, I would focus on ensuring everything has correct spelling, grammar, and punctuation.
If verbatim text in the embeddings isn’t critically important to you, something else you might consider is augmenting your embeddings with a bunch of synthetic data. Basically, take whatever you’re embedding and have gpt-3.5-turbo or gpt-4 rewrite it a bunch of times with different goals: more verbose, more concise, executive summary, etc., then embed those as well. Store the original text, the rewritten text, and the embedding in your database. Then, when you search the vector DB, you’ve essentially increased the footprint of the thing you’re embedding, which makes it easier to surface.
Also, if you’re working a lot with embeddings, I recommend looking into HyDE (Hypothetical Document Embeddings): instead of embedding the raw query, you have an LLM write a hypothetical answer to it and embed that, which tends to land closer in the vector space to the documents you actually want.
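The core of it is only a few lines; a sketch, with the model names and the `vector_db.search` call being placeholders for whatever you’re using:

```python
from openai import OpenAI

client = OpenAI()

def hyde_query_embedding(query: str) -> list[float]:
    """HyDE: embed a hypothetical answer to the query rather than the query itself."""
    hypothetical = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user",
             "content": f"Write a short passage that answers this question:\n\n{query}"},
        ],
    ).choices[0].message.content

    return client.embeddings.create(
        model="text-embedding-3-small",
        input=hypothetical,
    ).data[0].embedding

# results = vector_db.search(hyde_query_embedding("How do I rotate API keys?"), top_k=5)
```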