Preprocessing for embeddings

yemane · July 11, 2023, 3:31pm

I’m currently trying to do some topic modeling on articles. I have many (40+) possible categories. I’m currently doing something similar to Recommendation_using_embeddings.ipynb.

The tutorial in the link doesn’t include any pre-processing of the text sent to the text-embedding-ada-002 model.

I am wondering whether pre-processing the article text makes sense, e.g. removing stop words and punctuation, making words lowercase, etc.
I ask because I found an article where someone used a different embeddings model and did pre-process the text: multi-class-text-classification-with-doc2vec-logistic-regression.

If you agree that pre-processing makes sense, is there a best practice for what kind of pre-processing to do?

AidanM · July 11, 2023, 3:34pm

I’ve never thought about pre-processing, so it might be a good idea. For my own personal purposes, without the pre-processing, I have found lots of success with embeddings, so I think you are fine. I use embeddings for text that gets lifted off of construction documents, so there is little consistency with punctuation and capitalization.

wfhbrian · July 11, 2023, 3:39pm

Pre-processing the text before using it with the text-embedding-ada-002 model can indeed be beneficial, but isn’t necessary.

A much less sophisticated model is used in the “multi-class-text-classification-with-doc2vec-logistic-regression” example. Less sophisticated models benefit more from the reduction-type preprocessing strategies you listed.

For text-embedding-ada-002, the pre-processing I would consider is more likely to be additive: adding content to the embedding text input rather than removing it. This increases the overall relevant context provided to the model. What to add is anything that might be necessary for your specific solution.
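
As a rough illustration of the additive idea, the sketch below prepends extra context to the article body before it is sent for embedding. The function name `build_embedding_input` and the `title`, `tags`, and `source` fields are hypothetical; which context is worth adding depends on the data, and nothing here is specific to text-embedding-ada-002.

```python
def build_embedding_input(title, body, tags=None, source=None):
    """Prepend optional context lines to the article body.

    All field names are illustrative assumptions, not part of any API.
    """
    parts = []
    if title:
        parts.append(f"Title: {title.strip()}")
    if tags:
        parts.append("Tags: " + ", ".join(t.strip() for t in tags))
    if source:
        parts.append(f"Source: {source.strip()}")
    parts.append(body.strip())
    # One context item per line, with the original text last.
    return "\n".join(parts)


example_input = build_embedding_input(
    title="Transfer window opens",
    body="Clubs may register new players from today.",
    tags=["football", "transfers"],
)
```

The resulting string is what would be passed to the embedding model in place of the bare article body.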

The specific pre-processing steps you choose will depend on your specific use case, the characteristics of your text data, and the model you choose for embedding. It can be helpful to experiment with different pre-processing techniques and evaluate their impact on the quality of the embeddings and the performance for your desired outcome.
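
One way to run that kind of experiment is a small harness that scores each pre-processing variant on a hand-labeled sample. This is only a sketch: `top1_accuracy`, `toy_embed`, and the sample data are made up for illustration. `toy_embed` is a case-sensitive keyword counter standing in for a real embedding call, so the numbers it produces say nothing about how text-embedding-ada-002 behaves; swap in your own embedding function to get meaningful results.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top1_accuracy(preprocess, embed, labeled_articles, topics):
    """Fraction of articles whose nearest topic matches the hand label."""
    topic_vecs = {name: embed(preprocess(desc)) for name, desc in topics.items()}
    correct = 0
    for text, label in labeled_articles:
        vec = embed(preprocess(text))
        predicted = max(topic_vecs, key=lambda name: cosine(vec, topic_vecs[name]))
        correct += predicted == label
    return correct / len(labeled_articles)


def toy_embed(text):
    # Placeholder for a real embedding call: case-sensitive keyword counts.
    vocab = ["football", "election"]
    words = text.split()
    return [float(words.count(w)) for w in vocab]


# Hypothetical topics and labeled sample, for illustration only.
topics = {"sports": "football", "politics": "election"}
labeled_articles = [
    ("Football news: Football league holds election", "sports"),
    ("election results are in", "politics"),
]
variants = {"raw": lambda t: t, "lowercase": str.lower}

results = {
    name: top1_accuracy(fn, toy_embed, labeled_articles, topics)
    for name, fn in variants.items()
}
```

`results` maps each variant name to its accuracy on the sample, so variants can be compared side by side on the same labeled articles.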

I hope this helps! Let me know if you have any further questions.

yemane · July 26, 2023, 7:04pm

The use case is trying to use embeddings to find the most similar topic to an article. A kind of topic modeling/classification.
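
For reference, a minimal sketch of that matching step, assuming the article and each topic have already been embedded. The three-dimensional vectors and the name `rank_topics` are placeholders; real embeddings would come from the embedding model and be much longer.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_topics(article_vec, topic_vecs, k=3):
    """Return the k topics most similar to the article, best first."""
    scored = [(name, cosine(article_vec, vec)) for name, vec in topic_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Placeholder vectors; real ones would be returned by the embedding model.
topic_vecs = {
    "sports": [1.0, 0.0, 0.0],
    "politics": [0.0, 1.0, 0.0],
    "science": [0.0, 0.0, 1.0],
}
article_vec = [0.9, 0.1, 0.0]

top_two = rank_topics(article_vec, topic_vecs, k=2)
```

With 40+ categories, looking at the top few candidates and their scores rather than only the single best match can make it easier to spot topics that sit too close together.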

Would it make sense to make everything lowercase and remove non-alphanumeric characters in this case? I'm thinking these steps would at least remove some unnecessary noise.
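
The two steps in question could be sketched as below. The function name `normalize` is hypothetical, and the pattern only keeps ASCII letters and digits, so accented or non-Latin characters would be dropped as well. Whether this helps or hurts for a given embedding model is something to measure rather than assume.

```python
import re


def normalize(text):
    """Lowercase the text and replace non-alphanumeric characters with spaces."""
    text = text.lower()
    # Anything that is not an ASCII letter, digit, or whitespace becomes a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse the runs of whitespace left behind by the substitution.
    return re.sub(r"\s+", " ", text).strip()


cleaned = normalize("Hello, World! It's 2023.")
```

Note that contractions and decimals get split by this step (`It's` becomes `it s`, and `3.5` becomes `3 5`), so some information is lost along with the noise.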
