Preprocessing for embeddings

I’m trying to do some topic modeling on articles, with many (40+) possible categories. My approach is similar to Recommendation_using_embeddings.ipynb.

The tutorial in the link doesn’t include any pre-processing of the text sent to the text-embedding-ada-002 model.

I am wondering whether pre-processing the article text makes sense: removing stop words and punctuation, lowercasing words, and so on.
I ask because I found an article, multi-class-text-classification-with-doc2vec-logistic-regression, where the author used a different embeddings model and did pre-process the text.

If you agree that pre-processing makes sense, is there a best practice for what kind of pre-processing to do?


I’ve never thought about pre-processing, so it might be a good idea. For my own purposes, I have had a lot of success with embeddings without any pre-processing, so I think you are fine. I use embeddings for text lifted from construction documents, where punctuation and capitalization are inconsistent.

Pre-processing the text before using it with the text-embedding-ada-002 model can indeed be beneficial, but isn’t necessary.

A much less sophisticated model is used in the “multi-class-text-classification-with-doc2vec-logistic-regression” example. Less sophisticated models benefit more from the reduction-type preprocessing strategies you listed.

For text-embedding-ada-002, the pre-processing I would consider is more likely to be additive: it adds content to the embedding text input rather than removing it, which increases the relevant context provided to the model. What to add depends on what your specific solution needs.
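To illustrate what additive pre-processing could look like, here is a minimal sketch. The function name and the fields (title, source, keywords) are hypothetical and only stand in for whatever extra context your articles actually have; the returned string is what you would send to the embedding model.

```python
def build_embedding_input(body, title=None, source=None, keywords=None):
    """Prepend optional context fields to the article body.

    All field names here are illustrative assumptions, not part of any API.
    """
    parts = []
    if title:
        parts.append(f"Title: {title}")
    if source:
        parts.append(f"Source: {source}")
    if keywords:
        parts.append("Keywords: " + ", ".join(keywords))
    # The original body is kept intact; context is only added in front of it.
    parts.append(body.strip())
    return "\n".join(parts)


text = build_embedding_input(
    "The central bank raised rates again this week.",
    title="Rates rise",
    keywords=["economy", "interest rates"],
)
```

Whether the extra fields actually help is something to check on your own data rather than assume.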

The specific pre-processing steps you choose will depend on your specific use case, the characteristics of your text data, and the model you choose for embedding. It can be helpful to experiment with different pre-processing techniques and evaluate their impact on the quality of the embeddings and the performance for your desired outcome.
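One way to run that kind of experiment is sketched below, assuming you have a small set of articles with known topics. The `embed` argument is a placeholder for whatever function returns a vector for a string (for example, a wrapper around your embedding API call); the names `accuracy_for_variant` and `compare_variants` are made up for this example.

```python
import math


def _cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def accuracy_for_variant(preprocess, embed, labeled_articles, topic_descriptions):
    """Fraction of articles whose nearest topic matches the known label.

    The same preprocess function is applied to topic descriptions and
    articles; that is a design choice of this sketch, not a requirement.
    """
    topic_vecs = {t: embed(preprocess(d)) for t, d in topic_descriptions.items()}
    correct = 0
    for text, label in labeled_articles:
        vec = embed(preprocess(text))
        predicted = max(topic_vecs, key=lambda t: _cosine(vec, topic_vecs[t]))
        correct += predicted == label
    return correct / len(labeled_articles)


def compare_variants(variants, embed, labeled_articles, topic_descriptions):
    """Return {variant name: accuracy} for each pre-processing function."""
    return {
        name: accuracy_for_variant(fn, embed, labeled_articles, topic_descriptions)
        for name, fn in variants.items()
    }
```

You would call it with something like `compare_variants({"raw": lambda s: s, "lower": str.lower}, embed, labeled_articles, topic_descriptions)` and keep whichever variant scores best on a held-out set.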

I hope this helps! Let me know if you have any further questions.


The use case is using embeddings to find the topic most similar to an article, as a kind of topic modeling/classification.
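For reference, the matching step I have in mind looks roughly like this. It assumes the article and each topic have already been embedded into vectors of the same length; the function names and the toy two-dimensional vectors are just for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_topic(article_vec, topic_vecs):
    """Return the topic name whose vector is closest to the article vector."""
    return max(
        topic_vecs,
        key=lambda name: cosine_similarity(article_vec, topic_vecs[name]),
    )


# Toy vectors standing in for real embeddings.
topic_vecs = {"sports": [1.0, 0.0], "finance": [0.0, 1.0]}
best = most_similar_topic([0.9, 0.1], topic_vecs)
```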

Would it make sense to make everything lowercase and remove non-alphanumeric characters in this case? I’m thinking these steps would at least remove some unnecessary noise.
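Concretely, the clean-up I am considering would be something like the sketch below. The function name is hypothetical, and the pattern only keeps ASCII letters and digits, so accented or non-Latin characters would also be dropped.

```python
import re


def normalize(text):
    """Lowercase the text and replace non-alphanumeric characters with spaces."""
    lowered = text.lower()
    # Anything that is not a-z, 0-9 or whitespace becomes a space.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", lowered)
    # Collapse the runs of whitespace left behind by the substitution.
    return re.sub(r"\s+", " ", cleaned).strip()


example = normalize("U.S. GDP rose 3.2%")
```

Here `example` comes out as `u s gdp rose 3 2`, so the step also strips information such as the decimal point and the percent sign, which may or may not count as noise.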