Should we remove special characters from strings before embedding?

After extracting text from files, the text often includes newline characters (such as \n), backslashes (\), and other special characters. When using an embedding API, are these special characters automatically excluded or included in the processing? If they are included, is it recommended to input a ‘cleaned’ version of the string without these special characters for optimal results?

You’ll need to clean the data yourself. The extra tokens will sway its vector, maybe enough that it no longer matches text with the same root words. I’d embed clean data, so it has a better chance of matching the clean data you later feed it.
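Here is a minimal sketch of that kind of cleanup in Python, in case it helps. The function name `clean_for_embedding` and the exact rules are my own assumptions, not anything the embedding API requires: it treats literal escape sequences left over from extraction (a backslash followed by n, r, or t) as whitespace, drops any other stray backslashes, and collapses runs of whitespace. If your text contains things like Windows paths or LaTeX, where backslashes carry meaning, you would want gentler rules.

```python
import re

def clean_for_embedding(text: str) -> str:
    """Hypothetical helper: normalize extracted text before embedding it."""
    # Literal escape sequences left by extraction (backslash + n/r/t) become spaces.
    text = re.sub(r"\\[nrt]", " ", text)
    # Any remaining stray backslashes become spaces too.
    text = text.replace("\\", " ")
    # Collapse real newlines, tabs, and repeated spaces into a single space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

raw = "Hello\\nworld\n\n  foo \\ bar\t"
print(clean_for_embedding(raw))  # Hello world foo bar
```

Whatever rules you settle on, run the same function over both the documents you store and the queries you embed later, so both sides are cleaned the same way.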
