Should we remove special characters from string for embedding

jimkeecn · January 9, 2024, 4:26am

After extracting text from files, the text often includes newline characters (such as \n), backslashes (\), and other special characters. When using an embedding API, are these special characters automatically excluded or included in the processing? If they are included, is it recommended to input a ‘cleaned’ version of the string without these special characters for optimal results?

PaulBellow · January 9, 2024, 4:28am

You’ll need to clean the data yourself. The extra tokens will sway its vector, maybe enough that it no longer matches the root words. I’d embed clean data, so it has a better chance of matching the clean data you feed it later.
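
For illustration, here is a minimal sketch of that kind of cleanup in Python. The helper name `clean_for_embedding` and the exact rules are assumptions, not something from an official guide: it treats leftover literal escape sequences (a backslash followed by n, r, or t) as whitespace, drops other stray backslashes, and collapses runs of whitespace. Adjust it if backslashes carry meaning in your text (file paths, LaTeX, code).

```python
import re


def clean_for_embedding(text: str) -> str:
    """Hypothetical helper: normalize extracted text before embedding it."""
    # Literal escape sequences left over from extraction (the two characters
    # backslash + n/r/t) become a space. Assumption: they are debris, not data.
    text = re.sub(r"\\[nrt]", " ", text)
    # Any remaining stray backslashes become spaces as well.
    text = text.replace("\\", " ")
    # Real newlines, tabs, and repeated spaces collapse into one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


raw = "Hello\\nworld\n\n  foo \\ bar\t baz"
print(clean_for_embedding(raw))
```

Whatever rules you settle on, run the same function over both the stored documents and the incoming queries, so both sides are embedded from text in the same form.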
