Is a comprehensive list available of the text preparation steps that affect embeddings? I know about line feeds because you mention them in the documentation. But I’ve run into other things that seem to make enough of a difference that they can't be ignored. (For example, the cosine similarity between “I have a dream” and “I have a dream.” on davinci is 0.9780036.) Also, it would be good to know whether stop words should be removed or kept. Basically, what should I consider to maximize embedding quality?
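For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of how I compare two embedding vectors, plus the one cleanup step I am sure about (replacing line feeds). The names `cosine_similarity` and `normalize_text` are my own helpers, not part of any library, and collapsing all whitespace is an assumption on my part rather than documented guidance. The vectors themselves would come from whichever embedding endpoint you use; that call is left out here.

```python
import math
import re


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors of floats."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def normalize_text(text):
    """Hypothetical cleanup: turn line feeds and runs of whitespace into one space.

    Whether this helps embedding quality is an assumption to test,
    not an established rule.
    """
    return re.sub(r"\s+", " ", text).strip()
```

Embedding the same text before and after `normalize_text` and comparing the two vectors with `cosine_similarity` is one way to measure how much a given preparation step actually moves the result.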