Is there a comprehensive list of text-preparation steps that affect the embeddings? I know about line feeds because you mention them in the documentation. But I’ve run into other things that make enough of a difference that they can't be ignored. (For example, the cosine similarity between “I have a dream” and “I have a dream.” on davinci is 0.9780036.) It would also be good to know whether stop words should be removed or kept. Basically, what considerations are there for maximizing quality?
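In case it helps anyone reproduce this kind of comparison, here is a minimal sketch of how I'd measure the effect of a preparation step. The function names `cosine_similarity` and `normalize_text` are my own, not from any library, and the normalization shown (replacing line feeds and collapsing whitespace) is only an assumption about what a reasonable baseline might look like. The vectors themselves would come from whichever embedding model you are testing; that call is left out here.

```python
import math
import re

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def normalize_text(text):
    """Hypothetical baseline: turn line feeds and repeated whitespace
    into single spaces, then trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

# Toy vectors standing in for real embeddings (assumed, not model output).
v1 = [1.0, 0.0]
v2 = [0.0, 1.0]
print(cosine_similarity(v1, v1))
print(cosine_similarity(v1, v2))
print(normalize_text("I have\na   dream "))
```

Embedding the same text with and without a given step (trailing punctuation, stop words, and so on) and comparing the two vectors this way gives a rough sense of how much that step matters for a particular model.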