Is there a comprehensive list of text preprocessing steps that affect the embeddings? I know about line feeds because you mention them in the documentation. But I’ve run into other things that make enough of a difference that they can't be ignored. (For example, the cosine similarity between “I have a dream” and “I have a dream.” on davinci is 0.9780036, so a single trailing period changes the embedding.) It would also be good to know whether stop words should be removed or kept. Basically, what should I consider when preparing text to maximize embedding quality?
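For reference, the cosine similarity mentioned above can be computed as in the minimal sketch below. The helper name `cosine_similarity` and the toy vectors are illustrative assumptions; they are not real davinci embeddings, and no embeddings API is called here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy placeholder vectors standing in for two embeddings (assumption:
# real embeddings would come from the embeddings endpoint and be much longer).
vec_a = [0.1, 0.3, 0.5]
vec_b = [0.1, 0.3, 0.6]

similarity = cosine_similarity(vec_a, vec_b)
print(similarity)
```

Running the same function on the embeddings of two variants of a text (with and without punctuation, with and without stop words) is one way to measure how much a given preprocessing step shifts the result.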