Does converting all text to lower case before embedding it make embedding similarity search more accurate?
Upper- and lower-case letters are tokenised differently, so you cannot rule out that it will make some difference. Usually the same word in upper and lower case ends up VERY close together when vectorised, so the two get a very similar “score”, but they are not identical.
If I were building a system that relied on repeatability and I had to use an AI, I would use a “lower()” function to lowercase everything before embedding it. But it “should” not be required in typical use.
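For what it's worth, here is a minimal sketch of that idea in Python. The helper names `normalize_for_embedding` and `cosine_similarity` are made up for illustration and are not part of any embedding library. Collapsing whitespace is an extra assumption on my part, beyond the `lower()` call mentioned above. The point is only that the same normaliser has to run on both the stored documents and the incoming query, so that case variants map to one identical input string.

```python
import math


def normalize_for_embedding(text: str) -> str:
    """Hypothetical helper: canonicalise text before it is sent to an embedding model."""
    # lower() removes case differences; split()/join collapses runs of whitespace
    # (the whitespace step is an assumption, not something the thread requires).
    return " ".join(text.lower().split())


def cosine_similarity(a, b):
    """Plain cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Case variants collapse to the same string, so a deterministic embedding
# model would receive identical input for both.
query = normalize_for_embedding("How Do I RESET my Password?")
stored = normalize_for_embedding("how do i reset my password?")
print(query == stored)
```

You would then pass the normalised string to whatever embedding call you use, and compare the resulting vectors with something like `cosine_similarity`. Keep in mind that lowercasing throws away information that sometimes matters (acronyms, proper nouns), so it trades a little meaning for repeatability.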