Can this API be used to query internal data?

It sure can. Works like a charm. (And very nicely explained by @curt.kennedy )

Though as you start dealing with large datasets, there are other issues that crop up when calculating the context (like duplicates). And then once you’ve eliminated the duplicates, you will look at the resulting context and say “Huh? Can I make this context better for my use case?” – so that kicks off a “post vector search” optimization initiative.


Ergo, this is where mere mortals realize - “I should have used CustomGPT so my company could be using this three weeks ago.”

Hey guys, following up on this discussion.

I’m actually the maintainer of an open-source API that solves this problem of connecting your data to ChatGPT/GPT-4 or building plugins: GitHub - different-ai/embedbase: A dead-simple API to build LLM-powered apps

Regarding duplicates:

Embedbase never computes an embedding twice (through some tricks), saving everyone a lot of money.

You can also do semantic search across multiple datasets.

Feel free to give it a try by running it yourself or trying the free hosted version :slight_smile:


To avoid duplicates, I keep track of what I have embedded previously using the hash of the previously embedded text, and use this hash as an index into my embedding database.

For example, “mouse” has a Sha3-256 hash of

import hashlib

X = "mouse"
H = hashlib.sha3_256(X.encode())
print(f"HASH: {H.hexdigest()}")
# HASH: 6ca66ca0a1713279fbdc1d0f5568d033d3593004251c3f564186a4e4e284cdeb

Then whenever I embed anything else, I compute the hash and check whether it is already in my embedding database. If it is, I don’t have to embed it again; I just pull the previous embedding vector. You won’t have duplicates if you do this!
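A minimal sketch of that hash-then-lookup flow (the `embed` stub and the `embedding_db` dict are hypothetical placeholders for your real embedding call and your real store):

```python
import hashlib

# Hypothetical in-memory embedding store: hash of text -> embedding vector.
embedding_db = {}

def embed(text):
    # Placeholder for a real embedding call (e.g. an embeddings API).
    return [0.0] * 1536

def get_embedding(text):
    """Return the embedding for `text`, computing it only once per unique string."""
    key = hashlib.sha3_256(text.encode()).hexdigest()
    if key not in embedding_db:
        embedding_db[key] = embed(text)  # only embed on a cache miss
    return embedding_db[key]
```

Repeated calls with the same string hit the cache; a differently-cased string hashes to a new key and gets its own entry.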

Note that GPT is case sensitive, so “Mouse” is different than “mouse”, and luckily this results in a separate hash too:

import hashlib

X = "Mouse"
H = hashlib.sha3_256(X.encode())
print(f"HASH: {H.hexdigest()}")
# HASH: 4c2e2fe9ae1d56701bea18593b67dc59d862106f959a132c640352780b5d0339

You can go with lower-case hashes too, but realize GPT “sees” that “Mouse” is different than “mouse”.

Note: Sha3-256 is probably overkill, but that’s what I use these days.

Oh, and to be clear, this is only on the database/lookup side. In my case, for search, I scan the database to create an in-memory data structure: a Python dict whose keys are the hashes and whose values are the numpy versions of the embedding vectors. This dict is then saved as a pickle to S3 and loaded into memory when I am ready to search. So you will periodically update this file as your embedded data (knowledge) changes over time.
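A rough sketch of building and pickling that in-memory structure (the `rows` data is made up, and the S3 upload/download step is elided to keep the example self-contained):

```python
import hashlib
import pickle

import numpy as np

# Hypothetical rows scanned from the embedding database: (text, vector) pairs.
rows = [("mouse", [0.1, 0.2]), ("cat", [0.3, 0.4])]

# Build the in-memory search structure: hash of the text -> numpy vector.
index = {
    hashlib.sha3_256(text.encode()).hexdigest(): np.array(vec)
    for text, vec in rows
}

# Serialize with pickle; in practice this blob would be uploaded to S3
# and re-downloaded whenever the embedded knowledge changes.
blob = pickle.dumps(index)

# Later, at search time, load it back into memory and search against it.
loaded = pickle.loads(blob)
```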


I’ve begun to realize that to do AI projects really well, with excellent efficiency, and financially practical – everything we learned through the ages of computing is suddenly more relevant than ever before.


@louis030195 Love the open source project – nicely done!

The duplicate removal method you mentioned seems to remove duplicate “exact match” strings, right? This scenario almost never happens - because due to the chunking, the strings will be off a little and not become exact duplicates. You can try it out with web content pages and you will see what I mean. What’s needed is a “semantic duplicate” removal.

We tried that as the first approach (the md5 hash) – it worked only in the case of exact match duplicates (see above). The problem is: When chunking web pages, the chunk will always be off by 1-2 characters and then you get duplicates like “Copyright 2023 : Curt Kennedy”. So a semantic duplicate removal is needed. But this approach with the hash works great as a quick spot fix.


The best I could do in this situation is use the lower-cased hash and remove all leading and trailing whitespace before you embed and hash. Otherwise, you are going to have to implement a bunch of fine-grained rules to re-format the internal contents of the text string for consistency. If you see a common pattern, like " :", replace it with ":" (removing the leading space); you can do this in addition to lower-cased hashes to reduce dupes even further. Regex is your friend!
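As a sketch of that cleaning step (the `normalize` and `dedupe_key` helper names and the specific regex rules are my own illustration, not an established recipe):

```python
import hashlib
import re

def normalize(text):
    """Clean a string before hashing/embedding: lower-case, trim outer
    whitespace, and fix a couple of common formatting quirks."""
    text = text.strip().lower()
    text = re.sub(r"\s+:", ":", text)  # " :" -> ":" (drop the leading space)
    text = re.sub(r"\s+", " ", text)   # collapse runs of whitespace
    return text

def dedupe_key(text):
    """Hash of the normalized text, used as the dedup lookup key."""
    return hashlib.sha3_256(normalize(text).encode()).hexdigest()
```

With this, "  Mouse " and "mouse" collapse to the same key, so near-identical chunks that differ only in case or stray whitespace no longer produce duplicates.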


Nice, thanks for sharing! I actually worked on duplicate issues in the past and used this heuristic:

def string_similarity(
    str1: str, str2: str, substring_length: int = 2, case_sensitive: bool = False
) -> float:
    """Calculate similarity between two strings by counting shared
    substrings of length `substring_length` (a Dice-coefficient-style
    measure over character n-grams).
    Computing time: O(n).

    :param str1: First string to match
    :param str2: Second string to match
    :param substring_length: Optional. Length of substring to be used in calculating similarity. Default 2.
    :param case_sensitive: Optional. Whether you want to consider case in string matching. Default False.
    :return: Number between 0 and 1, with 0 being a low match score.
    """
    if not case_sensitive:
        str1 = str1.lower()
        str2 = str2.lower()

    if len(str1) < substring_length or len(str2) < substring_length:
        return 0

    # Count occurrences of each substring of str1.
    m = {}
    for i in range(len(str1) - (substring_length - 1)):
        substr1 = str1[i : substring_length + i]
        m[substr1] = m.get(substr1, 0) + 1

    # Count substrings of str2 that also appear in str1, consuming each match.
    match = 0
    for j in range(len(str2) - (substring_length - 1)):
        substr2 = str2[j : substring_length + j]
        count = m.get(substr2, 0)

        if count > 0:
            match += 1
            m[substr2] = count - 1

    return (match * 2) / (len(str1) + len(str2) - ((substring_length - 1) * 2))

# Example: string_similarity("mouse", "house") -> 0.75 (3 shared bigrams out of 4 each)

The problem with “semantic similarity” is that computing similarity using embeddings in order to avoid re-computing embeddings seems wrong :smiley:

I could easily implement duplicate filtering using heuristics though

I suppose a middle solution would be a very fast model specialized for similarity check that runs locally or in a microservice

Interesting choice. I am not too familiar with this method, but I see that its strength is small strings (compared to Levenshtein distance).

Is this not what doing a similarity check with embeddings is? Thinking about it more, it seems slightly counter-productive to compare a potential new string with every single other string to determine if it’s a duplicate, when vector databases do all of this very efficiently. I have never pushed the limits of processing power, but I have never had an issue with duplicates, except from small strings, which are usually semantically worthless anyway when there’s no context attached. As @curt.kennedy has mentioned, I focus more on “cleaning” the string before processing it, as there are typos and simply a hundred ways to say the same thing.

Even using text-embedding-ada-002 isn’t too bad. A string such as:

Hello my name is Frank and I demand services. My company sells “devil shoes”. The shoes without a sole! Instead of simply buying a shoe that works out the box, people need to buy a sole that deteriorates (because it is made with mushrooms!) and requires a monthly subscription. How can you benefit me

only costs $0.000026

I imagine it really depends on what you plan on doing with the cached information.

Another thing you can try, and it isn’t optimal: if you find yourself embedding lots of similar things that you can’t seem to clean up, take the embedding and check the dot-product (cosine similarity) against all your previous embeddings. If it is farther than 0.0001 (or whatever threshold) from all of them, you consider it new information and add it to the embedding database; otherwise you consider it similar and discard it.

Like I said, not as efficient as Clean → Hash → Lookup → Decide, but extreme consequences sometimes warrant extreme preventative measures. I would consider this a last-ditch effort.
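A minimal sketch of that threshold check, assuming unit-length vectors so the dot product equals cosine similarity and 1 minus the dot product is the distance being compared to the threshold (the `is_new` helper name is my own):

```python
import numpy as np

def is_new(embedding, stored, threshold=0.0001):
    """Return True if `embedding` is farther (in cosine distance) than
    `threshold` from every vector in `stored`, i.e. counts as new info.

    Assumes all vectors are unit length, so dot product == cosine similarity.
    """
    if len(stored) == 0:
        return True
    sims = np.asarray(stored) @ embedding  # dot product vs. every stored row
    return float(np.max(sims)) < 1.0 - threshold
```

In practice `stored` would be the matrix of all previously kept embeddings; anything that fails `is_new` gets discarded as a near-duplicate.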


Yes, agreed.

In terms of efficiency, usually running a complete database cycle on every request is not the way to go.
Perhaps an occasional clean-up tool?


@RonaldGRuckus That would work too, just a cron job running every X hours or days on the new inputs. And a deep scan every month or two to revisit the whole database.

Seeing how much garbage was collected and removed in one big cleanup would be so satisfying as well. My favorite part of database & data-transfer optimization is reducing the load and counting up how much (as of now, pennies) I am saving.


Hi @wfhbrian

I am really overwhelmed by the amount of information I have received about this; I don’t know if I will be able to assimilate it all.

I am not an expert in AI, nor in machine learning. I have done a little R language course for data analysis but I am not proficient in statistics or related algorithms.

Right now what I am using in my company is SaaS (Search as a Service) through Azure: in a storage account I load all the documents that I want to query, which in a first test amount to about 180,000 files spread across thousands of folders and subfolders. Once they are stored, you create an index and an indexer, and you can then query all that information without further ado.

Azure also has tools to create chatbots using its Language Studio service, but that doesn’t work for me because you have to add the files by hand, one by one, and that task would be unthinkable.

I’ve been using ChatGPT for months in my daily life, and I have seen its potential: the usefulness of being able to query all that information in a much more human way, using natural language, and above all being able to make queries that return not just the documents that include the searched term, but an actual answer to a given question. I understand that ChatGPT could be a solution as you propose, but now my big question is how to set all this up.

First of all, I understand that I need to learn Python, since I come from the .NET world. In addition, I don’t know very well what the first step, creating the embeddings, consists of. From what I have been able to read, I think the most feasible solution is the one that @curt.kennedy describes, so I will try to study everything more thoroughly so that I at least know what to ask.

Thanks for the help

I think this will soon change. Microsoft is on a tear to reinvent all of its services in ways that conceal the AI DevOps that we all struggle with. The right step (for you) might be no step at all. You have a sizeable footprint and a perfect test bed for the new Microsoft AI beta - perhaps you should petition them to join their beta program.

The index is probably an inverted Elastic-like approach. It may make sense to simply change (or augment) the indexing approach to use embeddings; then build only the ChatGPT UI to blend similarity hits with GPT. This ChatGPT client project comes to mind for the UX as does this one. This approach may represent a shorter gap between what you now have and where you want to be.

Hi @wfhbrian
I have been busy with the delivery of a project and have not been able to follow this topic. Now I am returning to it, and after reading documentation on vector embeddings, vector databases, similarity search, and some other concepts, I seem to remember that you offered to have a talk in which you could advise me on this matter. My English is very poor, so if you like we can continue discussing this consultancy privately by mail.

Hi @curt.kennedy
First of all thanks for your help.

These days I am reading a lot of documentation, especially from the Pinecone website, about embeddings, and everything points to the first thing I have to do being to generate these vector embeddings from the text that I want to be queryable. If I have understood you correctly, must I first convert the content of the different Word, Excel, and PDF files into plain text? That is, should I generate a .txt file for each source file? And then, what tools can I use to generate the vectors?