Vector Duplication In VDBs

Hey all.

I’m creating a memory module that uses vector databases (VDBs) like Pinecone. I have the module built, but I have a unit test that I run quite often, and it adds the same vectors to the Pinecone index every single time. My assumption was that it would not add embeddings that were exactly the same, or that there would be a method to check for duplicates first. I couldn’t find any documentation or discussion on how to deal with this. After thinking about it, I realized that there are probably two reasons why you can’t just check for an existing key.

  • First, there may be a bit of noise or non-determinism in embedding generation, so in that sense embeddings are not like hashes that you can rely on for exact matches.
  • Second, I’m not sure of the math involved in embedding creation, and it’s possible that a high collision rate would make a direct check of the embedding unreliable.

I have a temporary solution: I’ll cleanse the formatting and whitespace from the text/documents and create a hash from that to use as the namespace. Pinecone has no limit on namespaces, so that seems like the best course.
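
For reference, here is a rough sketch of that cleanse-and-hash step in Python (standard library only; the exact normalization rules here are just my first pass):

```python
import hashlib
import re

def text_fingerprint(text: str) -> str:
    """Collapse formatting/whitespace, then hash the result."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# The fingerprint becomes the namespace (or could become the vector ID),
# so re-running the unit test maps the same document to the same place.
namespace = text_fingerprint("   The   Project Gutenberg eBook of ...  ")
```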

My question, though, is whether there is a way to deal with this in Pinecone. I’m building a framework and I’ll be adding ChromaDB and Weaviate interfaces, so maybe I’ll run across a solution in those eventually, but I have yet to see anything for Pinecone.

Additionally, if anyone can correct my thinking about this problem with additional context, it would be greatly appreciated. Thanks.

2 Likes

Did you try just querying before adding the new entry? You should be able to give it a top_k of 1 and if it comes back with a score of 1.0 it already exists…
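
Something along these lines (a rough sketch against the Pinecone Python client; `embed()` is a stand-in for however you generate your embeddings, and I’m assuming `index` is already initialized):

```python
# Query-before-upsert check. `index` is an initialized Pinecone Index and
# `embed(text)` is a hypothetical helper that returns the text's embedding.
def already_exists(index, text, threshold=1.0):
    matches = index.query(vector=embed(text), top_k=1)["matches"]
    return bool(matches) and matches[0]["score"] >= threshold
```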

1 Like

Thanks @stevenic. No, I have not. I’m using LangChain, and the way they have set up their functions semi-obfuscated that possibility from me, as I’m not familiar with the base Pinecone capabilities. It looks like LangChain supports that ability in a roundabout way. I’ll implement it, but I have a feeling I’ll eventually reimplement it more directly myself, with the sole purpose of a pre-commit check. Just bootstrapping with libraries and modules right now.

Thanks for the heads up though, appreciate it.
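
For anyone bootstrapping the same way, here is roughly the pre-commit check I’m planning through LangChain (a sketch only; `vectorstore` is assumed to be LangChain’s Pinecone wrapper, and the 0.99 threshold is provisional):

```python
# Sketch of a pre-commit check via LangChain's Pinecone vector store,
# e.g. vectorstore = Pinecone.from_existing_index("my-index", embeddings).
def add_if_new(vectorstore, text, threshold=0.99):
    hits = vectorstore.similarity_search_with_score(text, k=1)
    if hits and hits[0][1] >= threshold:
        return False  # a close-enough match is already stored; skip the upsert
    vectorstore.add_texts([text])
    return True
```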

@stevenic

It seems that, in Pinecone (cosine similarity) via LangChain at least, you don’t get a flat “1.0” back. Here is a list of scores and the related content (first 36 chars) in the DB. I’ll just use >= 0.99 for right now and investigate alternative similarity options.

This is not in the database. 0.721304357
The Project Gutenberg eBook of The 0.999768734
such a one be dismissed! 16. Whil 1.00746632
spears and shields, protective mantl 1.00490797
an army. This causes restlessness in 1.00168884
Measurement; Calculation to Estimati 1.00581348
were like unto rolling logs or stone 1.00166762
19. Knowing the place and the time o 1.00571954
advantage, the leaders of all your t 1.0039736
an army that is returning home. 3 1.00100482
in salt-marshes. 9. In dry, level 1.00055492
the officers are angry, it means tha 1.00754535
wait for him to come up. 11. If t 1.00474679
halfway towards victory. 30. Henc 1.004462
their uttermost strength. 24. Sol 1.00315511
44. When you penetrate deeply into a 1.00529444
XII. THE ATTACK BY FIRE 1. Sun 1.00327432
height of inhumanity. 3. One who 1.00327837
be renamed. Creating the works fr 1.00279188
works in compliance with the terms o 1.00338161
version posted on the official Proje 1.00542641
fees. YOU AGREE THAT YOU HAVE NO REM 1.00573575
501(c)(3) educational corporation or 1.00188982

These embedding engines do not produce the exact same vector each time; there is variation, or noise, on each call, so an exact 1.0 is a tall order.

Having said this, what is deterministic is your hash: strip all leading and trailing whitespace, then hash the text. Check for the existence of this hash, not the vector, before proceeding.
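
In Pinecone terms, one way to make the hash the thing you check is to use it as the vector ID and fetch by ID before upserting. A rough sketch (the ID scheme and the `embed()` helper are my own placeholders):

```python
import hashlib

# Use the hash of the stripped text as the vector ID, then fetch by ID.
# `index` is an initialized Pinecone Index; `embed(text)` is a placeholder.
def upsert_if_absent(index, text):
    key = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
    if index.fetch(ids=[key])["vectors"]:
        return key  # hash already present, nothing to do
    index.upsert(vectors=[(key, embed(text), {"text": text})])
    return key
```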

2 Likes

Interesting, I would have expected the same exact string to generate the same exact vector. Good to know that’s not always the case.

2 Likes

I have worked with floating point numbers long enough to know that the further out the decimal places go, the more you are looking at essentially random numbers, and it is often network related. So to be safe, use hashes, or multiply up modestly (and round) and use integer representations instead to get a true lock.
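
For example, a toy sketch of the multiply-and-round idea (the scale factor is arbitrary):

```python
# Compare quantized integer copies of two vectors instead of raw floats.
def quantize(vector, scale=10_000):
    return tuple(round(x * scale) for x in vector)

def same_embedding(a, b, scale=10_000):
    return quantize(a, scale) == quantize(b, scale)
```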

1 Like

Yeah, I figured there would be precision errors if I went that route and accounted for it. But I’m seeing differences in the thousandths place, so I suspect it is the built-in embedding noise you were talking about. I’ll use a hybrid of the threshold check @stevenic suggested and the hashing method.

Thanks for the help.
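
In case it helps anyone later, this is roughly what that hybrid will look like in my module (a sketch only; `index` and `embed()` stand in for my own wiring, and the threshold is provisional):

```python
import hashlib

# Hybrid check: exact hash lookup first, cosine-score threshold second.
def commit_memory(index, text, threshold=0.99):
    key = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
    if index.fetch(ids=[key])["vectors"]:
        return False                      # exact duplicate of the normalized text
    vector = embed(text)
    matches = index.query(vector=vector, top_k=1)["matches"]
    if matches and matches[0]["score"] >= threshold:
        return False                      # near-duplicate by similarity score
    index.upsert(vectors=[(key, vector, {"text": text})])
    return True
```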

At any rate, if you have 95% or greater similarity then the case is probably covered.

1 Like

Yeah, I’m not sure what the scores being returned actually are. The index itself on Pinecone is set to cosine similarity, which should be in the range -1 to 1, but the results are coming back above 1, so I don’t know what they are actually returning. I’ve just added a check for anything above 0.999, though; that should be good enough until I know more specifics about how they are doing things.

That is significant. I’ve heard this happens when comparing Azure vs. OpenAI on the same model, but this is significant variation within the same model and service. OK, paranoid now :alien:

1 Like

Maybe this is related to OpenAI deciding to change temperature ranges to be between 0.0 and 2.0. Because sometimes you just need your model’s temperature to be 1.0 higher…