BLIP/CLIP image/text embedding cosine similarity returning poor results

We are using transformers’ BlipModel and BlipProcessor to generate embeddings for frames of a video. As an experiment, we checked whether generating the embeddings for certain frames locally produces the same vectors as our remote server, which downloads a video from S3, generates the frame embeddings, and stores them in a vector database holding embeddings for every frame of every video we upload. Taking the dot product of the two (local and remote) vectors for the same frame, we get a similarity of only ~89%, when it should be >99%. It’s worth noting we started with CLIPModel and CLIPProcessor and got the same results.
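Roughly, the local side of the comparison looks like the sketch below (shown with CLIPModel/CLIPProcessor since we tried both; the BLIP path is analogous). The checkpoint name, the frame filename, and the random "remote" vector are only placeholders for illustration:

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Checkpoint name is illustrative; we swap in our actual model ID.
CKPT = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(CKPT).eval()
processor = CLIPProcessor.from_pretrained(CKPT)

def embed_frame(image: Image.Image) -> torch.Tensor:
    """Embed one video frame into the joint image/text space."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        features = model.get_image_features(**inputs)
    # L2-normalize so a plain dot product is a cosine similarity.
    return F.normalize(features, dim=-1).squeeze(0)

# Placeholder path; locally we decode real frames from the video file.
frame = Image.open("frame_0042.png").convert("RGB")
local_vec = embed_frame(frame)

# Placeholder for the vector the remote pipeline stored for the same frame;
# in reality this is fetched from the vector database.
remote_vec = F.normalize(torch.randn_like(local_vec), dim=-1)

# With our real local/remote vectors this prints ~0.89 instead of >0.99.
cosine_sim = torch.dot(local_vec, remote_vec).item()
print(f"local vs remote cosine similarity: {cosine_sim:.4f}")
```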

We were able to verify that generating an embedding for the same textual description locally and remotely resulted in a vector similarity of >99.9%.

Happy to provide some code samples upon request. We are using the vector database to find the closest matches to a textual description of some frame, and the "nearest neighboring frames" under cosine similarity are not accurate at all: the top results are frames of completely different scenes than the one we are describing. We know these models are more than capable of performing these tasks. What are we doing wrong here?
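For context, the retrieval side is roughly the following sketch (again with CLIP). The in-memory matrix of random vectors stands in for our vector database, and the frame IDs and query text are made up:

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

CKPT = "openai/clip-vit-base-patch32"  # illustrative checkpoint
model = CLIPModel.from_pretrained(CKPT).eval()
processor = CLIPProcessor.from_pretrained(CKPT)

def embed_text(description: str) -> torch.Tensor:
    """Embed a textual description into the same joint space as the frames."""
    inputs = processor(text=[description], return_tensors="pt", padding=True)
    with torch.no_grad():
        features = model.get_text_features(**inputs)
    return F.normalize(features, dim=-1).squeeze(0)

# Stand-in for the vector database: L2-normalized frame embeddings plus IDs.
# In production the nearest-neighbor search is done by the vector DB itself.
frame_ids = ["video1_frame_0010", "video1_frame_0020", "video2_frame_0005"]
frame_matrix = F.normalize(torch.randn(len(frame_ids), 512), dim=-1)  # placeholders

query = embed_text("a person riding a bicycle on a city street")
scores = frame_matrix @ query          # dot products = cosine similarities here
best = torch.topk(scores, k=3)

for score, idx in zip(best.values.tolist(), best.indices.tolist()):
    print(f"{frame_ids[idx]}: cosine similarity {score:.3f}")
```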

Can you share code snippets showing the different ways you are generating the embeddings locally vs. on the cloud?