BLIP/CLIP image/text embedding cosine similarity returning poor results

richter · January 18, 2023, 7:53pm

We are using BlipModel and BlipProcessor from the transformers library to generate embeddings for frames of a video. As an experiment, we checked whether embeddings generated locally for certain frames match the vectors generated by our remote server. The server downloads the video from S3, generates the frame embeddings, and stores them in a vector database that holds embeddings for every frame of every video we have uploaded. After taking the dot product of the two (local and remote) vectors for the same frame, we found a similarity of only ~89%. It should be >99%. It’s worth noting that we started with CLIPModel and CLIPProcessor and got the same results.
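One thing that may be worth ruling out in the dot product comparison described above: a raw dot product only equals cosine similarity when both vectors have unit length. The sketch below is a minimal, library-free illustration of that difference. The vectors `local_vec` and `remote_vec` are made-up stand-ins, not real BLIP or CLIP outputs, and nothing here is confirmed to be the cause of the ~89% figure.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical stand-ins for the local and remote embeddings of one frame.
local_vec = [0.12, -0.48, 0.31, 0.80]
remote_vec = [0.24, -0.96, 0.62, 1.60]  # same direction, twice the length

# The raw dot product depends on vector length, so it is not a percentage.
raw_dot = sum(x * y for x, y in zip(local_vec, remote_vec))
print("raw dot product:", raw_dot)

# Cosine similarity divides out the lengths and compares direction only.
print("cosine similarity:", cosine_similarity(local_vec, remote_vec))
```

If the cosine similarity computed this way is still well below 1 for the same frame, the difference is in the vectors themselves rather than in how they are compared.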

We were able to verify that generating an embedding for the same textual description locally and remotely resulted in a vector similarity of >99.9%.

Happy to provide some code samples upon request. We are using a vector database to find the closest matches to a textual description of some frame, and the “nearest neighboring frames” returned by cosine similarity are not accurate at all: they show a completely different image from the one we are describing. We know these models are more than capable of performing these tasks. What are we doing wrong here?
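A brute-force search over a handful of frames can serve as a reference to compare against what the vector database returns. The sketch below assumes the text and frame embeddings live in the same embedding space and have the same dimension. The names `normalize`, `top_k_frames`, `frames`, and `query_vec` are hypothetical, and the numbers are toy values rather than model outputs.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k_frames(text_vec, frame_vecs, k=3):
    """Return (frame_id, cosine similarity) pairs for the k closest frames."""
    query = normalize(text_vec)
    scored = []
    for frame_id, vec in frame_vecs.items():
        unit = normalize(vec)
        score = sum(q * v for q, v in zip(query, unit))
        scored.append((frame_id, score))
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data: frame_003 has a large magnitude but points almost along the query.
frames = {
    "frame_001": [0.9, 0.1, 0.0],
    "frame_002": [0.0, 1.0, 0.2],
    "frame_003": [5.0, 0.4, 0.1],
}
query_vec = [1.0, 0.0, 0.0]

results = top_k_frames(query_vec, frames, k=2)
for frame_id, score in results:
    print(frame_id, round(score, 4))
```

If the brute-force ranking looks reasonable but the database ranking does not, the distance metric or normalization configured in the database is a likely place to look. If both rankings are poor, the embeddings themselves are the more likely problem.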

logankilpatrick · February 7, 2023, 7:36pm

Can you share code snippets for the different ways you are calling the code locally vs on the cloud?
