This is what I’m doing. The database doesn’t matter, just have one, and search locally in memory on the vectors, and use the UUID to index back into the database to get the text. That’s basically it.
Also keep your embedding dimensions down to reduce sharding. So use ada-002, not Curie like the OP above mentioned. Keep the dims low enough to give good performance, but not so small that they mis-characterize the content being embedded.
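A minimal sketch of that pattern in Python/numpy. The `db` dict, the `add`/`search` helpers, and the toy 2-dim vectors are all hypothetical stand-ins: any real database replaces the dict, and real embeddings replace the toy vectors.

```python
import uuid
import numpy as np

db = {}            # stands in for any database: uuid -> text
ids = []           # parallel list of uuids for the in-memory index
vectors = []       # embeddings kept in memory for search

def add(text, embedding):
    key = str(uuid.uuid4())
    db[key] = text              # the text lives in the database
    ids.append(key)
    vectors.append(embedding)   # the vector lives in memory

def search(query_vec):
    # Unit vectors: cosine similarity is just a dot product.
    sims = np.array(vectors) @ np.asarray(query_vec)
    key = ids[int(np.argmax(sims))]
    return db[key]              # uuid indexes back into the database

add("about cats", np.array([1.0, 0.0]))
add("about dogs", np.array([0.0, 1.0]))
```

The search never touches the database; only the final uuid-to-text lookup does.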
The reason yours is slower is that you are using a library coded for the general case. With OpenAI embedding vectors, you only need to take the dot product; there is no need to divide by the norm (length) of each vector, since that value is always 1.
That division is unnecessary computation and is slowing you down. This is the danger of libraries: they aren't always optimized for speed.
For speed, you need to interrogate the algorithm and minimize the number of multiplies. This is why I also list the Manhattan metric in the comments of the code I posted above: it is entirely multiply-free if you drop the normalization for the vector dimension. It is your fastest option, and may be required for ridiculous scales of billions or trillions of embeddings.
But at a minimum, drop the denominators when computing cosine similarity on unit vectors, and see where that speed lands you. Dropping them leaves you with roughly 1/3 of the computation: if a search took 3 seconds before, it should now take about 1 second.
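On unit vectors the two formulas give identical results; a quick numpy sanity check under that assumption (random vectors normalized to unit length, as ada-002 embeddings are):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=1536)       # 1536 dims, like ada-002
b = rng.normal(size=1536)
a /= np.linalg.norm(a)          # force unit length
b /= np.linalg.norm(b)

full = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # general case
fast = np.dot(a, b)             # denominators are 1, so drop them

print(abs(full - fast))         # vanishingly small
```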
@manyapps @curt.kennedy
In your MySQL setup:
How many rows do you have? How do you index the database with vectors? How do you search for cosine similarity? Thanks!
The cosine similarity search is done in memory, not in the database. So search in memory, then go back to the database to get the top text hits.
The database I use is DynamoDB.
You can have unlimited rows (caveats below).
For 1 second of latency, you chunk your memory into shards of 400,000 embeddings each. Realistically, per account, you can have 500 of these running at the same time, so the realistic high end for one instance is 200 million rows. But this is per search: you can rotate your data instantly (**) for another 200-million-row search, while all of your data (~trillions of rows) stays in a single database.
So you create in-memory shards for search, and when you find what you are looking for, you retrieve the correlated text from the database. To do the 200 million row case, in 1 second (**), you need a layer that can async the searches, so use another DynamoDB table backed with a lambda to procure the final answer.
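A single-process sketch of that shard-and-reduce shape in numpy. The shard sizes are toy numbers, and a plain loop stands in for the concurrent Lambda layer; only the search-then-reduce logic is the point here.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, SHARD_SIZE, N_SHARDS = 64, 1000, 3   # toy sizes; real shards hold ~400k rows

# Each shard is an in-memory matrix of unit vectors.
shards = []
for _ in range(N_SHARDS):
    m = rng.normal(size=(SHARD_SIZE, DIM))
    shards.append(m / np.linalg.norm(m, axis=1, keepdims=True))

def search_shard(shard, q):
    sims = shard @ q                      # dot products only: unit vectors
    row = int(np.argmax(sims))
    return float(sims[row]), row

q = rng.normal(size=DIM)
q /= np.linalg.norm(q)

# Fan out over shards (concurrently in production), then reduce to one winner.
hits = [(sim, s, row)
        for s, m in enumerate(shards)
        for sim, row in [search_shard(m, q)]]
best_sim, best_shard, best_row = max(hits)
```

In production each `search_shard` call would run in its own worker, and the reduce step is what the Lambda-backed layer would do.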
(**) This is all theoretical high end estimates, so don’t be surprised if there are a few more seconds added for “reality”.
If you have a few hundred to a few thousand embeddings, and don’t want to use the cloud, then do the whole thing in memory, and skip the database.
Thank you! In my case I will work locally with fewer than 100k vectors.
I will search for similarities with this PHP code, loading all the vectors into an array.
function cosine_similarity($vector1, $vector2) {
    $dot_product = 0.0;
    $norm1 = 0.0;
    $norm2 = 0.0;
    // Check that the two vectors have the same size
    if (count($vector1) !== count($vector2)) {
        throw new Exception("vectors are different");
    }
    // Accumulate the dot product and the squared norms in one pass
    for ($i = 0; $i < count($vector1); $i++) {
        $dot_product += $vector1[$i] * $vector2[$i];
        $norm1 += $vector1[$i] * $vector1[$i];
        $norm2 += $vector2[$i] * $vector2[$i];
    }
    // Cosine similarity = dot product divided by the product of the norms
    return $dot_product / (sqrt($norm1) * sqrt($norm2));
}
For cosine similarity on unit vectors you don’t need the squares or the square roots. I’m assuming unit vectors, since ada-002 embeddings are all unit vectors. So you would only need your dot_product line, and return that (and you need to return the index of the max, see below).
BUT … think about what you are doing. This is only searching for similarity on vectors, but you need text as the output, vectors are meaningless to the human or LLM for summary. So you need the index of where the max is, so that you can locate the text behind the vector.
Also, I’m not sure how fast PHP is, but I use numpy under the hood for speed. This is another consideration.
PS … It also looks like your code would end up returning the cosine similarity of only the last vector in the list, since you aren’t checking for any max condition anywhere. I would go back to the drawing board and think about what the code is really doing. I’m also not seeing how you iterate over the entire set of 100k vectors; what you’ve shown is just a computation between two vectors.
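To make that outer loop concrete, here is a numpy sketch (with hypothetical random data, where the query is planted as a known row so the expected winner is obvious) of scoring the whole set at once and keeping the index of the max:

```python
import numpy as np

rng = np.random.default_rng(2)
n, dim = 1000, 256                        # stand-in for your ~100k vectors
vecs = rng.normal(size=(n, dim))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

query = vecs[42]                          # plant the query as a known row
sims = vecs @ query                       # all dot products in one shot, no norms
best = int(np.argmax(sims))               # this index is what locates the text
```

The index `best` is the thing you carry back to the database to fetch the text; the similarity value itself is secondary.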
I’m following this: https://learn.microsoft.com/en-us/azure/ai-services/computer-vision/how-to/image-retrieval#calculate-vector-similarity
First I’ll get vectors for images from the Azure OpenAI API.
I’ll save the vectors in the mysql database with a reference to the image vectorized.
Then I will vectorize the query string used to retrieve the images, loop over the vectors in the table to calculate similarity, and get a score. In this loop I will use the cosine_similarity() function.
According to the documentation, the formula seems correct to me; look at the C# function in the link. If you still believe I’m wrong, I’ll check it again.
My purpose is to run a test: at the moment I do something similar with full-text search, but I have to rely on the tags I have for each image, which are sometimes ambiguous, wrong, or approximate.
Thanks!
Sure, the MS code is just computing the general case between two vectors. The formula is correct, but it may be overkill for unit vectors. I’m thinking speed here; remember the 200 million row case I mentioned? There, I would even consider a Manhattan metric, with no multiplies.
So you need a loop one layer up that iterates over all the vectors against the current vector you are trying to match. Try doing this in memory for speed; if you scan your DB, it will just be slow.
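For comparison, the Manhattan (L1) ranking mentioned above can be sketched in numpy as below. The metric itself involves only subtractions, absolute values, and additions (the data here is hypothetical, with a known row planted as the query):

```python
import numpy as np

rng = np.random.default_rng(3)
vecs = rng.normal(size=(500, 64))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
query = vecs[7]                           # plant a known nearest neighbor

# L1 distance per row: no multiplies anywhere in the metric.
dists = np.abs(vecs - query).sum(axis=1)
best = int(np.argmin(dists))              # smallest distance = best match
```

Note the ranking flips direction: with a distance you take the argmin, where with cosine similarity you take the argmax.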
But nonetheless, the more you get your hands dirty, the more you learn. So do what feels right and report back here what worked and didn’t work.
It’s also open source. The tools they offer are incredible. The client libraries are very easy to use and work out of the box with numerous embedding models and generative AI. I run embeddings and Weaviate locally and it’s been great.
Their documentation is lacking but for tinkering it is a lot of fun (I have spent countless hours using their Explore function for fun).
You can self-host within actual minutes using their configuration tool:
Yep, that too… We also have a huge website with 3k posts and 2M monthly users, where we added “related posts” based on the currently displayed post plus user history… a hell to maintain/operate if it were not for Weaviate Cloud Services, for a mere 25–40 USD/month.
Oh, no doubt. For a consistent, steady, production-ready database the Cloud Services are hands down the best. I love that they introduce new/potentially broken features to the self-hosted branch first before pushing it to the Cloud Services. Makes me a happy guinea pig.
Their slack channel is amazing for help as well.
I wonder what Pinecone would charge for that? They almost got me with their “first-time free” tactic
I configured mine for slow writing (to all clusters) and fast reading (from at least one cluster).
Writing is done in background jobs. For queries I use raw text (Weaviate gets the vector for me using my keys and returns results based on it). I never even tested latency, as I never had issues with retrievals… but I bet it’s damn good compared to others, because I explicitly selected a cluster location close to my production server.