This is what I’m doing. The database doesn’t matter, just have one, and search locally in memory on the vectors, and use the UUID to index back into the database to get the text. That’s basically it.
Also keep your embedding dimensions down to reduce sharding. So use ada-002, not Curie like the OP above mentioned. Keep the dims low enough to give good performance, but not so small that they mis-characterize the content being embedded.
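A minimal sketch of that pattern in Python/numpy. The `db` dict, the `add`/`search` helpers, and the toy 2-dim vectors are all hypothetical stand-ins: any real database replaces the dict, and real embeddings replace the toy vectors.

```python
import uuid
import numpy as np

db = {}            # stands in for any database: uuid -> text
ids = []           # parallel list of uuids for the in-memory index
vectors = []       # embeddings kept in memory for search

def add(text, embedding):
    key = str(uuid.uuid4())
    db[key] = text              # the text lives in the database
    ids.append(key)
    vectors.append(embedding)   # the vector lives in memory

def search(query_vec):
    # Unit vectors: cosine similarity is just a dot product.
    sims = np.array(vectors) @ np.asarray(query_vec)
    key = ids[int(np.argmax(sims))]
    return db[key]              # uuid indexes back into the database

add("about cats", np.array([1.0, 0.0]))
add("about dogs", np.array([0.0, 1.0]))
```

The search never touches the database; only the final uuid-to-text lookup does.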
The reason yours is slower is that you are using a library coded for the general case. With OpenAI embedding vectors, you only need to take the dot product; there is no need to divide by the norm (length) of each vector, since that value is always 1.
That division is unnecessary computation and is slowing you down. This is the danger of libraries: they aren't always optimized for speed.
For speed, you need to interrogate the algorithm and minimize the number of multiplies. This is why I also list the Manhattan metric in the comments of the code I posted above: it is entirely multiply-free if you drop the normalization for the vector dimension. It is your fastest option, and may be required for ridiculous scales of billions or trillions of embeddings.
But at a minimum, drop the denominators when computing cosine similarity on unit vectors, and see where that speed lands you. Dropping them leaves you with roughly 1/3 of the computation: if a search took 3 seconds before, it should now take about 1 second.
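On unit vectors the two formulas give identical results; a quick numpy sanity check under that assumption (random vectors normalized to unit length, as ada-002 embeddings are):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=1536)       # 1536 dims, like ada-002
b = rng.normal(size=1536)
a /= np.linalg.norm(a)          # force unit length
b /= np.linalg.norm(b)

full = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # general case
fast = np.dot(a, b)             # denominators are 1, so drop them

print(abs(full - fast))         # vanishingly small
```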
@manyapps @curt.kennedy
In your MySQL setup:
How many rows do you have? How do you index the database with vectors? How do you search for cosine similarity? Thanks!
The cosine similarity search is done in memory, not in the database. So search in memory, then go back to the database to get the top text hits.
The database I use is DynamoDB.
You can have unlimited rows (caveats below).
For 1 second of latency, you chunk your memory into shards of 400,000 embeddings each. Realistically, per account, you can have 500 of these running at the same time, so the realistic high end for one instance is 200 million rows. But this is per search: you can rotate your data instantly (**) for another 200-million-row search, while all of your data (~trillions of rows) stays in a single database.
So you create in-memory shards for search, and when you find what you are looking for, you retrieve the correlated text from the database. To do the 200 million row case, in 1 second (**), you need a layer that can async the searches, so use another DynamoDB table backed with a lambda to procure the final answer.
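A single-process sketch of that shard-and-reduce shape in numpy. The shard sizes are toy numbers, and a plain loop stands in for the concurrent Lambda layer; only the search-then-reduce logic is the point here.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, SHARD_SIZE, N_SHARDS = 64, 1000, 3   # toy sizes; real shards hold ~400k rows

# Each shard is an in-memory matrix of unit vectors.
shards = []
for _ in range(N_SHARDS):
    m = rng.normal(size=(SHARD_SIZE, DIM))
    shards.append(m / np.linalg.norm(m, axis=1, keepdims=True))

def search_shard(shard, q):
    sims = shard @ q                      # dot products only: unit vectors
    row = int(np.argmax(sims))
    return float(sims[row]), row

q = rng.normal(size=DIM)
q /= np.linalg.norm(q)

# Fan out over shards (concurrently in production), then reduce to one winner.
hits = [(sim, s, row)
        for s, m in enumerate(shards)
        for sim, row in [search_shard(m, q)]]
best_sim, best_shard, best_row = max(hits)
```

In production each `search_shard` call would run in its own worker, and the reduce step is what the Lambda-backed layer would do.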
(**) This is all theoretical high end estimates, so don’t be surprised if there are a few more seconds added for “reality”.
If you have a few hundred to a few thousand embeddings, and don’t want to use the cloud, then do the whole thing in memory, and skip the database.
Thank you! In my case I will work locally with fewer than 100k vectors.
I will search for similarities with this PHP code, loading all the vectors into an array.
function cosine_similarity($vector1, $vector2) {
    $dot_product = 0.0;
    $norm1 = 0.0;
    $norm2 = 0.0;
    // Check that the two vectors have the same size
    if (count($vector1) !== count($vector2)) {
        throw new Exception("vectors are different");
    }
    // Accumulate the dot product and the squared norms in one pass
    for ($i = 0; $i < count($vector1); $i++) {
        $dot_product += $vector1[$i] * $vector2[$i];
        $norm1 += $vector1[$i] * $vector1[$i];
        $norm2 += $vector2[$i] * $vector2[$i];
    }
    // Cosine similarity = dot product divided by the product of the norms
    return $dot_product / (sqrt($norm1) * sqrt($norm2));
}
For cosine similarity on unit vectors you don’t need the squares or the square roots. I’m assuming unit vectors, since ada-002 embeddings are all unit vectors. So you would only need your dot_product line, and return that (and you need to return the index of the max, see below).
BUT … think about what you are doing. This is only searching for similarity on vectors, but you need text as the output, vectors are meaningless to the human or LLM for summary. So you need the index of where the max is, so that you can locate the text behind the vector.
Also, I’m not sure how fast PHP is, but I use numpy under the hood for speed. This is another consideration.
PS … It also looks like your code would end up returning the cosine similarity of only the last vector in the list, since you aren’t checking for any max condition anywhere. I would go back to the drawing board and think about what the code is really doing. I’m also not seeing how you iterate over the entire set of 100k vectors; what you’ve shown is just a computation between two vectors.
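To make that outer loop concrete, here is a numpy sketch (with hypothetical random data, where the query is planted as a known row so the expected winner is obvious) of scoring the whole set at once and keeping the index of the max:

```python
import numpy as np

rng = np.random.default_rng(2)
n, dim = 1000, 256                        # stand-in for your ~100k vectors
vecs = rng.normal(size=(n, dim))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

query = vecs[42]                          # plant the query as a known row
sims = vecs @ query                       # all dot products in one shot, no norms
best = int(np.argmax(sims))               # this index is what locates the text
```

The index `best` is the thing you carry back to the database to fetch the text; the similarity value itself is secondary.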
I’m following this: https://learn.microsoft.com/en-us/azure/ai-services/computer-vision/how-to/image-retrieval#calculate-vector-similarity
First I’ll get vectors for images from the Azure OpenAI API.
I’ll save the vectors in the mysql database with a reference to the image vectorized.
Then I will vectorize the query string used to retrieve the images, loop over the vectors in the table to calculate similarity, and get a score. In this loop I will use the cosine_similarity() function.
According to the documentation, the formula seems correct to me; look at the C# function in the link. If you still believe I’m wrong, I’ll check it again.
My purpose is to run a test: at the moment I do something similar with full-text search, but I have to rely on the tags I have for each image, which are sometimes ambiguous, wrong, or approximate.
Thanks!
Sure, the MS code is just computing the general case between two vectors. The formula is correct, but it may be overkill for unit vectors. I’m thinking speed here; remember the 200 million row case I mentioned? There, I would even consider a Manhattan metric, with no multiplies.
So you need a loop one layer up that iterates over all the vectors against the current vector you are trying to match. Try doing this in memory for speed; if you scan your DB, it will just be slow.
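For comparison, the Manhattan (L1) ranking mentioned above can be sketched in numpy as below. The metric itself involves only subtractions, absolute values, and additions (the data here is hypothetical, with a known row planted as the query):

```python
import numpy as np

rng = np.random.default_rng(3)
vecs = rng.normal(size=(500, 64))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
query = vecs[7]                           # plant a known nearest neighbor

# L1 distance per row: no multiplies anywhere in the metric.
dists = np.abs(vecs - query).sum(axis=1)
best = int(np.argmin(dists))              # smallest distance = best match
```

Note the ranking flips direction: with a distance you take the argmin, where with cosine similarity you take the argmax.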
But nonetheless, the more you get your hands dirty, the more you learn. So do what feels right and report back here what worked and didn’t work.
It’s also open source. The tools they offer are incredible. The client libraries are very easy to use and work out of the box with numerous embedding models and generative AI. I run embeddings and Weaviate locally and it’s been great.
Their documentation is lacking but for tinkering it is a lot of fun (I have spent countless hours using their Explore function for fun).
You can self-host within actual minutes using their configuration tool:
Yep, that too… We also have a huge website with 3k posts and 2M monthly users, where we added “related posts” based on the currently displayed post plus user history… a hell to maintain/operate if it were not for Weaviate Cloud Services, for a mere 25–40 USD/month.
Oh, no doubt. For a consistent, steady, production-ready database the Cloud Services are hands down the best. I love that they introduce new/potentially broken features to the self-hosted branch first before pushing it to the Cloud Services. Makes me a happy guinea pig.
Their slack channel is amazing for help as well.
I wonder what Pinecone would charge for that? They almost got me with their “first-time free” tactic
I configured mine for slow writing (to all clusters) and fast reading (from at least one cluster).
Writing is done in background jobs. For queries I use raw text (Weaviate gets the vector for me using my keys and returns results based on it). I never even tested latency, as I never had issues with retrievals… but I bet it’s damn good compared to others, because I explicitly selected a cluster location close to my production server.