I’m looking at storing something in the ballpark of 10 billion embeddings to use for vector search and Q&A. I have a feeling I’m going to need a vector DB service like Pinecone or Weaviate, but in the meantime, while there isn’t much data, I was thinking of storing it in SQL Server, loading a table from SQL Server as a dataframe, and performing cosine similarity in that df. Anyone have experience doing that, or did you find a way to execute cosine similarity as a SQL query?
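For the interim approach, a minimal sketch of cosine similarity over a dataframe looks something like this (the table and column names are made up; in practice you'd load the frame with something like `pd.read_sql` from SQL Server):

```python
import numpy as np
import pandas as pd

# Stand-in for a dataframe loaded from SQL Server, one embedding per row.
df = pd.DataFrame({
    "doc_id": [1, 2, 3],
    "embedding": [[0.1, 0.9], [0.8, 0.2], [0.7, 0.7]],
})

def top_k_cosine(df, query_vec, k=2):
    """Rank rows by cosine similarity to a query vector."""
    matrix = np.vstack(df["embedding"].to_numpy())   # shape (n, d)
    q = np.asarray(query_vec, dtype=float)
    sims = matrix @ q / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(q))
    return df.assign(similarity=sims).nlargest(k, "similarity")

result = top_k_cosine(df, [0.1, 1.0])
print(result[["doc_id", "similarity"]])
```

Cosine similarity can also be expressed in plain SQL as a SUM of per-dimension products over a normalized (id, dim, value) table, but that gets unwieldy fast, so the dataframe route is usually simpler at small scale.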
Also wondering if anyone has tested Redis vs Pinecone or similar for their latency with high numbers of embeddings. If latency differed between the two, how much was the difference?
Finally, does anyone have any vector DB recommendations? Pinecone seems to be the gold standard, but I’m curious about how the alternatives stack up.
Do you need all the embeddings for every query? Some of our clients break their embeddings into categories and use a different database for each area.
E.g. different areas of law, different topics within a university, etc.
10 billion embeddings is a lot. If each one is 100 tokens long, you’d be encoding 1 trillion words/tokens. Is that correct?
Out of interest, how long is the text you are embedding for each entry? You may be able to combine entries in some way to get longer blocks of text, and therefore fewer embeddings, but it will depend on your use case.
We’re also breaking them into categories. There are a lot of docs just sitting around; we’re trying to make them accessible through semantic search and Q&A, and eventually to generate parts of documents.
We’re planning on encoding a large number of tokens; they probably average out to 200 tokens per embedding, but we haven’t done the initial doc scrape yet, so there’s no true number. I’m doing one paragraph per embedding, but if a paragraph is small enough, I append the previous paragraph to it to preserve the previous context. Some embeddings are entire pages if there is no paragraph break.
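That merge-small-paragraphs rule can be sketched roughly like this (using word count as a stand-in for a real tokenizer, and a made-up threshold, both of which are assumptions):

```python
def chunk_paragraphs(paragraphs, min_tokens=50):
    """One chunk per paragraph; if a paragraph is short, prepend the
    previous paragraph so the earlier context is preserved."""
    chunks = []
    for i, para in enumerate(paragraphs):
        if i > 0 and len(para.split()) < min_tokens:
            # Short paragraph: carry the previous one along with it.
            chunks.append(paragraphs[i - 1] + "\n\n" + para)
        else:
            chunks.append(para)
    return chunks
```

A real pipeline would count tokens with the embedding model's own tokenizer rather than splitting on whitespace.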
I’m leaning towards what you mentioned with breaking embeddings into different databases. Not sure how that’s going to scale to 10 billion records, but we shall see lol
Can you share more details on why you decided to build your own engine? Did you try any of the recommended solutions like Pinecone and get unsatisfactory results? What was your scale? Is your own solution already working?
Our current solution is working well. We have several hundred clients with their own sets of data. Some of the sets are very large.
We did look at Pinecone (and several other options).
Initially, we didn’t want to set up infrastructure for the beta. Now that it has gone live, we are so pleased with the performance that we decided to scale the engine instead of replacing it.
Because of the way it is coded, we can deploy and move client data between servers within seconds. This is ideal for load balancing etc.
It has been developed and tightly optimized for performance and minimal footprint.
We also keep latency minimal by hosting it in the same data center as our applications.
One of our (possibly unique) requirements was the ability to turn sections of the embedded data on and off for individual searches without losing the embedded vectors. I know we could do this with categorization, etc., but we wanted a way to disable small groups or single vectors within a larger set. It was important that we could turn the vectors on again without having to rerun the embedding.
We do have a couple of massive clients coming on board. We may look at other engines if our tool can’t handle the load. (But I suspect OpenAI’s API bottleneck will be our first challenge to resolve)
We allow some of our users to see the contexts that go into making up an answer.
When they see the contexts, we wanted a way they could mark items not to be considered. This might be from a Bibliography, index, table of contents, or notes section.
We do this by marking the vector that generated the context as inactive.
Several other use cases require multiple records to be ignored for a single query. (E.g. eliminating a specific document made up of multiple embeddings from a dataset. It may be irrelevant or have a bias we want to eliminate for that query.)
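A minimal sketch of that on/off idea, assuming a toy in-memory store (the actual engine described here is C# with native file handling, so this is only illustrative):

```python
import numpy as np

class ToggleableVectorStore:
    """Toy store where vectors can be deactivated without being deleted,
    so they can be turned back on with no re-embedding."""

    def __init__(self):
        self.ids, self.vectors, self.active = [], [], []

    def add(self, vec_id, vector):
        self.ids.append(vec_id)
        self.vectors.append(np.asarray(vector, dtype=float))
        self.active.append(True)

    def set_active(self, vec_id, flag):
        # The vector itself is untouched; only the flag changes.
        self.active[self.ids.index(vec_id)] = flag

    def search(self, query, k=1):
        q = np.asarray(query, dtype=float)
        scored = [
            (vid, float(v @ q / (np.linalg.norm(v) * np.linalg.norm(q))))
            for vid, v, on in zip(self.ids, self.vectors, self.active)
            if on  # inactive vectors are simply skipped
        ]
        return sorted(scored, key=lambda s: -s[1])[:k]
```

Deactivating a bibliography or index vector removes it from results; flipping the flag back restores it immediately, with no re-embedding call.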
That’s pretty ingenious; I was looking for a solution like this as well. I was working with a patent: I broke it up into a bunch of smaller context chunks, used Curie to summarize each chunk, and finally passed the summarized chunks back to Davinci as context. The issue I was having was that if I asked a yes/no question (is this patent about x topic?) and the summaries were mostly “no”, even if one summary was a “yes”, the answer would come back “no”. Having users manually select what is most relevant fixes that issue!
We created our vector database engine and vector cache using C#, buffering, and native file handling.
It is tightly coupled with Microsoft SQL.
We did this so we don’t have to store the vectors in the SQL database, but we can persistently link the two together. Because of this, we can have vectors with unlimited metadata (via the engine we created).
E.g., if we get a semantic hit, we can get the original text, citations, etc. within the same query.
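A toy sketch of that linkage, using Python with sqlite3 standing in for the C#/SQL Server setup (all names and data here are made up): the vectors live outside the relational DB, keyed by the same id as the SQL row, so one semantic hit can pull back arbitrary metadata in a single lookup.

```python
import sqlite3
import numpy as np

# Relational side: text and metadata live in SQL, keyed by id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, text TEXT, citation TEXT)")
conn.execute("INSERT INTO docs VALUES (1, 'Claim 1 ...', 'US-123'), (2, 'Claim 2 ...', 'US-456')")

# Vector side: stored outside the DB (a dict here; a native file store in practice),
# sharing the same ids so the two stay persistently linked.
vector_store = {1: np.array([0.9, 0.1]), 2: np.array([0.2, 0.8])}

def semantic_hit(query):
    """Find the best-matching vector, then fetch its SQL metadata by shared id."""
    q = np.asarray(query, dtype=float)
    best = max(vector_store, key=lambda i: float(
        vector_store[i] @ q / (np.linalg.norm(vector_store[i]) * np.linalg.norm(q))))
    row = conn.execute("SELECT text, citation FROM docs WHERE id = ?", (best,)).fetchone()
    return best, row
```

The point of the split is that the DB row can carry as many metadata columns as you like without the vector store ever needing to know about them.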