The embedding endpoint is great, but the dimensions of the embeddings are way too high, e.g., Curie (4096 dimensions). Some databases can’t store vectors that large for production use, or can’t load them in a single query operation. What tools do you use to store a large number of text chunks (more than 100) and their corresponding embeddings, when they need to be frequently updated and queried?
I am simply storing them in a SQL database. I was thinking of moving them to a vector database like Pinecone or Weaviate.
How do you use them once in the database?
I’m using Pinecone to store the vectors, then using Pinecone’s cosine similarity search to find relevant context. Unfortunately, it’s not cheap. Just playing around with some documents has already cost me $130, so I don’t know how realistic this is at scale.
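Roughly, the flow looks like this with the pinecone-client Python package; the index name, credentials, and metadata fields are placeholders, and the index is assumed to have been created with the cosine metric:

import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="us-east1-gcp")  # placeholder credentials
index = pinecone.Index("my-docs")  # hypothetical index, created with the cosine metric

chunk_text = "some chunk of a document"   # placeholder text
embedding = [0.0] * 1536                  # placeholder ada-002-sized vector
index.upsert(vectors=[("chunk-1", embedding, {"text": chunk_text})])

query_embedding = [0.0] * 1536            # placeholder query vector
results = index.query(vector=query_embedding, top_k=5, include_metadata=True)
print(results)  # matches come back with id, similarity score, and the stored metadata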
Hi @georgei, you need to write some code in Python or another language to link the database to OpenAI’s API. The examples here provide a lot of knowledge:
@Imccallum It would be helpful to provide a link to a script that specifically shows how to link the database to OpenAI’s API.
@arthur, wow, really not cheap. Will you consider Pinecone for production, or just for research purposes?
That’s still up in the air. I’m waiting to see how much it will cost for a beta tester’s documents. If it’s not realistic to cover the costs and have some margin, then I’ll look for another solution.
Weaviate also has an OpenAI module, so you can automatically store data and vectorize it.
I found Weaviate to be more reasonably priced.
So far I have only run a few tests, to check out their service.
Otherwise, their implementation is straightforward.
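As a rough illustration, here is a sketch using the weaviate-client Python package with the text2vec-openai vectorizer; the class name and property are made up, and the Weaviate instance is assumed to be configured with your OpenAI key:

import weaviate

client = weaviate.Client("http://localhost:8080")  # or your hosted Weaviate URL

# One-time schema: let Weaviate call OpenAI to vectorize the "text" property
client.schema.create_class({
    "class": "DocChunk",             # hypothetical class name
    "vectorizer": "text2vec-openai",
    "properties": [{"name": "text", "dataType": ["text"]}],
})

# Store a chunk; Weaviate generates and stores the embedding itself
client.data_object.create({"text": "some chunk of a document"}, "DocChunk")

# Semantic search over the stored chunks
result = (
    client.query.get("DocChunk", ["text"])
    .with_near_text({"concepts": ["how do I reset my password"]})
    .with_limit(3)
    .do()
)
print(result)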
I was running into this issue as well. With so much text being generated from OpenAI, I needed a better way to store all this text in a database that I can search.
To resolve this, I created a relational database that handles embeddings automatically, so you can use SQL to do similarity search:
SELECT *
FROM mldb.movie
JOIN model.semantic_search
ON model.semantic_search.inputs = mldb.movie.overview
WHERE model.semantic_search.similar = 'story about a man with a low IQ chasing the love of his life'
ORDER BY predictions.score DESC
Source code and a Docker image are available here…
I’d recommend switching away from Curie embeddings and using the new OpenAI embedding model text-embedding-ada-002; the performance should be better than Curie’s, and the dimensionality is only 1536 (it’s also 10x cheaper when building the embeddings on the OpenAI side).
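For reference, generating an ada-002 embedding with the openai Python library (v0.x-era API) looks roughly like this; the input text is a placeholder:

import openai

openai.api_key = "YOUR_API_KEY"

response = openai.Embedding.create(
    model="text-embedding-ada-002",
    input="a chunk of text to embed",
)
embedding = response["data"][0]["embedding"]  # a list of 1536 floats, unit length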
If using Pinecone, try using the other pods, e.g. the s1 pod; with that you’ll be storing vectors at 20% the cost of p1 (as s1 stores 5x more vectors than p1).
Using SQL you will run into scalability issues very quickly, as you’re performing an exhaustive search across all vectors. I don’t know how slow searching through 1M vectors in a SQL database would be, but I can’t imagine it’s much fun.
I created some docs+video on Pinecone about using OpenAI embeddings for semantic search, take a look here and here
For my project (hobby), I’m just dumping the embeddings as a text string into a MySQL column.
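If anyone wants to do the same, here is a small sketch of that approach (table and column names are made up): serialize the vector as JSON for the TEXT column, and parse it back for the similarity computation.

import json
import numpy as np

embedding = [0.12, -0.03, 0.98]        # placeholder vector (ada-002 gives 1536 floats)
embedding_str = json.dumps(embedding)  # store this string in a TEXT column
# e.g. cursor.execute("INSERT INTO chunks (body, embedding) VALUES (%s, %s)", (chunk_text, embedding_str))

vec = np.array(json.loads(embedding_str))   # read the string back and parse it
query = np.array([0.10, -0.01, 0.99])       # placeholder query embedding
cosine = float(np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec)))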
Pinecone and Milvus.
Paid and open-source options.
ChromaDB is great for local development. They are working on a hosted version, but until that’s live it’s hard to recommend for production just yet.
I store the generated embeddings in AWS DynamoDB, then at the end I create a Python pickle/cache on S3.
When it’s required, I pull the pickle from S3 and use it to find cosine similarity in Python code.
I don’t know if it is the best way to do this, but it works; it only takes 3-4 seconds to generate context.
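The producer side of that pattern looks roughly like this with boto3; the bucket and key names are placeholders, and the pickle is assumed to be a dict mapping a DynamoDB key to its embedding:

import pickle
import boto3

# Placeholder index: {dynamodb_key: embedding vector}, built while writing items to DynamoDB
embeddings = {"item-1": [0.1, 0.2, 0.3], "item-2": [0.0, 0.9, 0.1]}

s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-embeddings-bucket",   # placeholder bucket
    Key="embeddings/index.pkl",      # placeholder key
    Body=pickle.dumps(embeddings),
)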
This is literally what I do!
Also, if the pickle takes too long to search, split it up, create pickle “shards” and search them all in parallel.
For context, 400k embeddings in one pickle take about 1 second to search (on larger-memory lambdas, >5 GB).
So if you have 4 million embeddings and need 1-second latency, you need 10 pickles, and you async this out to 10 lambdas reporting back to a database monitored by some other lambda for completeness and next steps.
It scales infinitely, is simple, and is inexpensive to run.
If you can follow what I’m saying, you should implement this and stop paying for vector DB’s!
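To make the sharding idea concrete, here is a sketch of the per-shard search plus a local thread-pool fan-out (the setup described above fans out to separate lambdas instead; the shard paths and query vector are placeholders):

import pickle
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def search_shard(shard_path, query):
    # Each shard is assumed to be a dict {key: embedding} pickled to disk or S3
    with open(shard_path, "rb") as f:
        shard = pickle.load(f)
    keys = list(shard.keys())
    vecs = np.array([shard[k] for k in keys])
    scores = vecs @ np.array(query)       # dot product == cosine for unit vectors
    best = int(np.argmax(scores))
    return keys[best], float(scores[best])

shard_paths = ["shard-0.pkl", "shard-1.pkl"]   # placeholder shard files
query_embedding = [0.0] * 1536                 # placeholder query vector

with ThreadPoolExecutor() as pool:
    hits = list(pool.map(lambda p: search_shard(p, query_embedding), shard_paths))

best_key, best_score = max(hits, key=lambda h: h[1])   # overall winner across shards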
@curt.kennedy
I have tested this with 20k records.
Can you suggest how to improve my current architecture’s performance?
My context generation code first loads the embedding pickle from S3, which takes 2 seconds.
Then I create a data frame by loading and decoding the pickle, and then I run a search, which takes approximately 4 seconds.
So the overall operation takes 4-6 seconds; how can I improve this?
My lambda memory size is 3008 MB.
Any suggestions to boost performance, given I only have 20k records?
I would be sure to create the data frame outside (above) the lambda_handler, so it becomes a global held in memory and isn’t rebuilt on every invocation. Or, what I do is avoid the data frame altogether: the pickle is a dictionary, where the keys are the hashes into the DynamoDB database, and the values are the vectors, converted to numpy arrays.
The loading hit should only occur once at cold start if you do it this way; as long as the lambda is warm, you won’t pay the load time again.
I also strip out the DDB keys and the embedding vectors and put them in two separate arrays (as globals outside the handler). So when you find the maximum inner product, you simply find its index, and use that index to get your hash into the DDB to pull your text.
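A minimal sketch of that layout, assuming the pickle on S3 is a dict of {ddb_hash: embedding} and using placeholder bucket/key names:

import pickle

import boto3
import numpy as np

# Module-level code runs once per cold start, so the index stays in memory
# for every invocation while the lambda is warm.
_s3 = boto3.client("s3")
_body = _s3.get_object(Bucket="my-embeddings-bucket", Key="embeddings/index.pkl")["Body"].read()
_index = pickle.loads(_body)                   # {ddb_hash: [float, ...]}
_keys = list(_index.keys())                    # parallel list of DDB hashes
_vecs = np.array([_index[k] for k in _keys])   # matrix of embeddings, one row per key

def lambda_handler(event, context):
    q = np.array(event["query_embedding"])     # unit-length ada-002 query vector
    best = int(np.argmax(_vecs @ q))           # dot product == cosine for unit vectors
    return {"ddb_key": _keys[best]}            # use this hash to pull the text from DynamoDB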
Also not sure how you are searching, but a simple maximum inner product search, using the dot-product, is all that is required if you are using unit vector embeddings (which is what ada-002 uses). Example below:
import numpy as np

def mips_naive(q, vecs):
    # Exhaustive maximum inner product search: with unit-length embeddings
    # (like ada-002), the dot product is the cosine similarity.
    mip = -1e10
    idx = -1
    for i, v in enumerate(vecs):
        c = np.dot(q, v)  # Manhattan for possibly more speed: np.sum(np.abs(q - v))
        if c > mip:
            mip = c
            idx = i
    return idx, mip
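A vectorized equivalent, assuming vecs is stacked into a single 2-D numpy array with one embedding per row (reusing the numpy import above):

def mips_vectorized(q, vecs):
    scores = vecs @ q                 # all dot products in one call
    idx = int(np.argmax(scores))
    return idx, float(scores[idx])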
For a project we run, we store the vectors and context in MySQL, in separate tables with matching unique UUIDs. A cosine similarity is run on the vectors, and we retrieve the context data by UUID from the other table.
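A sketch of that retrieval flow in Python; the table and column names are made up, embeddings are assumed to be stored as JSON strings, and cursor stands in for an open MySQL DB-API cursor from whatever client you use:

import json
import numpy as np

# Assumed schema: vectors(uuid CHAR(36), embedding TEXT) and context(uuid CHAR(36), body TEXT)
cursor.execute("SELECT uuid, embedding FROM vectors")
rows = cursor.fetchall()

uuids = [r[0] for r in rows]
vecs = np.array([json.loads(r[1]) for r in rows])   # embeddings stored as JSON strings

q = np.array(query_embedding)                       # placeholder query embedding
scores = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q))  # cosine similarity
best_uuid = uuids[int(np.argmax(scores))]

cursor.execute("SELECT body FROM context WHERE uuid = %s", (best_uuid,))
context_text = cursor.fetchone()[0]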