The embedding endpoint is great, but the dimensionality of the embeddings is very high, e.g., Curie (4096 dimensions). Some databases can’t store them for production use or load them in a single query. What tools do you guys use to store a number of text chunks (more than 100) and their corresponding embeddings, which need to be frequently updated and queried?
Which database tools are suitable for storing embeddings generated by the Embedding endpoint?
I am simply storing them in a SQL database. I was thinking of moving them to a vector database like Pinecone or Weaviate.
How do you use them once in the database?
I’m using Pinecone to store the vectors, then using Pinecone’s cosine similarity search to find relevant context. Unfortunately, it’s not cheap. Just playing around with some documents has already cost me $130, so I don’t know how realistic this is at scale.
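For anyone wondering what a cosine similarity search does conceptually, here is a minimal pure-Python sketch. It is not Pinecone’s API: the function names cosine_similarity and top_k and the toy 3-dimensional vectors are made up for illustration, and real embeddings would have thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, chunks, k=2):
    """Return the k (text, score) pairs most similar to the query.

    `chunks` is a list of (text, embedding) pairs.
    """
    scored = [(text, cosine_similarity(query, emb)) for text, emb in chunks]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 3-dimensional vectors standing in for real embeddings.
chunks = [
    ("pricing page", [0.9, 0.1, 0.0]),
    ("install guide", [0.0, 1.0, 0.2]),
    ("billing FAQ", [0.8, 0.2, 0.1]),
]
results = top_k([1.0, 0.0, 0.0], chunks, k=2)
print(results)
```

The texts of the best-scoring chunks are what you would then paste into the prompt as context.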
Hi @georgei, you need to write some code in Python or another language to link the database to OpenAI’s API. The examples here provide a lot of knowledge:
@Imccallum It would be helpful to provide a link to a script that specifically shows how to link the database to OpenAI’s API.
@arthur, wow, really not cheap. Will you consider Pinecone for production or just for research purposes?
That’s still up in the air. I’m waiting to see how much it will cost for a beta tester’s documents. If it’s not realistic to cover the costs and have some margin, then I’ll look for another solution.
Weaviate also has an OpenAI module so that you can automatically store data and vectorize it.
I found Weaviate to be more reasonably priced.
So far I’ve only run a few tests to check out their service.
Otherwise, their implementation is straightforward.
I was running into this issue as well. With so much text being generated from OpenAI, I needed a better way to store all this text in a database that I can search.
To resolve this, I created a relational database that handles embeddings automatically, so you can use SQL to do similarity search.
SELECT *
FROM mldb.movie
JOIN model.semantic_search
  ON model.semantic_search.inputs = mldb.movie.overview
WHERE model.semantic_search.similar = 'story about a man with a low IQ chasing the love of his love'
ORDER BY predictions.score DESC
Source code and docker is available here…
I’d recommend switching away from Curie embeddings to the new OpenAI embedding model, text-embedding-ada-002. Its performance should be better than Curie’s, its dimensionality is only 1536, and it is also 10x cheaper when building the embeddings on the OpenAI side.
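To put the dimensionality difference in storage terms, here is a rough back-of-the-envelope sketch. It assumes each value is stored as a 32-bit float (4 bytes) and uses 1536 as the ada-002 dimension; the helper name vector_bytes is made up, and real databases add index and metadata overhead on top of these raw numbers.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as 32-bit floats

def vector_bytes(dimensions, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw size of one embedding, ignoring any database overhead."""
    return dimensions * bytes_per_value

curie_bytes = vector_bytes(4096)  # Curie embedding
ada_bytes = vector_bytes(1536)    # text-embedding-ada-002 embedding

# Raw size of one million vectors, in gigabytes (10**9 bytes).
curie_gb = curie_bytes * 1_000_000 / 1e9
ada_gb = ada_bytes * 1_000_000 / 1e9

print(curie_bytes, ada_bytes)  # 16384 6144
print(curie_gb, ada_gb)        # 16.384 6.144
```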
If you’re using Pinecone, try the other pod types, e.g. the s1 pod. With s1 you’ll be storing vectors at 20% of the cost of p1, since an s1 pod stores 5x more vectors than a p1 pod.
With SQL you will run into scalability issues very quickly, because you’re performing an exhaustive search across all vectors. I don’t know how slow searching through 1M vectors in a SQL database would be, but I can’t imagine it’s much fun.
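To make the exhaustive search point concrete, here is a small sketch of a brute-force scan. It is only an illustration: the function brute_force_search is hypothetical, the vectors are random 8-dimensional toys, and the dot product is used as the score. The thing to notice is that every stored vector is scored on every query, so the work grows linearly with the number of rows.

```python
import heapq
import random

def brute_force_search(query, vectors, k=3):
    """Score every stored vector against the query; no index is used."""
    comparisons = 0
    scored = []
    for idx, vec in enumerate(vectors):
        score = sum(q * v for q, v in zip(query, vec))  # dot product
        scored.append((score, idx))
        comparisons += 1
    # Keep only the k highest-scoring (score, index) pairs.
    return heapq.nlargest(k, scored), comparisons

random.seed(0)
dims, n = 8, 1000
vectors = [[random.random() for _ in range(dims)] for _ in range(n)]

best, comparisons = brute_force_search(vectors[42], vectors, k=3)
print(comparisons)  # one comparison per stored vector
```

Vector databases avoid this by building an approximate nearest-neighbour index, so a query only touches a small fraction of the stored vectors.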
I created some docs+video on Pinecone about using OpenAI embeddings for semantic search, take a look here and here
For my project (hobby), I’m just dumping the embeddings as a text string into a MySQL column.
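A minimal sketch of that approach, for anyone who wants to try it. Assumptions: it uses Python’s built-in sqlite3 as a stand-in for MySQL so that it runs anywhere, serializes the vector as JSON text, and the table and function names (chunks, save_chunk, load_chunks) are made up.

```python
import json
import sqlite3

# In-memory SQLite database standing in for a MySQL connection.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunks (id INTEGER PRIMARY KEY, body TEXT, embedding TEXT)"
)

def save_chunk(conn, body, embedding):
    """Store the text and its embedding, with the vector as a JSON string."""
    conn.execute(
        "INSERT INTO chunks (body, embedding) VALUES (?, ?)",
        (body, json.dumps(embedding)),
    )

def load_chunks(conn):
    """Read every row back and parse the JSON string into a list of floats."""
    rows = conn.execute("SELECT body, embedding FROM chunks").fetchall()
    return [(body, json.loads(text)) for body, text in rows]

save_chunk(conn, "hello world", [0.1, -0.2, 0.3])
loaded = load_chunks(conn)
print(loaded)
```

The similarity comparison then has to happen in application code after loading the rows, which is fine for a hobby-sized dataset.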
Pinecone and Milvus
Paid and open source options
Chromadb is great for local development. They are working on a hosted version, but until that’s live it’s hard to recommend for production.