Hello everyone. I think I don’t get the differences (and pros and cons) of these two approaches to building a chatbot based on GPT-3 with a custom knowledge base based on documents. The approaches I am referring to are:
Use LlamaIndex (GPT Index) to create an index for my documents, and then use LangChain. Like this Google Colab
Use LangChain embeddings (which, if I understood correctly, is more expensive because you pay both for API tokens and for embedding tokens). Like this Google Colab
Could you please help me understand the differences? thank you.
For embeddings, you only pay once for your core data to be embedded if you save your results to a database. Assuming you are using ada-002 for embeddings, it costs $0.0004 per 1k tokens (so far cheaper per token than a completion).
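To make the one-time cost concrete, here is a rough sketch of the arithmetic. The price is the one quoted above; the corpus size and the `embedding_cost` helper are made up purely for illustration.

```python
# Back-of-the-envelope cost of embedding a corpus once.
PRICE_PER_1K_TOKENS = 0.0004  # USD, text-embedding-ada-002 as quoted in this thread


def embedding_cost(total_tokens: int, price_per_1k: float = PRICE_PER_1K_TOKENS) -> float:
    """Return the one-time cost in USD of embedding `total_tokens` tokens."""
    return total_tokens / 1000 * price_per_1k


# Hypothetical corpus: 2,000 documents of about 500 tokens each.
corpus_tokens = 2_000 * 500
print(f"${embedding_cost(corpus_tokens):.2f}")
```

Each query you embed later is billed the same way, but a single query is usually only a handful of tokens.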
You would have to experiment since both can potentially create large input prompts. LangChain is more flexible, you can call non-GPT logic, whereas a straight embeddings approach is more straightforward (IMO). So it depends on your exact situation. But depending on what you are really trying to do, you can implement both, test it, and see what works best for you. If you do, report back! I’d be curious.
And by “database”, you probably need to use a vector database like Pinecone or Weaviate. These make it so much easier to take any query, get the vector, and then match it up to the best points in your vector database.
My only fear is that your investment into getting and storing all of the vectors for a given data set may be at risk if the model used to get future query vectors has changed significantly. Would a new version of text-embedding-ada-002 create an impedance mismatch between your historical vector store and the newer model?
You should be “worried”, Bill. There is an impedance mismatch. However, the cost of regenerating the vectors ranges from low to free. Actually, the cost of having the database itself is usually the dominant term. So no need to worry, since that will always be there.
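One way to keep the mismatch manageable is to record which model produced each stored vector, so you can tell what needs regenerating when the model changes. This is only a sketch: the `StoredEmbedding` record, the `needs_reembedding` helper, and the second model name are hypothetical and not from any library.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StoredEmbedding:
    text: str
    vector: List[float]
    model: str  # name of the model that produced `vector`


def needs_reembedding(rows: List[StoredEmbedding], current_model: str) -> List[StoredEmbedding]:
    """Return the rows whose vectors came from a model other than `current_model`."""
    return [r for r in rows if r.model != current_model]


rows = [
    StoredEmbedding("first chunk", [0.1, 0.2], "text-embedding-ada-002"),
    StoredEmbedding("second chunk", [0.3, 0.4], "some-older-model"),  # hypothetical model name
]
stale = needs_reembedding(rows, "text-embedding-ada-002")
print([r.text for r in stale])
```

Only the stale rows have to be sent back through the embedding endpoint, and query vectors should only ever be compared against rows from the same model.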
This is not really necessary nor true and it is basically “tech hype”, in my view, to be honest.
Storing serialized data in databases is a technology “as old as the hills”, and even PHP forums going back two decades store large arrays of data as serialized data (which are more complex than single-dimensional serialized embedding vectors).
In other words, it is trivial for any experienced webdev to store embedding vectors in a DB as a serialized object, query the DB, and perform the linear algebra fun and games with these vectors.
In fact, I do this very thing with OpenAI embedding vectors on a daily basis using a DB. Here is an example from the models of one of my Rails projects, showing that the actual vector is serialized by the DB layer automatically (basically a long-established built-in function to serialize arrays):
class Embedding < ApplicationRecord
  serialize :vector, Array
end
The simple DB model above contains both the text (chunks) and the serialized embedding vectors, and it is very fast. When we want even more speed, we simply use Redis, which is blazing fast because it is in memory.
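For readers not on Rails, the same idea can be sketched in Python with nothing but the standard library. This is not the Rails code above: the table layout, the JSON serialization, and the `save` and `most_similar` helpers are assumptions made for illustration, and the three-dimensional vectors stand in for real embeddings.

```python
import json
import math
import sqlite3

# In-memory SQLite database; the vector is stored as JSON text in an ordinary column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE embeddings (id INTEGER PRIMARY KEY, chunk TEXT, vector TEXT)")


def save(chunk, vector):
    """Serialize the vector to JSON and store it next to its text chunk."""
    conn.execute(
        "INSERT INTO embeddings (chunk, vector) VALUES (?, ?)",
        (chunk, json.dumps(list(vector))),
    )


def cosine(a, b):
    """Cosine similarity of two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def most_similar(query):
    """Linear scan: deserialize every stored vector and keep the best match."""
    best_chunk, best_score = None, -math.inf
    for chunk, raw in conn.execute("SELECT chunk, vector FROM embeddings"):
        score = cosine(query, json.loads(raw))
        if score > best_score:
            best_chunk, best_score = chunk, score
    return best_chunk, best_score


save("cats purr", [1.0, 0.0, 0.0])
save("dogs bark", [0.0, 1.0, 0.0])
print(most_similar([0.9, 0.1, 0.0]))
```

The scan is linear in the number of rows, which is the trade-off debated in the rest of this thread: simple and dependency-free for small collections, slow for very large ones.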
To be honest, I don’t use Pinecone or Weaviate, but I have looked at them before and, as I recall, these DBs are basically network services, which means they require network calls. They seem good at marketing their services as “must haves” for vectors, and good on them for marketing. But having a database “on the same network as your app”, which requires no external network calls, is actually more reliable from a network system engineering perspective. It is also less costly, since MySQL, PostgreSQL, etc. are basically free but these “vector DB services” are not. Furthermore, most large organizations already have very competent DB admin teams.
Databases are basically “free” for many developers. MySQL, PostgreSQL, etc. are free to download, and any experienced app developer can easily set up these DBs to work fast with vectors, especially OpenAI vectors, which are single dimensional.
Well, agree that there is a lot to worry about in tech, and this is just one of 100s or even 1000s of things we system engineers and software developers can worry about “in the future”.
If we apply the “worry about the future” logic, then we can be “afraid” that vector DB services might go out of business, go offline, be hacked, raise prices, etc. to infinity; so as a systems engineer as well as a developer, my objective is to apply “future concerns” equally across the software engineering spectrum without being caught up in the “tech hype of the year” cycle.
We also know, BTW, that OpenAI uses PostgreSQL and not these third-party vector DB services, and OpenAI recently announced they are scaling up their PostgreSQL infrastructure to help them with exponential growth.
Just keeping things objective from an independent, system engineering perspective.
@ruby_coder The database cost (for me) is still the dominant term. For whatever reason, data in the cloud is expensive (why? … I have no idea). But, luckily I don’t need to use vector databases for search, just in-memory data structures that I code myself, but to scale up to billions of embeddings, you should look into vector databases for quick vector searches. There are theoretical reasons for this, and I am thinking of the algorithm in FAISS now.
As for SQL DBs being cheap … in the cloud … no way! Go serverless and forget about it! For any SQL DB in the cloud you pay by the hour. I pay by reads/writes or I go provisioned.
@bill.french I think revectoring is the cost of maintenance!
Everyone’s infrastructure requirements are different.
As mentioned, I was not commenting on cloud services, I mentioned it was free to host these DB on your own network.
In addition, there is a big difference between running your own databases (which my clients and I do, many internally and many hosted in data centers) and buying DB “cloud services”.
The issue I have here is that people often post “solutions” without getting down into the weeds of an organization’s capabilities, current IT infrastructure, employee skill sets, prior investments, etc.
The fact of the matter, and I am sure you agree, is that there is no technical reason you cannot use a SQL database to store vectors as serialized data; storing vectors, arrays, etc. as serialized data in a SQL DB has been around for a very long time, and it works very well.
It is not a “requirement” in this field to use vector-based DBs “as a service” to build reliable and scalable embedding applications successfully.
I agree that vector DB’s are WAAY overhyped, and only fill a niche of maybe 1% of the systems out there. Whereas most folks, like me, can get away with two things … in-memory search (using naive/linear techniques), followed up by simple DB lookups (serverless here, it’s cheap!).
It’s better to start simple, for sure. NO VECTOR DB’s!!! Grow to them if you have to, but not the first choice.
Your GPU might pair well with the open source Facebook AI Similarity Search (FAISS). But if you have fewer than 1 million embeddings, as discussed above, you can do this “by hand” with a naive search like this:
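import numpy as np

def mips_naive(q, vecs):
    # Brute-force maximum inner product search: returns (index, score) of the best match.
    mip = -1e10
    idx = -1
    for i, v in enumerate(vecs):
        c = np.dot(q, v)  # dot is the same as cosine similarity for unit vectors
        if c > mip:
            mip = c
            idx = i
    return idx, mip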
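If the vectors are already stacked in a 2-D NumPy array, the same naive search can be written as one matrix-vector product. This is a sketch rather than a replacement for the function above: `mips_vectorized` is a hypothetical name, the random data stands in for real embeddings, and the rows are assumed to be normalized to unit length so that the dot product equals cosine similarity.

```python
import numpy as np


def mips_vectorized(q, vecs):
    """Return (index, score) of the row of `vecs` with the largest inner product with `q`."""
    scores = vecs @ q  # one dot product per row, computed in a single call
    idx = int(np.argmax(scores))
    return idx, float(scores[idx])


# Hypothetical data: 1,000 random 8-dimensional vectors, normalized to unit length.
rng = np.random.default_rng(0)
vecs = rng.normal(size=(1000, 8))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

q = vecs[42]  # querying with a stored vector should return that same row
idx, score = mips_vectorized(q, vecs)
print(idx, round(score, 6))
```

This is still a linear scan over every embedding; it just moves the loop out of Python and into NumPy.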
The claim that OpenAI uses PG, ergo vector DBs are not useful, is about as credible as all the startups that put Fortune 100 company logos on their websites because somewhere, one mid-level dev with a corporate card once signed up for a trial, might have forgotten to turn it off, and was billed for a month, ergo “Apple uses our product”.
Databases are tools.
While I share your skepticism of “hype” and think VCs have rushed into raising enormous rounds for vector DB startups at insane valuations without truly understanding the specialized nature of the product and segment, I think your post here might do the opposite: discount the very utility of such a specialized tool.
Can you fasten/remove a Torx bolt by jamming just the right size Phillips drivers onto it and will it work in a jam, for a single little project, or a few times? SURE. Will it come and bite you later if you try to confuse it for a Torx driver? Yes. Without a doubt. Is it the right tool for the job for a professional, at scale, wanting to give their customer the best work? No. Objectively: no.
Postgres is amazing. What a wonderful general purpose data store it is! It even has some incredible plugins. But the very fact that its wire protocol has been used to reimplement the actual engine for things like time series, active-active, sharding, and horizontal scale tells us a very important fact: it is not the silver bullet you are making it out to be.
Your commentary is hardly objective, nor is it based in “system engineering”: I am sure OpenAI uses Postgres. I am sure they use it for its strengths (like transactional data, HRIS applications, or the myriad other things any business does). If it underpins their actual technology as a primary vector store, I would guess that it is only with some very, very advanced, proprietary pg_* plugins, storage layers, etc. that basically turn it into a CockroachDB-style implementation where it’s just the PG wire protocol talking to an enormously different storage engine (read: NOT at all Postgres).
I love me some PG just as much as the next guy, and think this “Vector DBs are the greatest thing since sliced bread and will solve all my problems and make everything else obsolete” is just as crazy as “Vector DBs are just a fad, meh, Postgres FTW”.
If you can actually substantiate that OpenAI is using vanilla-ish (or close to it) PG (and its actual storage, query, etc. engine) for actual OpenAI vector or embedding use, I encourage you to do so, but I suspect that’s not possible because 1) that information is largely proprietary and 2) we know that PG as a datastore is not built for that at even a fraction of a percent of OpenAI’s scale. I am sure vanilla-ish PG abounds in their ERP, CRM, etc. systems, but it is a specious argument to confuse that with the actual service delivery stack in order to discredit vector DBs.
EDIT: Also, let’s not confuse pg_openai and attendant end user functions/stored procedures/UDFs with what it takes to run OpenAI’s service delivery fabric.