ChatGPT & llamaindex & embeddings

Sissico · March 15, 2023, 4:36pm

Hello everyone. I think I don’t get the differences (and pros and cons) of these two approaches to building a chatbot based on GPT-3 with a custom knowledge base based on documents. The approaches I am referring to are:

1. Use Llama Index (GPT-Index) to create an index for my documents and then LangChain. Like this Google Colab
2. Use LangChain embeddings (which, if I understood correctly, is more expensive because you pay both for API tokens and for embedding tokens). Like this Google Colab

Could you please help me understand the differences? Thank you.

curt.kennedy · March 15, 2023, 5:15pm

For embeddings, you only pay once for your core data to be embedded if you save your results to a database. Assuming you are using ada-002 for embeddings, it costs $0.0004 per 1K tokens (so a few orders of magnitude cheaper than a completion).
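
To put that price in perspective, here is a rough back-of-the-envelope sketch. The only figure taken from above is the $0.0004 per 1K tokens; the corpus size is a made-up assumption for illustration.

```python
# Back-of-the-envelope embedding cost, using the ada-002 price quoted above.
# The corpus size below is a hypothetical example, not real data.
ADA_002_USD_PER_1K_TOKENS = 0.0004

def embedding_cost_usd(total_tokens: int,
                       usd_per_1k: float = ADA_002_USD_PER_1K_TOKENS) -> float:
    """One-time cost of embedding `total_tokens` tokens."""
    return total_tokens / 1000 * usd_per_1k

# Hypothetical corpus: 10,000 chunks of about 500 tokens each.
corpus_tokens = 10_000 * 500
print(f"${embedding_cost_usd(corpus_tokens):.2f}")
```

Since the vectors are saved, this cost is paid once per corpus (and again only if you re-embed), whereas completion tokens are paid on every query.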

You would have to experiment since both can potentially create large input prompts. LangChain is more flexible, since you can call non-GPT logic, whereas a straight embeddings approach is more straightforward (IMO). So it depends on your exact situation. But depending on what you are really trying to do, you can implement both, test them, and see what works best for you. If you do, report back! I’d be curious.

bill.french · March 15, 2023, 11:11pm

And by “database”, you probably need to use a vector database like Pinecone or Weaviate. These make it so much easier to take any query, get the vector, and then match it up to the best points in your vector database.

My only fear is that your investment into getting and storing all of the vectors for a given data set may be at risk if the model used to get future query vectors has changed significantly. Would a new version of text-embedding-ada-002 create an impedance mismatch between your historical vector store and the newer model?

Color me worried.

curt.kennedy · March 16, 2023, 12:47am

You should be “worried” Bill. There is an impedance mismatch. However, the cost of re-generating is low to free. Actually, the cost of having the database itself is usually the dominant term. So no need to worry, since that will always be there.

bill.french · March 16, 2023, 12:53am

True, the cost is so small. Revectoring, then, is a reasonable cost of maintenance.
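
One way to keep that maintenance manageable is to store the model name next to each vector, so stale entries can be found and re-embedded after a model change. A minimal sketch, assuming a hypothetical record layout and a placeholder `embed` callable (neither is a real API):

```python
from typing import Callable, Dict, List

# Hypothetical layout: {"text": str, "model": str, "vector": list of floats}
Record = Dict[str, object]

def reembed_stale(records: List[Record], current_model: str,
                  embed: Callable[[str], List[float]]) -> int:
    """Re-embed every record whose vector came from a different model.

    `embed` is a placeholder for whatever function calls your embedding
    endpoint. Returns the number of records that were refreshed.
    """
    refreshed = 0
    for rec in records:
        if rec["model"] != current_model:
            rec["vector"] = embed(rec["text"])
            rec["model"] = current_model
            refreshed += 1
    return refreshed
```

Vectors produced by different models should not be compared with each other, so the query vector and the stored vectors need to come from the same model.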

ruby_coder · March 16, 2023, 1:56am

Hi @bill.french

This is not really necessary nor true and it is basically “tech hype”, in my view, to be honest.

Storing serialized data in databases is a technology “as old as the hills” and even PHP forums going back two decades store large arrays of data as serialized data (which are more complex than single-dimensional serialized embedding vectors).

In other words, it is trivial for any experienced webdev to store embedding vectors in a DB as a serialized object and to query the DB and perform the linear algebra fun and games with these vectors.

In fact, I do this very thing with OpenAI embedding vectors on a daily basis using a DB, and here is an example from one of my Rails project's models, showing that the actual vector is serialized by the DB layer automatically (basically a long-established built-in function to serialize arrays):

class Embedding < ApplicationRecord
    serialize :vector, Array
end

The simple DB model above contains both the text (chunks) and the serialized embedding vectors, and it is very fast. When we want even more speed we simply use Redis, which is blazing fast in memory.

create_table "embeddings", force: :cascade do |t|
  t.string "openai_id"
  t.string "model"
  t.string "prompt"
  t.string "vector"
  t.datetime "created_at", precision: 6, null: false
  t.datetime "updated_at", precision: 6, null: false
end
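
For readers not on Rails, the same idea can be sketched in Python with the standard-library `sqlite3` and `json` modules. This is only an illustration, assuming a table loosely modeled on the schema above; the helper names and the tiny 3-dimensional vectors are made up for the example.

```python
import json
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE embeddings "
    "(id INTEGER PRIMARY KEY, model TEXT, prompt TEXT, vector TEXT)"
)

def save(prompt, vector, model="text-embedding-ada-002"):
    # Serialize the vector to a JSON string, similar in spirit to the
    # Rails `serialize` call above.
    conn.execute(
        "INSERT INTO embeddings (model, prompt, vector) VALUES (?, ?, ?)",
        (model, prompt, json.dumps(vector)),
    )

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query_vector):
    # Load every row, deserialize it, and do the linear algebra in the app.
    rows = conn.execute("SELECT prompt, vector FROM embeddings").fetchall()
    best = max(rows, key=lambda r: cosine(query_vector, json.loads(r[1])))
    return best[0]

# Toy data: made-up 3-dimensional vectors standing in for real embeddings.
save("cats purr", [0.9, 0.1, 0.0])
save("stocks fell", [0.0, 0.2, 0.9])
print(most_similar([1.0, 0.0, 0.0]))
```

Here the database only stores and returns the serialized vectors; the similarity computation happens in application code after the rows are loaded.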

To be honest, I don’t use Pinecone or Weaviate, but I have looked at them before and, as I recall, these DBs are basically network services, which means they require network calls. They seem good at marketing their services as “must haves” for vectors, and good on them for marketing, but having a database “on the same network as your app” which requires no external network calls is actually more reliable from a network system engineering perspective and less costly, since MySQL, PostgreSQL, etc. are basically free but these “vector DB services” are not. Furthermore, most large organizations already have very competent DB admin teams.

Databases are basically “free” for many developers. MySQL, PostgreSQL, etc. are free to download and any experienced app developer can easily set up these DBs to work fast with vectors, especially OpenAI vectors, which are single-dimensional.

Well, agree that there is a lot to worry about in tech, and this is just one of 100s or even 1000s of things we system engineers and software developers can worry about “in the future”.

If we apply the “worry about the future” logic, then we can be “afraid” that vector DB services might go out of business, go offline, be hacked, raise prices, etc. to infinity; so as a systems engineer as well as a developer, my objective is to apply “future concerns” equally across the software engineering spectrum without being caught up in the “tech hype of the year” cycle.

We also know, BTW, that OpenAI uses PostgreSQL and not these third-party vector DB services, and OpenAI recently announced they are scaling up their PostgreSQL infrastructure to help them with exponential growth.

Just keeping things objective from an independent, system engineering perspective.

Hope this helps.

curt.kennedy · March 16, 2023, 3:56am

@ruby_coder The database cost (for me) is still the dominant term. For whatever reason, data in the cloud is expensive (why? … I have no idea). But, luckily I don’t need to use vector databases for search, just in-memory data structures that I code myself, but to scale up to billions of embeddings, you should look into vector databases for quick vector searches. There are theoretical reasons for this, and I am thinking of the algorithm in FAISS now.

As for SQL DB’s being cheap … in the cloud … no way! Go serverless and forget about it! For any SQL DB in the cloud you pay by the hour. I pay by reads/writes or I go provisioned.

@bill.french I think revectoring is the cost of maintenance!

Everyone’s infrastructure requirements are different.

ruby_coder · March 16, 2023, 4:04am

As mentioned, I was not commenting on cloud services; I said it is free to host these DBs on your own network.

In addition, there is a big difference between running your own databases (which my clients and I do, many internally and many hosted in data centers) and buying DB “cloud services”.

The issue I have here is that people often post “solutions” without getting down into the weeds of an organization’s capabilities, current IT infrastructure, employee skill sets, prior investments, etc.

The fact of the matter, and I am sure you agree, is that there is no technical reason you cannot use a SQL database to store vectors as serialized data in the DB; storing vectors, arrays, etc. as serialized data in a SQL DB has been around for a very long time, and it works very well.

It is not a “requirement” in this field to use vector-based DBs “as a service” to build reliable and scalable embedding applications successfully.

HTH

curt.kennedy · March 16, 2023, 4:10am

I agree that vector DB’s are WAAY overhyped, and only fill a niche of maybe 1% of the systems out there. Whereas most folks, like me, can get away with two things … in-memory search (using naive/linear techniques), followed up by simple DB lookups (serverless here, it’s cheap!).

It’s better to start simple, for sure. NO VECTOR DB’s!!! Grow to them if you have to, but not the first choice.

ruby_coder · March 16, 2023, 4:33am

Agreed, of course.

I have had good luck with standard “SQL DB” as mentioned and if speed becomes an issue on the server side, I can easily add Redis.

From what I have seen here, the limiting performance factor is not DBs on the user / client side, but the OpenAI infrastructure performance, especially in the recent turbo models.

I agree that for people building their first apps and creating a user base, etc., there is really no need to go “vector DB” when this can easily be done with “traditional” SQL DBs.

In addition, when searching text, vectorizing keywords and short phrases provides very poor results compared to DB keyword and / or DB full-text searches.

When a system designer moves (for example) to a “fully vectorized DB approach” they can lose the ability to use standard full-text DB searches when more optimal than vector-based semantic search.
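
To illustrate keeping both options open, here is a small sketch of routing short queries to a keyword search and everything else to a semantic search. The function names, the substring match standing in for a DB full-text query, and the two-word cutoff are all assumptions for illustration, not tuned values or a real API.

```python
def keyword_search(docs, query):
    """Case-insensitive substring match, standing in for a DB full-text query."""
    q = query.lower()
    return [d for d in docs if q in d.lower()]

def search(docs, query, semantic_search, max_keyword_words=2):
    """Use keyword search for short queries, semantic search otherwise.

    `semantic_search` is a placeholder callable for an embedding-based
    search; the two-word cutoff is an arbitrary assumption.
    """
    if len(query.split()) <= max_keyword_words:
        hits = keyword_search(docs, query)
        if hits:
            return hits
    # Long queries, or short ones with no keyword hits, fall through here.
    return semantic_search(docs, query)
```

Because the text lives in the same table as the vectors, both paths can be served from one ordinary database.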

That is why, in my view and I agree with you @curt.kennedy , it is good to not fall “for the hype” and to start off simple as you have said @curt.kennedy.

There is no shortage of “hype” and no shortage of people, start-ups, etc hoping to profit off OpenAI technology and the current hype.

bill.french · March 16, 2023, 4:57am

This is really good to know. I was convinced by an OpenAI blog post that a vector database was required.

bill.french · March 16, 2023, 5:06am

This is very helpful. Imagine I wanted to perform embedding search on say, a Jetson. Could the database provide an edge-based solution without connectivity?

ruby_coder · March 16, 2023, 5:24am

You search for something on the comic “The Jetsons”?

Sorry, but it’s hard to answer your question without specifying what “a Jetson” is in your question. You mean George Jetson or his son Elroy or Jane, his wife? Or daughter Judy?

bill.french · March 16, 2023, 1:20pm

Ha ha! Well, Jane was hot, but not as hot as an NVIDIA Jetson running at MAXN with six cores.

Yeah, words have meaning; we need to make sure we use enough of them.

Pinecone, as you know, cannot run on-prem. My requirement for this product is to perform embedding searches during periods of disconnectivity.

curt.kennedy · March 16, 2023, 2:40pm

Your GPU might pair well with the open source Facebook AI Similarity Search (FAISS). But if you have fewer than 1 million embeddings, as discussed above, you can do this “by hand” with a naive search like this:

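import numpy as np

def mips_naive(q, vecs):
    """Naive maximum inner product search: return (index, score) of the best match."""
    mip = -1e10
    idx = -1
    for i, v in enumerate(vecs):
        c = np.dot(q, v)  # dot is the same as cosine similarity for unit vectors
        if c > mip:
            mip = c
            idx = i
    return idx, mip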
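
The same search can also be written without the Python loop and extended to return the top few matches. A sketch, assuming the stored vectors fit in memory as one NumPy array; the name `top_k` and the toy data are made up for illustration.

```python
import numpy as np

def top_k(q, vecs, k=3):
    """Return indices and scores of the k rows of `vecs` most similar to `q`."""
    # Normalize to unit length so the dot product equals cosine similarity.
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    q = q / np.linalg.norm(q)
    scores = vecs @ q                # one matrix-vector product, no Python loop
    idx = np.argsort(-scores)[:k]    # highest scores first
    return idx, scores[idx]

# Toy data: three 2-dimensional vectors standing in for real embeddings.
vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
idx, scores = top_k(np.array([1.0, 0.1]), vecs, k=2)
print(idx, scores)
```

If the stored vectors are already unit length, the normalization step can be skipped.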

Also you could use Redis, see this thread: Using Redis for embeddings

FP · April 12, 2023, 4:31pm

Doesn’t Pinecone provide the ability to query by cosine similarity, meaning Pinecone performs the task of both storing the vectors and performing the linear algebra?

How do you find similarity in your method? You query (import) all the stored vectors and compute cosine similarity in a loop against your query?

chengyineng · April 24, 2023, 12:10am

Do you have a link to this post that mentions their use of PostgreSQL DB and the announcement of scaling up infrastructure? I’m curious to read more. Thanks!

moltar · May 4, 2023, 2:49am

The claim that OpenAI uses PG, ergo vector DBs are not useful, is about as credible as all the startups that put Fortune 100 company logos on their websites because somewhere, one mid-level dev with a corporate card once signed up for a trial, forgot to turn it off, and got billed for a month, ergo “Apple uses our product”.

Databases are tools.

While I share your skepticism of “hype” and think VCs have rushed into raising enormous rounds for vector DB startups at insane valuations without truly understanding the specialized nature of the product and segment, I think your post here might do the opposite: discount the very utility of such a specialized tool.

Can you fasten/remove a Torx bolt by jamming just the right size Phillips drivers onto it and will it work in a jam, for a single little project, or a few times? SURE. Will it come and bite you later if you try to confuse it for a Torx driver? Yes. Without a doubt. Is it the right tool for the job for a professional, at scale, wanting to give their customer the best work? No. Objectively: no.

Postgres is amazing. What a wonderful general purpose data store it is! It even has some incredible plug ins. But the very fact that its wire protocol has been used to reimplement the actual engine for things like time series, active-active, sharding, and horizontal scale tells us a very important fact: it is not the silver bullet you are making it out to be.

Your commentary is hardly objective nor based in “system engineering”: I am sure OpenAI uses Postgres. I am sure they use it for its strengths (like transactional data, HRIS applications, or the myriad other things any business does). If it underpins their actual technology as a primary vector store, I would guess that it is only with some very, very advanced, proprietary pg_* plugins, storage layers, etc. that basically turn it into a CockroachDB-style implementation where it’s just the PG wire protocol talking to an enormously different storage engine (read: NOT at all Postgres).

I love me some PG just as much as the next guy, and think this “Vector DBs are the greatest thing since sliced bread and will solve all my problems and make everything else obsolete” is just as crazy as “Vector DBs are just a fad, meh, Postgres FTW”.

If you can actually substantiate that OpenAI is using vanilla-ish (or close to it) PG (and its actual storage, query, etc engine) for actual OpenAI vector or embedding use, I encourage you to substantiate your claim, but I suspect that’s not possible because 1) that information is largely proprietary 2) we know that PG as a datastore is not built for that at even a fraction of a percent of OpenAI’s scale. I am sure vanilla-ish PG exists in their ERP, CRM, etc systems abound, but that’s a specious argument to confuse that with the actual service delivery stack to try and discredit Vector DBs.

EDIT: Also, let’s not confuse pg_openai and attendant end user functions/stored procedures/UDFs with what it takes to run OpenAI’s service delivery fabric.

moltar · May 4, 2023, 2:53am

Don’t hold your breath. OpenAI is not using Postgres in lieu of specialized vector data stores.

louis030195 · May 9, 2023, 3:20pm

Keep in mind that people out there pay a monthly fee for feature flags as a service. There’s definitely a market for OP’s product.
