Can this API be used to query internal data?

Thank you. I would not call myself an expert, though. I hope anyone who reads my posts evaluates them critically, as if I were just a person on the street.

I think you have clarified it very well with your post. Context and information delivery is very effective, and it clearly shows that GPT-3 and GPT-4 are more than capable of absorbing the information they are given.

Unless I’m mistaken, CustomGPT isn’t creating an LLM, they are building on the models available through the OpenAI API.

But, you are really good at explaining some of the technical aspects. There’s lots of proof in your body of commentary.

I also assume this is the case for a couple of reasons.

  1. The processing time it takes before the “model” is ready to test is very brief.
  2. The training corpus is a simple PDF document format - nothing special; no JSONL, etc.

This is partly what makes it so compelling for solutions that are very focused and relatively simple to automate the content-building process. Mileage may vary, of course, but so far…so good.

1 Like

@bill.french and @RonaldGRuckus, thank you for the response. Here is my understanding of things:

  1. It is not possible to add knowledge to any of the models OpenAI provides through their API. Their API allows for fine-tuning certain models, but the fine-tuning only conditions the model for a certain pattern of output; it does not actually add new knowledge to the model.

  2. Moreover, I am not aware of any fine-tuning method that allows you to reliably add knowledge to any models.

  3. My guess is that CustomGPT is actually doing retrieval augmented generation, where the information-retrieval (IR) step is performed using a vector database that stores embeddings of external data. This is the approach described by @wfhbrian above, and it is pretty much how every service like CustomGPT provides this sort of functionality.

1 Like

That seems logical to me.

Indeed, however, it seems that with the right mix of these principles, we can mimic behaviours that users perceive as doing exactly that. Is that a valid way to explain what I’m experiencing?

Yes, precisely. Using retrieval augmented generation, it is possible to give the illusion that the model has gained new knowledge.

@kintela You definitely need to use embeddings. The way it works is that the embedding maps your text data to a vector (a list of numbers). Then a new piece of data comes in, say a question, and it gets embedded too. This gets compared to all the information you have embedded, one piece at a time (just multiply the numbers together and sum … this is called a dot-product). Then you have your top-matching vectors, and you look up the corresponding text behind those vectors. Now you have similar or related text.

For example, after the search for the most correlated vectors, you have data that is related to the incoming question. You then feed this data back into GPT, in the prompt, and ask GPT to answer the question based on all this related data. Then it should spit out an answer grounded in your data; even though GPT was never trained on your data, it is smart enough to absorb the data in the prompt and draw from it.
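To make that concrete, here is a minimal sketch of the whole loop in Python, assuming the openai Python client and a couple of placeholder text chunks (the model names and helper names are just illustrative choices, not anything prescribed above):

import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(text):
    # Embed a piece of text into a unit vector (a list of numbers).
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

# 1. Embed your own documents once and keep the vectors around.
docs = ["First chunk of your internal text ...", "Second chunk of your internal text ..."]
doc_vectors = [embed(d) for d in docs]

# 2. Embed the incoming question the same way.
question = "What does the internal documentation say about X?"
q_vector = embed(question)

# 3. Dot-product the question against every stored vector and keep the best matches.
scores = [float(np.dot(q_vector, v)) for v in doc_vectors]
top_docs = [d for _, d in sorted(zip(scores, docs), reverse=True)[:2]]

# 4. Stuff the matching text into the prompt and ask the model to answer from it.
prompt = ("Answer the question using only the context below.\n\nContext:\n"
          + "\n".join(top_docs) + "\n\nQuestion: " + question)
answer = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(answer.choices[0].message.content)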

For an example notebook, check out THIS!

The link will show you the general concepts. Once you figure that out, it’s then a decision on how you will construct your embedding database. For me, I can create an in-memory data structure for searching, and also have a database to retrieve the actual text. I can do this in a serverless environment with around 400k different embeddings and about 1 second of latency. But there are SaaS offerings for vector databases too, such as Pinecone. They aren’t cheap, but they can be faster than my approach and handle much more data (think billions of embeddings).

Your input will have to be text. So you need to extract the text from your PDF, DOC, and XLS files to get it to work. There are tools that you can use to do this, or you can just copy and paste.
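For example, here is one way to pull the text out of a PDF with the pypdf library (my choice of tool, not something prescribed in this thread; the file name is hypothetical):

from pypdf import PdfReader  # pip install pypdf

reader = PdfReader("your_document.pdf")
pages = [page.extract_text() or "" for page in reader.pages]
full_text = "\n".join(pages)
print(full_text[:500])  # sanity-check the extraction before embedding anything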

Another “hyperparameter” is how big a chunk of data you should embed at once (sentence, paragraph, page, thought, etc.). I don’t think there is a hard and fast rule, and it depends on which GPT you are using (they all have different context window sizes), so that is something to consider and experiment with. But give the embedding a large enough coherent chunk so that when a series of these chunks is stuffed into the prompt, it doesn’t look like an incoherent jumble to the AI.
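As a rough illustration, here is a sketch of paragraph-level chunking with a size cap; the 1,000-character cap is an arbitrary assumption that you would tune to your model’s context window:

def chunk_text(text, max_chars=1000):
    # Split on blank lines, then pack paragraphs into chunks of at most max_chars.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for p in paragraphs:
        if current and len(current) + len(p) + 2 > max_chars:
            chunks.append(current)
            current = p
        else:
            current = current + "\n\n" + p if current else p
    if current:
        chunks.append(current)
    return chunks

# chunks = chunk_text(full_text)  # full_text from the PDF extraction sketch above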

You can even “micro embed”. I do this in the name similarity engine I built for my business (embedding first and last names of people separately and comparing similarity). It all depends on your use-case and what makes sense.
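To illustrate the micro-embedding idea (a sketch of the concept only, not the actual engine described above, reusing the embed() helper from the retrieval sketch earlier in the thread):

import numpy as np

def name_similarity(name_a, name_b):
    # name_a and name_b are (first, last) tuples; each part is embedded separately.
    first_score = float(np.dot(embed(name_a[0]), embed(name_b[0])))
    last_score = float(np.dot(embed(name_a[1]), embed(name_b[1])))
    return (first_score + last_score) / 2  # simple average; the weighting is up to you

print(name_similarity(("Jon", "Smith"), ("John", "Smyth")))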

3 Likes

Holy F*ING ST! My head just exploded. I’m so dense. Is it possible to perform a vector search by simply multiplying the array items? You said this to me in another thread and I totally did not see the use of the dot-product. This is reliable?

2 Likes

@bill.french Haha Bill, yes, that’s all it is. If your embeddings are unit vectors, which they all are from OpenAI, you only need the dot-product, which, sadly, is just pointwise multiplying the array terms and summing the result. The largest number you can get is +1, and the smallest is -1. But due to the non-isotropic nature of OpenAI’s embeddings, you will not see anything much less than +0.7. But that is another story.
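A quick numpy check of those bounds, using a made-up pair of unit vectors:

import numpy as np

a = np.array([0.6, 0.8])   # a unit vector: 0.6**2 + 0.8**2 == 1
b = np.array([0.8, 0.6])
print(np.linalg.norm(a), np.linalg.norm(b))  # both 1.0
print(np.dot(a, b))    # 0.96 -- similar directions score near +1
print(np.dot(a, -a))   # -1.0 -- opposite directions give the minimum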

1 Like

Apologies to @kintela for hijacking the topic.

Okay - this is so much more simplified than I imagined. Given this …

"embedding": [
        -0.006929283495992422,
        -0.005336422007530928,
        ...
        -4.547132266452536e-05,
        -0.024047505110502243
      ],

I multiply …

embedding[0] * embedding[1] * embedding[2] * embedding[3] * embedding[4] *  ...

And that leaves me with a product that I can compare to other embeddings with the same treatment. The smaller the delta, the closer the similarity, right?

… and sum the result.

This part I’m not getting.

You would actually multiply the corresponding values of the two embeddings you are comparing, and then sum it all up,

so it’d be

(first_embedding[0] * second_embedding[0]) + (first_embedding[1] * second_embedding[1]) + […]

1 Like

I see. So, in a sizeable list of comparisons, that’s a lot of crunching.

In your initial example, where you multiplied all the terms together, that is N-1 multiplications, and the actual dot-product between two vectors has N multiplications, so it’s essentially the same amount of computation.

Dot-products are the backbone of AI, and it’s not because of embeddings; it’s because when you multiply two matrices together, you are taking various dot-products of all the rows and columns of the two matrices. And this matrix multiplication is the backbone of AI. The only thing following the matrix multiply is the non-linear activation function, and these are not as computationally expensive as the matrix multiplies (which are lots and lots of dot-products).
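A small numpy illustration of that claim: each entry of a matrix product is just the dot-product of a row with a column.

import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])

C = A @ B  # the matrix multiply
# Entry (0, 0) is the dot-product of A's first row with B's first column:
print(C[0, 0], np.dot(A[0, :], B[:, 0]))  # 58 58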

So if you like AI, then you like dot products! OK, maybe not, but that is basically all that is going on computationally behind the scenes.

Just for reference, and general appreciation of the dot-product, I will take the dot-product of the vector [1, 2, 3] with the vector [4, 5, 6]. The answer is 1*4 + 2*5 + 3*6 = 32. See, was that so hard?

1 Like

Okay, you guys have really opened my eyes. Ergo?

let qAsked;
let qAskedE;

// define the dot product
let dot = (a, b) => a.map((x, i) => a[i] * b[i]).reduce((m, n) => m + n);

// get the trained embedding
let qTrained  = "Does CyberLandr have its own battery?";
let qTrainedE = JSON.parse(getEmbedding_(qTrained)).data[0].embedding;

// ask the question a similar way
qAsked    = "Is there a battery in CyberLandr?";
qAskedE   = JSON.parse(getEmbedding_(qAsked)).data[0].embedding;
Logger.log(dot(qTrainedE, qAskedE)); // 0.9667737555722923

// ask the question a less similar way
qAsked    = "Does CyberLandr get its power from Cybertruck?";
qAskedE   = JSON.parse(getEmbedding_(qAsked)).data[0].embedding;
Logger.log(dot(qTrainedE, qAskedE)); // 0.9177178345509003

I think I’m about to start a love affair with dot products. Thanks to both of you!

So, @kintela, to rejoin the point of your question, we circle back to @wfhbrian’s original response…

I was hip to embeddings before I understood the mechanics of embeddings, and I knew in your case, the most practical and cost-effective way to build your solution was probably with embeddings.

2 Likes

@cliff.rosen Thanks for the clarification questions and thank you @bill.french for the detailed response.

Just to clarify: CustomGPT does not build new LLMs. We take your training data and use embeddings to pass context to the ChatGPT API. This is very similar to how the new Plugins functionality works. If you read the Plugins reference code that OpenAI put up, we do all that (and some more). So it’s basically available for customers in a no-code platform. Our customers come and upload their data and/or websites and build bots from them (behind the scenes, we use embeddings and completions).

1 Like

It sure is. Works like a charm. (And very nicely explained by @curt.kennedy )

Though as you start dealing with large datasets, there are other issues that crop up when calculating the context (like duplicates). And then once you’ve eliminated the duplicates, you will look at the resulting context and say “Huh? Can I make this context better for my use case?” – so that kicks off a “post vector search” optimization initiative.

2 Likes

Ergo, this is where mere mortals realize - “I should have used CustomGPT so my company could be using this three weeks ago.”

Hey guys, following up on this discussion:

I’m actually the maintainer of an open-source API that solves this problem of connecting your data to ChatGPT/GPT-4 or building plugins. GitHub - different-ai/embedbase: A dead-simple API to build LLM-powered apps

Regarding duplicates: embedbase never computes an embedding twice (through some tricks), saving everyone a lot of money.

You can also do semantic search across multiple datasets.

Feel free to give it a try by running it yourself or trying the free hosted version :slight_smile:

3 Likes

To avoid duplicates, I keep track of what I have embedded previously with the HASH of the previously embedded text, and use this as an index into my embedding database.

For example, “mouse” has a SHA3-256 hash of

import hashlib

X = "mouse"
H = hashlib.sha3_256(X.encode())
print(f"HASH: {H.hexdigest()}")
# HASH: 6ca66ca0a1713279fbdc1d0f5568d033d3593004251c3f564186a4e4e284cdeb

Then whenever I embed anything else, I compute the hash and check whether it is already in my embedding database. If it is, I don’t have to embed it again; I just pull the previous embedding vector. You won’t have duplicates if you do this!
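A minimal sketch of that lookup pattern, assuming a plain dict as the embedding store and an embed() helper like the one sketched earlier in the thread:

import hashlib

embedding_db = {}  # hash -> embedding vector (your real store might be a database)

def get_or_create_embedding(text):
    # Return the cached embedding if this exact text was seen before, else embed and cache it.
    key = hashlib.sha3_256(text.encode()).hexdigest()
    if key not in embedding_db:
        embedding_db[key] = embed(text)
    return embedding_db[key]

v1 = get_or_create_embedding("mouse")  # calls the embedding API
v2 = get_or_create_embedding("mouse")  # cache hit: no second API call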

Note that GPT is case sensitive, so “Mouse” is different than “mouse”, and luckily this results in a separate hash too:

import hashlib

X = "Mouse"
H = hashlib.sha3_256(X.encode())
print(f"HASH: {H.hexdigest()}")
# HASH: 4c2e2fe9ae1d56701bea18593b67dc59d862106f959a132c640352780b5d0339

You can go with lower-case hashes too, but realize GPT “sees” that “Mouse” is different than “mouse”.

Note: SHA3-256 is probably overkill, but that’s what I use these days.

Oh, and to be clear, this is only on the database/lookup side. In my case, for search, I scan the database to create an in-memory data structure that is a Python dict, where the keys are the hashes and the values are the numpy versions of the embedding vectors. This is then saved as a pickle to S3 and loaded into memory when I am ready to search. So you will periodically update this file as your embedded data (knowledge) changes over time.
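A rough sketch of that pattern; the bucket and key names are made up, and boto3 is just one way to push the pickle to S3:

import pickle
import numpy as np
import boto3

# embedding_db is the hash -> vector store from the previous sketch.
search_index = {h: np.asarray(vec) for h, vec in embedding_db.items()}

# Persist the in-memory structure so a serverless worker can load it at startup.
s3 = boto3.client("s3")
s3.put_object(Bucket="my-embeddings-bucket", Key="search_index.pkl",
              Body=pickle.dumps(search_index))

# Later, when ready to search:
obj = s3.get_object(Bucket="my-embeddings-bucket", Key="search_index.pkl")
search_index = pickle.loads(obj["Body"].read())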

1 Like

I’ve begun to realize that to do AI projects really well, with excellent efficiency, and financially practical – everything we learned through the ages of computing is suddenly more relevant than ever before.

2 Likes

@louis030195 Love the open source project – nicely done!

The duplicate removal method you mentioned seems to remove duplicate “exact match” strings, right? This scenario almost never happens, because due to the chunking, the strings will be off a little and will not be exact duplicates. You can try it out with web content pages and you will see what I mean. What’s needed is “semantic duplicate” removal.

We tried that as the first approach (an MD5 hash), and it worked only in the case of exact-match duplicates (see above). The problem is that when chunking web pages, the chunks will always be off by 1-2 characters, and then you get near-duplicates like “Copyright 2023 : Curt Kennedy”. So semantic duplicate removal is needed. But this approach with the hash works great as a quick spot fix.
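For what it’s worth, here is a hedged sketch of semantic de-duplication: drop any chunk whose embedding is too close to one already kept. The 0.95 threshold is an arbitrary assumption to tune for your data:

import numpy as np

def dedupe_semantically(chunks, vectors, threshold=0.95):
    # Keep a chunk only if its embedding is not too similar to any already-kept chunk.
    kept_chunks, kept_vectors = [], []
    for chunk, vec in zip(chunks, vectors):
        if all(np.dot(vec, kv) < threshold for kv in kept_vectors):
            kept_chunks.append(chunk)
            kept_vectors.append(vec)
    return kept_chunks

# deduped = dedupe_semantically(retrieved_chunks, retrieved_vectors)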

1 Like