Can this API be used to query internal data?

Hi

I would like users in my company to be able to consult, through a chatbot and in natural language, the thousands of PDF, DOC, and XLS files (offers) that we have generated over more than 30 years, and I don't know if this is possible.

I don't know whether I have to train or build my own model, or whether I have to hand all this data over to OpenAI so that it can be incorporated into theirs.

I would appreciate any guidance on this.

Thanks

2 Likes

Hi @kintela

The most likely way that you’ll want to achieve this is with embeddings.

OpenAI has its own embeddings models, which I recommend using for its ease of use.

It seems like you want to provide a conversational interface, like ChatGPT, to your users. The way this works is by querying your files using their embeddings and then providing the retrieved results to the conversational model before it answers the user's question.


[Diagram: embeddings-based retrieval workflow (image source)]
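To make that concrete, here is a minimal sketch of the query-time flow in JavaScript (Node 18+ for the built-in fetch). The in-memory documents array, the helper names, the prompt wording, and the model choices are just illustrative placeholders; only the two OpenAI endpoints are the real API:

// Minimal retrieval-augmented Q&A sketch (Node 18+, plain JavaScript).
// Assumes you have already embedded your document chunks and stored
// { text, embedding } records somewhere -- here just an in-memory array.
const OPENAI_KEY = process.env.OPENAI_API_KEY;

async function embed(text) {
  const res = await fetch("https://api.openai.com/v1/embeddings", {
    method: "POST",
    headers: { "Content-Type": "application/json", "Authorization": `Bearer ${OPENAI_KEY}` },
    body: JSON.stringify({ model: "text-embedding-ada-002", input: text }),
  });
  return (await res.json()).data[0].embedding;
}

// dot product of two equal-length vectors
const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0);

async function answer(question, documents) {
  const qEmbedding = await embed(question);

  // rank stored chunks by similarity and keep the best few as context
  const context = documents
    .map(d => ({ text: d.text, score: dot(qEmbedding, d.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, 3)
    .map(d => d.text)
    .join("\n---\n");

  // hand the retrieved context to the conversational model before it answers
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json", "Authorization": `Bearer ${OPENAI_KEY}` },
    body: JSON.stringify({
      model: "gpt-3.5-turbo",
      messages: [
        { role: "system", content: "Answer using only the provided context." },
        { role: "user", content: `Context:\n${context}\n\nQuestion: ${question}` },
      ],
    }),
  });
  return (await res.json()).choices[0].message.content;
}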

6 Likes

Hi @bill.french,

Perhaps you can clarify something for me about CustomGPT. My understanding is that current methods do not allow for foundational models to absorb new knowledge without issues like catastrophic forgetting. As such, I don’t see how CustomGPT could create an LLM containing custom knowledge without redoing the pre-training process, which would surely be cost prohibitive. Yet, I believe you are suggesting this is exactly what they do, and their website seems to support that idea as well.

Would much appreciate if you could shed some light on this. Just what does CustomGPT do in the way of creating custom models with new data?

Thanks,

Cliff

I’ll do my best. I’m not the creator, but I have had good results with it.

At the outset, I’m not certain this is accurate, but I defer to some experts here, like @curt.kennedy, @alden, and @RonaldGRuckus, for this deeper question. However, I think you are looking at it from a different polarity - it’s not that this approach is fine-tuning the underlying core GPT model; rather, [I think] it’s drawing upon it to create more intelligence where it is not explicitly stated in the training corpus.

I built this approach to automate and compress the time it takes for anyone to build and train a model focused on a specific topic.

I think the above passage is where I intimated your takeaway and misled you. I’m using “build and train a model” in a generic sense. When CustomGPT is finished processing, it hands me what I perceive to be a “model” capable of conducting Q&A sessions with relatively good success. Is it embedding-based? Probably. Is it a hosted model and solution - yeah. My experience is very early with this tool, but @alden can probably chime in with more insights.

I was sceptical as well. So here’s an example.

I have a training corpus that you can peruse that does not mention surfboards or bears. It was used to create the CustomGPT solution mentioned above and contains about 100 FAQs about CyberLandr.

Surprisingly, when I test the solution with these questions, it does very well at answering predicted questions and also some known questions our customers have asked. I specifically focused on these examples to demonstrate cases that didn't seem likely but actually worked. The only way I think these answers are possible is that there is a blending of the LLM's knowledge, the context of what it knows about CyberLandr (which isn't much, BTW), and the prompt itself.

How this actually works under the covers is beyond my knowledge, but it seems to do a pretty good job of simplifying the process and standing up a solution that performs mostly as hoped.

How did this “model” know that bears and food create big risks? How did it know that aftermarket products for Cybertruck are in the works? I have to assume it’s the LLM that is “collaborating” to produce helpful answers.

Thoughts?


2 Likes

There is a library called LangChain that is doing some really interesting things with retrieval and embeddings.

1 Like

Thank you. I would not call myself an expert though. I hope anyone who reads my posts critically evaluates them as if I was just a person on the street.

I think you have clarified it very well with your post. Context and information delivery is very effective, and it clearly shows that GPT-3 and GPT-4 are more than capable of absorbing the supplied information and treating it as knowledge.

Unless I’m mistaken, CustomGPT isn’t creating an LLM, they are building on the models available through the OpenAI API.

But, you are really good at explaining some of the technical aspects. There’s lots of proof in your body of commentary.

I also assume this is the case for a couple of reasons.

  1. The processing time it takes before the “model” is ready to test is very brief.
  2. The training corpus is a simple PDF document format - nothing special; no JSONL, etc.

This is partly what makes it so compelling for solutions that are very focused, and where the content-building process is relatively simple to automate. Mileage may vary, of course, but so far…so good.

1 Like

@bill.french and @RonaldGRuckus, thank you for the response. Here is my understanding of things:

  1. It is not possible to add knowledge to any of the models OpenAI provides through their API. Their API allows for fine-tuning certain models, but the fine-tuning only conditions the model for a certain pattern of output; it does not actually add new knowledge to the model.

  2. Moreover, I am not aware of any fine-tuning method that allows you to reliably add knowledge to any models.

  3. My guess is that CustomGPT is actually doing retrieval-augmented generation, where the information-retrieval (IR) step is performed using a vector database that stores embeddings of external data. This is the approach described by @wfhbrian above, and it is pretty much how every service like CustomGPT provides this sort of functionality.

1 Like

That seems logical to me.

Indeed, however, it seems that with the right mix of these principles, we can mimic behaviours that users perceive as doing exactly that. Is that a valid way to explain what I’m experiencing?

Yes, precisely. Using retrieval augmented generation, it is possible to give the illusion that the model has gained new knowledge.

@kintela You definitely need to use embeddings. The way it works is that the embedding maps your text data to a vector (a list of numbers). Then a new piece of data comes in, say a question, and it gets embedded too. This gets compared to all the information you have embedded, one vector at a time (just multiply the corresponding numbers together and sum the results … this is called a dot-product). Then you have your top matching vectors, and you retrieve the corresponding text behind them. Now you have similar or related text.

For example, after the search for the most correlated vectors, you have data that is related to the incoming question. You then feed this data back into GPT, in the prompt, and ask GPT to answer the question based on all this related data. It should then spit out an answer grounded in your data; even though GPT was never trained on your data, it is smart enough to absorb the data supplied in the prompt and draw from it.
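To see the "compare against everything and take the top matches" step in code, here is a toy sketch; the three-number vectors and chunk labels are made up purely for illustration (real OpenAI embeddings have 1,536 dimensions):

// Toy similarity search: rank stored chunks by dot-product against a query vector.
const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0);

// Pretend these are embeddings of your document chunks.
const store = [
  { text: "Chunk about pricing",  embedding: [0.9, 0.1, 0.4] },
  { text: "Chunk about delivery", embedding: [0.2, 0.8, 0.5] },
  { text: "Chunk about warranty", embedding: [0.4, 0.4, 0.8] },
];

const queryEmbedding = [0.85, 0.15, 0.5]; // embedding of the incoming question

const topMatches = store
  .map(item => ({ text: item.text, score: dot(queryEmbedding, item.embedding) }))
  .sort((a, b) => b.score - a.score)
  .slice(0, 2); // keep the best two and stuff their text into the prompt

console.log(topMatches); // pricing chunk first (0.98), then warranty (0.80)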

For an example notebook, check out THIS!

The link will show you the general concepts. Once you figure that out, it's then a decision on how you will construct your embedding database. For me, I can create an in-memory data structure for searching, and also have a database to retrieve the actual text. I can do this in a serverless environment with around 400k different embeddings at about 1 second of latency. But there are SaaS offerings for vector databases too, such as Pinecone. They aren't cheap, but they can be faster than my approach and handle much more data (think billions of embeddings).

Your input will have to be text, so you need to extract the text from your PDF, DOC, and XLS files to get this to work. There are tools you can use to do this, or you can just copy and paste.

Another “hyperparameter” is how big the chunks of data you embed at once should be (sentence, paragraph, page, thought, etc.). I don't think there is a hard and fast rule, and it depends on which GPT model you are using (they all have different context window sizes), so that is something to consider and experiment with. But give the embedding a large enough chunk that it is coherent, so that when a series of these chunks is stuffed into the prompt, it doesn't look like an incoherent, jumbled mess to the AI.
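As a very rough illustration of that chunking decision, here is one naive way to cut extracted text into fixed-size, slightly overlapping pieces before embedding; the 1,000-character size and 200-character overlap are arbitrary starting points, not recommendations:

// Naive fixed-size chunker with overlap, so a sentence cut at one boundary
// still appears intact in the neighbouring chunk.
function chunkText(text, chunkSize = 1000, overlap = 200) {
  const chunks = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
  }
  return chunks;
}

// Each chunk would then be embedded and stored alongside its source document.
const sample = "First paragraph of a 30-year-old offer document. ".repeat(100);
console.log(chunkText(sample).length); // number of chunks to embed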

You can even “micro embed”. I do this in a name-similarity engine I built for my business (embedding first and last names of people separately and comparing similarity). It all depends on your use case and what makes sense.

3 Likes

Holy F*ING ST! My head just exploded. I’m so dense. Is it possible to perform a vector search by simply multiplying the array items? You said this to me in another thread and I totally did not see the use of the dot-product. This is reliable?

2 Likes

@bill.french Haha Bill, yes, that's all it is. If your embeddings are unit vectors, which they all are from OpenAI, you only need to use the dot-product, which, sadly, is just pointwise multiplying the array terms and summing the result. The largest number you can get is +1, and the smallest is -1. But due to the non-isotropic nature of OpenAI's embeddings, you will not see anything much less than +0.7. But that is another story.
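To make the unit-vector point concrete: for length-1 vectors, the dot-product is exactly the cosine similarity, so it is bounded between -1 and +1. A tiny made-up example:

const dot = (a, b) => a.reduce((s, x, i) => s + x * b[i], 0);
const norm = v => Math.sqrt(dot(v, v));
const normalize = v => v.map(x => x / norm(v));

const a = normalize([3, 4]);              // [0.6, 0.8]
const b = normalize([4, 3]);              // [0.8, 0.6]

console.log(dot(a, a));                   // ~1.0  (identical direction)
console.log(dot(a, b));                   // ~0.96 (similar direction)
console.log(dot(a, normalize([-3, -4]))); // ~-1.0 (opposite direction)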

1 Like

Apologies to @kintela for hijacking the topic.

Okay - this is so much more simplified than I imagined. Given this …

"embedding": [
        -0.006929283495992422,
        -0.005336422007530928,
        ...
        -4.547132266452536e-05,
        -0.024047505110502243
      ],

I multiply …

embedding[0] * embedding[1] * embedding[2] * embedding[3] * embedding[4] *  ...

And that leaves me with a product that I can compare to other embeddings with the same treatment. The smaller the delta, the closer the similarity, right?

… and sum the result.

This part I’m not getting.

You would actually multiply each pair of corresponding values in the two embeddings you are comparing, and then sum it all up.

so it’d be

(first_embedding[0] * second_embedding[0]) + (first_embedding[1] * second_embedding[1]) + […]

1 Like

I see. So, in a sizeable list of comparisons, that’s a lot of crunching.

In your initial example, where you multiplied all the terms together, that is N-1 multiplications, and the actual dot-product between two vectors has N multiplications, so it's essentially the same amount of work.

Dot-products are the backbone of AI, and it's not because of embeddings; it's because when you multiply two matrices together, you are taking various dot-products of all the rows and columns of the two matrices. And this matrix multiplication is the backbone of AI. The only thing following the matrix multiply is the non-linear activation function, and those are not as computationally expensive as the matrix multiplies (which are lots and lots of dot-products).

So if you like AI, then you like dot products! OK, maybe not, but that is basically all that is going on computationally behind the scenes.

Just for reference, and general appreciation of the dot-product, I will take the dot-product of the vector [1, 2, 3] with the vector [4, 5, 6]. The answer is 1*4 + 2*5 + 3*6 = 32. See, was that so hard?
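And to tie that back to the matrix-multiplication point above, here is a tiny sketch (toy numbers, plain JavaScript) showing that a matrix multiply is nothing but dot-products of rows with columns:

const dot = (a, b) => a.reduce((s, x, i) => s + x * b[i], 0);

// Multiply A (m x n) by B (n x p) by taking the dot-product of each row of A
// with each column of B.
function matmul(A, B) {
  const columns = B[0].map((_, j) => B.map(row => row[j])); // the columns of B
  return A.map(row => columns.map(col => dot(row, col)));
}

console.log(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]));
// [[19, 22], [43, 50]] -- e.g. 19 = dot([1, 2], [5, 7])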

1 Like

Okay, you guys have really opened my eyes. Ergo?

let qAsked;
let qAskedE;

// define the dot product
let dot = (a, b) => a.map((x, i) => x * b[i]).reduce((m, n) => m + n);

// get the trained embedding
let qTrained  = "Does CyberLandr have its own battery?";
let qTrainedE = JSON.parse(getEmbedding_(qTrained)).data[0].embedding;

// ask the question a similar way
qAsked    = "Is there a battery in CyberLandr?";
qAskedE   = JSON.parse(getEmbedding_(qAsked)).data[0].embedding;
Logger.log(dot(qTrainedE, qAskedE)); // 0.9667737555722923

// ask the question a less similar way
qAsked    = "Does CyberLandr get its power from Cybertruck?";
qAskedE   = JSON.parse(getEmbedding_(qAsked)).data[0].embedding;
Logger.log(dot(qTrainedE, qAskedE)); // 0.9177178345509003

I think I’m about to start a love affair with dot products. Thanks to both of you!

So, @kintela, to rejoin the point of your question, we circle back to @wfhbrian’s original response…

I was hip to embeddings before I understood their mechanics, and I knew that in your case the most practical and cost-effective way to build your solution was probably with embeddings.

2 Likes

@cliff.rosen Thanks for the clarification questions and thank you @bill.french for the detailed response.

Just to clarify: CustomGPT does not build new LLMs. We take your training data and use embeddings to pass context to the ChatGPT API. This is very similar to how the new Plugins functionality works; if you read the plugins reference code that OpenAI put up, we do all of that (and some more). So it's basically available to customers as a no-code platform: our customers upload their data and/or websites and build bots from them (behind the scenes, we use embeddings and completions).

1 Like

It sure is. Works like a charm. (And very nicely explained by @curt.kennedy )

Though as you start dealing with large datasets, there are other issues that crop up when calculating the context (like duplicates). And then once you’ve eliminated the duplicates, you will look at the resulting context and say “Huh? Can I make this context better for my use case?” – so that kicks off a “post vector search” optimization initiative.
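For what it's worth, one crude way to knock out near-duplicates before building the context is to drop any retrieved chunk whose embedding is almost identical to one already kept; the 0.95 threshold below is an arbitrary illustration, not a number from this thread:

const dot = (a, b) => a.reduce((s, x, i) => s + x * b[i], 0);

// Greedy de-duplication of similarity-ranked results: keep a chunk only if it
// is not too similar to a chunk we have already kept.
function dedupe(rankedChunks, threshold = 0.95) {
  const kept = [];
  for (const chunk of rankedChunks) {
    const isDuplicate = kept.some(k => dot(k.embedding, chunk.embedding) > threshold);
    if (!isDuplicate) kept.push(chunk);
  }
  return kept;
}

// Whatever survives then goes into the prompt as before, e.g.
// const context = dedupe(topMatches).map(c => c.text).join("\n---\n");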

2 Likes