How to Add Knowledge Base in API

Hello guys, if I wanted to add a knowledge base to the GPT API, how would I go about doing that? I looked online and people were saying fine-tuning is not meant for adding knowledge.

People are mentioning RAG but I’m not sure what that entails besides being the acronym for retrieval augmented generation, and I’m wondering if there’s a built-in way of doing it in the GPT API.

Surely adding a custom knowledge base in the API should not be a unique ask but I haven’t seen any good answers online.

I also checked the API documentation and saw something called the Assistants API, but that is for creating fine-tuned models, right?

Thank you for your help


Have you searched either OpenAI Cookbook or Langchain?

These were just the first results from each; there may be more, and some may be better suited to your specific need.


I believe what you’re looking for is embeddings that you then use with RAG. You can use embeddings to create vectors of your knowledge base, then use cosine similarity to find if the prompt vector is near to any knowledge base vectors. If it finds any matches, it’s supposed to use that information instead of the base model’s information.

This isn’t really expanding the knowledge base, instead it’s searching your knowledge base for a relevant answer each time.
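To make the idea concrete, here is a minimal sketch of the similarity step in NumPy. The vectors here are toy values; in a real setup each row would be an embedding of a knowledge-base chunk returned by an embeddings model.

```python
import numpy as np

def cosine_similarity(query, matrix):
    """Cosine similarity between a query vector and each row of a matrix."""
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return m @ q

# Toy "knowledge base" of 3 embedding vectors plus a query vector.
kb_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query = np.array([1.0, 0.1])

scores = cosine_similarity(query, kb_vectors)
best = int(np.argmax(scores))  # index of the most similar chunk
```

The chunk of text behind `best` is what you would paste into the prompt as context.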

This tutorial gives a pretty good overview of embeddings. I’m still new to embeddings and RAG myself so hopefully someone can add to/verify this information.

Thanks for the response, I’m wondering if there is a built-in way to do this inside the GPT API itself.

Looking at this article under the GPT cookbook link you sent - assistants_api_overview_python (I can't include links) - it seems like the Assistants API can use files? Is that right? Would that be the built-in feature I'm looking for? Or is there something I'm missing?

Do you know if there is a way to do this without embeddings? Like having the API access a file in your directory to use as a knowledge base? I'm trying to streamline it as much as possible.

Unfortunately there isn’t a way to do that yet that I know of. The Assistants API can use external documents, however, from OpenAI’s documentation:

Retrieval augments the Assistant with knowledge from outside its model, such as proprietary product information or documents provided by your users. Once a file is uploaded and passed to the Assistant, OpenAI will automatically chunk your documents, index and store the embeddings, and implement vector search to retrieve relevant content to answer user queries.

So the Assistants API uses embeddings too.
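For reference, the upload-and-attach flow the documentation describes looks roughly like this with the official Python client. This is a sketch based on the Assistants API as documented at the time of writing; the model name and parameter names are assumptions and may change between API versions.

```python
def create_assistant_with_file(path):
    """Upload a file and attach it to an Assistant with the retrieval tool.

    Assumes the `openai` Python package is installed and OPENAI_API_KEY is
    set in the environment.
    """
    from openai import OpenAI

    client = OpenAI()
    uploaded = client.files.create(file=open(path, "rb"), purpose="assistants")
    assistant = client.beta.assistants.create(
        model="gpt-4-1106-preview",  # assumed model name
        instructions="Answer using the attached knowledge base where relevant.",
        tools=[{"type": "retrieval"}],
        file_ids=[uploaded.id],
    )
    return assistant
```

OpenAI then does the chunking, embedding, and vector search behind the scenes, as the quoted documentation says.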

I am also confused about this question. We need GPT to truly learn the new knowledge, not just retrieve it.

The GPTs are more of a quick RAG for ChatGPT users.

The Assistants are more of a quick RAG for API users.

But RAG is a general concept and can be used with any retrieval strategy (dense/sparse ≈ embeddings/keywords) and any LLM.

So what skill level are you shooting for, and how much sweat equity are you willing to put in here? This will guide where to go next.


GPT4 Tutorial: How to chat with multiple pdf files - The Chat Completion Process (R.A.G. / Embeddings)

There are a few ways you can approach this. I will say it depends on the length of the knowledge base.

If it fits comfortably within the context window, then you might want to use either custom instructions or regular prompting.

If it’s longer than that you might want to use the Assistant feature and upload your knowledge base as a file instead.

Direct Prompting

Here’s how you can potentially add it in as a prompt message before your question:

I'll be providing you base knowledge for this task, please use this to complete the task after.
... your data goes here
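With the official Python client, that direct-prompting approach can be sketched like this. The `knowledge_base` text and model name are placeholders; the knowledge must fit within the model's context window, and the actual API call (shown commented out) assumes `pip install openai` and an `OPENAI_API_KEY` in the environment.

```python
knowledge_base = "Acme's return policy: items may be returned within 30 days."
question = "What is the return window for Acme products?"

messages = [
    {"role": "system",
     "content": "I'll be providing you base knowledge for this task, "
                "please use this to complete the task after.\n" + knowledge_base},
    {"role": "user", "content": question},
]

# With the official client:
# from openai import OpenAI
# client = OpenAI()
# response = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
# print(response.choices[0].message.content)
```

The model then answers from the supplied text rather than only from its training data.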

Custom Instructions in ChatGPT

If you do see yourself using this knowledge base frequently, add it to the custom instructions.

Longer & Customizable Custom Instructions on Wielded

If it doesn't fit into that space, you might want to check out alternative ChatGPT clients that have a switchable, larger custom-instruction space, such as Wielded.

Here is an interesting example where I stuffed my entire data model, backend architecture, and code conventions into a persona (which is a set of custom instructions) and got it to write backend code for me.


TLDR: There is no easy way other than using Assistants or GPTs. These are the closest to a no-code or little code solution you’ll find without paying a third party a bunch of money.

I would say I'm a complete beginner right now - I haven't even started writing the code yet. I'm in the phase of determining feasibility. Honestly, I have no issue putting in a lot of sweat equity and learning how to implement RAG myself.

That being said, I’m guessing then I should be using the assistant API for my purposes?

Is there any difference in using this versus coding out a RAG approach myself? Any difference in quality/accuracy?

Yes, I would start with the Assistants API, since you are willing to implement RAG yourself eventually.

The main differences between Assistants and your own RAG would be control and cost.

If you implement RAG yourself, using your own local or cloud resources for computing the correlations and pulling the text from a database, you can do this very cheaply and control how many tokens (history/context) are sent per API call.

Whereas the current Assistants API only has truncation as a strategy and offers little control: because of the large buffer size available, the conversation just adds and adds tokens, eventually becoming expensive unless you do something (start a new thread?). But I expect this to improve over time, so don't think of it as a long-term deterrent - it will be felt in the short term, though, until this is parametrized via the API.

A larger deterrent from using the Assistants API is the limitation on how much content you can store, search, and retrieve. So if you start heading into large amounts of content, you would want to move to your own RAG, since data and compute are readily available and cheap.

But watch out for expensive vector database vendors; they can add up and are not necessary if you are looking at fewer than, say, 1 million vectors, queried on a sporadic basis.

In this case, it's better to code this by hand using a basic linear search in Python with NumPy. And you can always split the vectors into chunks and run them in parallel to hit your latency requirements.
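A minimal sketch of that brute-force approach, using random vectors in place of real embeddings (the sizes and chunk count here are illustrative assumptions, not a recommendation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in knowledge base: 50,000 embedding vectors of dimension 128.
# In practice these come from an embeddings model and live in a binary file.
kb = rng.standard_normal((50_000, 128)).astype(np.float32)
kb /= np.linalg.norm(kb, axis=1, keepdims=True)  # pre-normalize once

def top_k(query, vectors, k=5, chunk_size=10_000):
    """Brute-force cosine search, scanning the matrix in chunks.

    Each chunk's dot products are independent, so chunks can also be
    handed to separate workers (e.g. Lambda functions) in parallel.
    """
    q = query / np.linalg.norm(query)
    scores = np.concatenate(
        [vectors[i:i + chunk_size] @ q for i in range(0, len(vectors), chunk_size)]
    )
    idx = np.argpartition(scores, -k)[-k:]      # top-k indices, unordered
    return idx[np.argsort(scores[idx])[::-1]]   # sorted best-first

query = rng.standard_normal(128).astype(np.float32)
hits = top_k(query, kb)  # indices of the 5 most similar chunks
```

Because the vectors are pre-normalized, the matrix product is exactly cosine similarity, and splitting the file of vectors splits the work cleanly across workers.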

Currently I can do 400,000 vectors per second per worker in AWS Lambda. And this can scale to more workers just by splitting the binary file containing the embedding vectors.