Teaching OpenAI API with new data

I am using the OpenAI API to help me search through a large amount of data (text). I want a response generated based ONLY on the data that I am passing into the system parameter. Does anyone have suggestions on how to optimize this? Do I have to pass in the data each time? By the way, I have the data in a MySQL DB. I am concatenating the data in the columns into one LARGE string and passing it in with additional instructions saying "Only answer based on this information…". Anyone have any solutions?

Usually you correlate the input with data in your system, then feed only this correlated data to the System/Input for the LLM to answer from.

This is called RAG, or Retrieval-Augmented Generation.

Since you are already using a DB, I'm thinking you are a serious user. Please consider function calling, which I used for my application.

@curt.kennedy Can you please elaborate? I’m just a computer science student.

I’ve seen others use LangChain. Would you say this source is reliable?

Sure, I would start by understanding this classic Cookbook from OpenAI.

But in general, you use embeddings (from AI models), keywords, or both combined to "retrieve" relevant text chunks, and present them for the LLM to use as information to answer the input query, along with the previous back-and-forth history with the user.
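To make that flow concrete, here is a minimal sketch of assembling the chat payload once chunks have been retrieved. The function name, instruction wording, and sample chunk are invented for illustration; the retrieval step itself is covered below.

```python
def build_messages(query, retrieved_chunks, history=None):
    """Assemble a chat payload that restricts the model to the retrieved text."""
    context = "\n\n".join(retrieved_chunks)
    system = (
        "Answer ONLY from the information below. "
        "If the answer is not present, say you don't know.\n\n" + context
    )
    messages = [{"role": "system", "content": system}]
    messages.extend(history or [])  # prior back-and-forth with the user, if any
    messages.append({"role": "user", "content": query})
    return messages

msgs = build_messages("When was the invoice paid?",
                      ["Invoice #12 was paid on 2024-03-01."])
```

The point is that only the *retrieved* chunks go into the system message, instead of the whole concatenated database.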

Embeddings are vectors produced by certain AI models. Each chunk of text gets a vector assigned; this vector is a set of numbers that gives the chunk of text a numerical meaning.

Keywords are weighted based on some information-theoretic rarity index (not necessarily AI based). Rare words carry more information than common words, and the search is optimized by intersecting the rare words in the user query with the rare words in your text chunks.
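A toy sketch of that rarity idea, using a simple inverse-document-frequency weight (the sample chunks are made up; production systems use tuned schemes like BM25 rather than this bare IDF):

```python
import math
from collections import Counter

def idf_weights(chunks):
    """Rarity index: words appearing in fewer chunks get higher weight."""
    n = len(chunks)
    df = Counter()
    for chunk in chunks:
        df.update(set(chunk.lower().split()))
    return {w: math.log(n / df[w]) for w in df}

def keyword_score(query, chunk, weights):
    """Sum the rarity weights of words shared by the query and the chunk."""
    shared = set(query.lower().split()) & set(chunk.lower().split())
    return sum(weights.get(w, 0.0) for w in shared)

chunks = ["the cat sat on the mat",
          "quarterly revenue grew by seven percent",
          "the dog chased the cat"]
w = idf_weights(chunks)
scores = [keyword_score("revenue grew", c, w) for c in chunks]
```

Common words like "the" appear in most chunks, so their weight approaches zero and they barely affect the match.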

Finding close vectors (close semantic meaning) is achieved with dot products (cosine similarity), since the vectors are usually unit vectors. If not, you would use the normal Euclidean distance. And in a pinch, with billions or a ridiculous number of records, you use a multiply-free Manhattan metric. Beyond this, there are libraries built for fast vector search, like FAISS.
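Those measures are a few lines each in plain Python. The 2-D unit vectors below are illustrative; real embeddings have hundreds or thousands of dimensions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity; reduces to the dot product for unit vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def manhattan(a, b):
    """Multiply-free distance for huge collections; smaller = closer."""
    return sum(abs(x - y) for x, y in zip(a, b))

u = [0.6, 0.8]   # unit vector
v = [0.8, 0.6]   # unit vector
# For unit vectors, cosine similarity and the raw dot product agree:
assert abs(cosine(u, v) - dot(u, v)) < 1e-9
```

This is why stores of pre-normalized embeddings can skip the division entirely and rank by dot product alone.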

These vector correlations are done in memory, for speed, and the actual text chunks are looked up in a database. Scanning a database is expensive. Vector correlations are just multiplies and adds on the vectors, nothing crazy. So the straightforward argmax vector search can be done in memory with multiplies and adds in a single for-loop. So once you get your winning vector and associated hash, you can also get the text chunk in your database.
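A minimal sketch of that in-memory argmax search, with a plain dict standing in for the MySQL lookup (the hashes, vectors, and chunk texts are all invented):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

vectors = {                  # hash -> unit embedding vector, kept in memory
    "h1": [1.0, 0.0],
    "h2": [0.0, 1.0],
    "h3": [0.707, 0.707],
}
text_db = {                  # hash -> chunk text (the "dumb lookup" store)
    "h1": "chunk about cats",
    "h2": "chunk about revenue",
    "h3": "chunk about both",
}

def search(query_vec):
    """Single for-loop argmax over dot products, then one DB lookup."""
    best_hash, best_score = None, float("-inf")
    for h, v in vectors.items():
        score = dot(query_vec, v)
        if score > best_score:
            best_hash, best_score = h, score
    return text_db[best_hash]

result = search([0.0, 1.0])
```

Only the winning hash ever touches the database, so the expensive store is hit exactly once per query.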

But if you have enough memory, you can put all the vectors and the text into memory and skip the database. This is probably the way to go if you're just starting out. Why not do this out of the gate? Well, it's frowned upon because you want to preserve all your high-speed memory for search (vector correlations) and let the database do the "dumb lookup" duty of fetching the actual text chunk. So best practice, basically, says to separate the two.

Once you get embeddings implemented, and decide to use keywords, you can fuse both result sets together using Reciprocal Rank Fusion or the newer density-preserving Relative Score Fusion (ref).
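Reciprocal Rank Fusion itself is only a few lines. A sketch, assuming two ranked hit lists; the document IDs are made up, and k = 60 is just the commonly used constant from the original RRF formulation:

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score(doc) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

embedding_hits = ["doc_a", "doc_b", "doc_c"]   # ranked by vector similarity
keyword_hits   = ["doc_b", "doc_d", "doc_a"]   # ranked by keyword rarity
fused = rrf([embedding_hits, keyword_hits])
```

Because it only uses ranks, RRF sidesteps the problem that embedding scores and keyword scores live on incomparable scales.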

Beyond this, you can have the LLM spawn new queries from the initial query. You then take all these queries (human and synthetic), run them all through the RAG pipeline, rank everything, and fuse the results into one optimal stream that retrieves the most relevant and insightful information the LLM has discovered, beyond the user's query and including the user's query. You can weight each view/perspective separately in the fusion to influence what is more important. Is it the user, or some other set of things the LLM has discovered with its synthetic queries? It's up to you.

You won’t learn much by using Langchain, but you can use it and see what it does under the hood since it is open source. Maybe get some ideas that you are missing, and develop from there.

IMO, Langchain is overkill, not needed, and creates a veil of mystery if that's all you know and rely on. So I would recommend Langchain for those who just want to get something going without learning the details. But it's also good for those who want to learn the details, and what RAG systems can do, after staring at the code. It's a double-edged sword.