Replicate langchain custom data query with openai python API?

A YouTube video from TechLead called "Using ChatGPT with YOUR OWN Data. This is magical" (I cannot embed links) outlines some steps for using LangChain to embed custom data for use in a query. He calls VectorStoreIndexCreator() on his text data, then combines the return value from that with an LLM of his choosing and a query. My question is: how do you do this without LangChain, using just the openai Python API?

I had been assuming that VectorStoreIndexCreator() creates embeddings, but I have no idea how one would combine a set of embeddings with an LLM query.


You basically store your embedding vectors, along with the text embedded, in your own database.
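
If it helps, here is a minimal sketch of that step using the openai Python package (v1.x client style); the model name and the plain Python list standing in for "your own database" are just placeholders:

```python
# Minimal sketch: embed each record and keep the vector next to the original text.
# Assumes the openai package's v1.x client; the embedding model name is illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

records = [
    {"id": 0, "text": "Microsoft announced new AI features for Azure..."},
    {"id": 1, "text": "Apple reported quarterly results..."},
]

for rec in records:
    resp = client.embeddings.create(
        model="text-embedding-3-small",  # any embedding model you have access to
        input=rec["text"],
    )
    rec["vector"] = resp.data[0].embedding  # list of floats stored alongside the text
```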

Then when something new comes in that you want to match against your database, you take the dot product of the new vector with each stored vector (so dot(NewVector, {CollectionInDatabaseVectors})).

This correlation is done in memory, not as a scan over the database, for speed reasons.

Then you retrieve the topK hashes (record IDs), and use those to look the text back up in your database.
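
Here is a rough sketch of that in-memory correlation and topK step, assuming your stored vectors have been loaded into a NumPy matrix (OpenAI embeddings are normalized to unit length, so the dot product is the cosine similarity):

```python
# Rough sketch of the in-memory correlation: one matrix-vector product gives the
# dot product with every stored vector; argsort gives the topK; the ids map back
# to text rows in your database.
import numpy as np

def top_k_ids(query_vec, db_vectors, db_ids, k=5):
    """db_vectors: (N, D) array of stored embeddings; db_ids: length-N list of record ids."""
    sims = db_vectors @ np.asarray(query_vec)  # similarity score against every stored vector
    best = np.argsort(-sims)[:k]               # indices of the k highest scores
    return [db_ids[i] for i in best]
```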

Then with this retrieved text, you form your prompt programmatically, and send it to the API for the LLM to respond.
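
And a sketch of that final step: stitch the retrieved text into a prompt and send it to the chat completions endpoint (model name and prompt wording are just illustrative):

```python
# Sketch of the final step: build the prompt around the retrieved text and call
# the chat completions endpoint. Model name and prompt wording are placeholders.
from openai import OpenAI

client = OpenAI()

def answer_with_context(question, retrieved_texts):
    context = "\n\n".join(retrieved_texts)
    messages = [
        {"role": "system", "content": "Use the provided context to answer the user's question."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return resp.choices[0].message.content
```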

That’s it!

If your data is small (~thousands of embeddings), you can do the whole thing in memory.

If your data is medium (~millions of embeddings), you search in memory, and refer to the data in a database.

If your data is large (~billions of embeddings), you might want to look into AI based search algorithms, or use services like Weaviate or Pinecone.

If you still don’t want to do that, that’s fine; you just need to shard your vectors and have a way to search all the shards in parallel, which is easily done with cloud services, or on a beefy machine you can access.
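
A toy sketch of that sharded search, using a thread pool to stand in for whatever parallelism you actually have (cloud workers, separate processes, etc.):

```python
# Toy sketch of sharded search: score each shard independently (a thread pool here;
# in practice each shard could live on its own worker) and merge the per-shard topK
# results into a global topK.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def search_shard(query_vec, vectors, ids, k):
    sims = vectors @ query_vec
    best = np.argsort(-sims)[:k]
    return [(float(sims[i]), ids[i]) for i in best]

def sharded_top_k(query_vec, shards, k=5):
    """shards: list of (vectors, ids) pairs, e.g. one pair per file or per worker."""
    with ThreadPoolExecutor() as pool:
        parts = pool.map(lambda s: search_shard(query_vec, s[0], s[1], k), shards)
    merged = sorted((hit for part in parts for hit in part), key=lambda t: t[0], reverse=True)
    return [rec_id for _, rec_id in merged[:k]]
```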


Thank you. Can you help me with the intuition and nomenclature behind this?

Say I have a bunch of mildly structured data describing companies. Something like { company, date, speaker, text }. I vectorize all those objects (or just the text?) and store them alongside my raw strings.

I then want to know something like “what is microsoft doing to expand its ai market share?” I vectorize that question and query out the closest matches to that vector. I then use some number of those matches to include the corresponding text in my LLM completion?

Does that query look like “{ company1, date1, speaker1, text1 }, { company2, date2, speaker2, text2 }, { company3, date3, speaker3, text3 }. what is microsoft doing to expand its ai market share?”

This implies the essence of creating the indexes is to take a large set of data on the client side and whittle it down, still on the client side, to a manageable amount to include in a query?

There are many options for relating the question to your data. The main intuition is that you want to maximize the overlap between the semantic meaning of the question and the semantic meaning sitting in your database.

So a popular, and straightforward, thing to try first is to embed the question and correlate this vector with the vectors in your database. You retrieve the text behind the top matches, feed it to the LLM as the “Context”, and ask the LLM to answer the question using only this “Context” or else respond “I don’t know”.
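
In openai-python terms, that baseline looks something like the sketch below (it assumes you already built db_vectors, db_ids, and text_by_id when you embedded your data; model names and prompt wording are placeholders):

```python
# Baseline flow: embed the question, correlate against the stored vectors, and pass
# the matching text as "Context" with an instruction to say "I don't know" otherwise.
# db_vectors (N x D array), db_ids, and text_by_id are assumed to exist already.
import numpy as np
from openai import OpenAI

client = OpenAI()
question = "what is microsoft doing to expand its ai market share?"

q_vec = np.asarray(
    client.embeddings.create(model="text-embedding-3-small", input=question).data[0].embedding
)
top = np.argsort(-(db_vectors @ q_vec))[:5]
context = "\n\n".join(text_by_id[db_ids[i]] for i in top)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system",
         "content": 'Answer only from the Context. If the answer is not there, reply "I don\'t know".'},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(resp.choices[0].message.content)
```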

But sometimes the question doesn’t line up with your data; this can happen for various reasons.

There are a few more tricks to line the two up.

One is to use HyDE, which basically asks the LLM to answer the question directly, so the answer is just made up, and you don’t want to send that back to the user. What you do instead is take this hypothetical answer, embed it, and correlate it with your vectors. In theory this transforms the question into something more closely matched to your database, but only for “open domain” knowledge the LLM was likely trained on, not super secret crazy stuff the LLM has never seen.
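
A sketch of the HyDE step, with placeholder model names and prompt wording:

```python
# Sketch of HyDE: have the model draft a hypothetical answer, embed the draft
# instead of the raw question, and use that vector for the correlation step.
# The draft is never shown to the user.
from openai import OpenAI

client = OpenAI()

def hyde_query_vector(question):
    draft = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Write a short passage that plausibly answers: {question}"}],
    ).choices[0].message.content
    return client.embeddings.create(
        model="text-embedding-3-small", input=draft
    ).data[0].embedding
```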

Another approach that I like is using keywords and embeddings together. This is more involved since you need to create your own “rarity index” of the words in your data set. You then split the incoming text into its component words, with their frequencies, and overlap these with the words and frequencies in your data. This is similar to the BM25 algorithm, but more scalable IMO.
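
The exact weighting is up to you; here is one illustrative way to sketch a rarity index (an IDF-style weight) and a rarity-weighted keyword score per chunk, as an assumption rather than the only way to do it:

```python
# Illustrative "rarity index": weight each word by how rare it is across your
# chunks, then score a chunk by the rarity-weighted overlap with the question's words.
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_rarity_index(chunks):
    doc_freq = Counter()
    for chunk in chunks:
        doc_freq.update(set(tokenize(chunk)))
    n = len(chunks)
    return {word: math.log(n / df) for word, df in doc_freq.items()}  # rarer word => larger weight

def keyword_scores(question, chunks, rarity):
    q_counts = Counter(tokenize(question))
    scores = []
    for chunk in chunks:
        c_counts = Counter(tokenize(chunk))
        overlap = q_counts.keys() & c_counts.keys()
        scores.append(sum(rarity.get(w, 0.0) * min(q_counts[w], c_counts[w]) for w in overlap))
    return scores  # one score per chunk; rank descending to get the keyword stream
```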

Anyway, after you get the keyword stream ranking, you also get the embedding stream ranking, and you fuse these two rankings into a single ranking using harmonic sums. This is called Reciprocal Rank Fusion. Now you pull the text chunks matching the highest fused ranking between keywords and embeddings.
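
A sketch of the fusion step, using the common 1 / (k + rank) form of Reciprocal Rank Fusion:

```python
# Reciprocal Rank Fusion with 1 / (k + rank) scoring (k = 60 is the usual constant):
# each item's fused score sums its reciprocal ranks across the input rankings.
def reciprocal_rank_fusion(rankings, k=60):
    """rankings: list of ranked lists of chunk ids, best first. Returns ids by fused score."""
    fused = {}
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            fused[item_id] = fused.get(item_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# e.g. fused_ids = reciprocal_rank_fusion([embedding_ranked_ids, keyword_ranked_ids])
# then pull the text chunks for the top fused ids.
```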

All this can be done locally or in the cloud without fancy services, but you have to be an algorithm person if you want to do it yourself.

But going back to your original question. Think about the incoming queries, and what transformations (if any) are needed to match your data the best. Do you need HyDE? Do you need to maybe translate your data to be more similar to the expected questions? Do you need to add a keyword functionality as well?

Then you correlate (in memory, for speed) to get your topK embedding matches (and also topK keyword matches if you need that), and feed these into the next stage: a prompt for the AI to carefully answer the question from, or to refuse to answer if it thinks the question is out of scope.
