Would using embeddings like this work in theory?

I know it’s not possible to expand the API’s knowledge base in general, but I’m trying to figure out a way to use embeddings to situationally retrieve additional information during chat completions.

Let’s say I’m making a Python helper, and I want to use an embedding to get info on a package it may not be familiar with if the user query is about that package. Would something like the following code work?

import openai

messages = []

def get_vector_match(query):
    # Retrieve vector db
    # Cosine similarity vector matching against the embedded query
    vector_text = "..."  # Text chunk of the closest matching vector
    return vector_text

def get_prompt(user_input):
    context = []
    messages.append(user_input)
    for index, message in enumerate(messages):
        if index % 2 == 0:
            context.append({"role": "user", "content": message})
        else:
            context.append({"role": "assistant", "content": message})
    return context

def create_chat_completion(user_input):
    top_match = get_vector_match(user_input)
    completion = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "system", "content": f"Use this information in your response if it's relevant to the query: {top_match}"}]
                 + get_prompt(user_input),
    )
    messages.append(completion.choices[0].message.content)
    return completion.choices[0].message.content

while True:
    user_input = input("You: ")
    completion = create_chat_completion(user_input)
    print("Bot: ", completion)

There’s probably a better way to use the system prompt, but hopefully that gets the point across.

OpenAI actually has an example of code-search using embeddings in Python. You could mimic this to pull in all your code and embed it.

Then you would feed the relevant content to the LLM, using a standard RAG approach, like in the other cookbook on question answering with embeddings.

Note that you need to give the LLM an out, and allow it to respond appropriately if the retrieved information is not relevant to the question. So allow it to say “I could not find an answer.” See cookbook for an example of how to do this.

The other feedback would be on your mod-2 implementation of assistant/user pairs. Sometimes the user can send two or more questions before the next assistant reply, so you may want to be more explicit and pull the assistant/user streams sorted by timestamp, rather than relying on interleaving assumptions … just to be safe. See the sketch below.
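One way to make that explicit (a minimal sketch; the helper names here are just illustrative, not part of your code) is to store the role with each message as it arrives, instead of reconstructing roles from position:

messages = []  # each entry keeps its role, in arrival order

def add_message(role, content):
    # Storing the role explicitly means the history never depends on
    # strict user/assistant alternation.
    messages.append({"role": role, "content": content})

def get_prompt(user_input):
    add_message("user", user_input)
    return list(messages)  # already in chat-completions format

With that, two user messages in a row are still labeled correctly, and you can sort or trim the history however you like.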

The vectors and the text you embed would be in a database; you would store the hash of the text as the key into the database, and search the vectors with a brute-force linear scan using Python and NumPy for the best performance. Then get the hashes of the top vector matches (positionally) and index into the DB to get the text.

However, you can even ditch the DB and keep all the text and vectors in separate in-memory arrays if you have enough memory (this works for most small RAG setups and is most likely the fastest option).

I have found that fully vectorized NumPy code isn't needed for the search; a simple linear for-loop is fastest in my experience, but feel free to benchmark different implementations to see what is fastest in your environment.
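If you want to sanity-check that in your own environment, a quick benchmark sketch like this is enough (the corpus here is just random unit vectors standing in for real embeddings):

import time
import numpy as np

# Stand-in corpus: 10,000 unit-length, Ada-002-sized embeddings
vectors = np.random.randn(10_000, 1536).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
query = vectors[0]

# Simple linear scan: one np.dot per stored vector
t0 = time.perf_counter()
loop_scores = [np.dot(query, v) for v in vectors]
t1 = time.perf_counter()

# Fully vectorized alternative: one matrix-vector product
mat_scores = vectors @ query
t2 = time.perf_counter()

print(f"for-loop scan: {t1 - t0:.4f}s, matrix product: {t2 - t1:.4f}s")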


Thank you for the detailed response; these cookbooks are exactly what I needed. A couple of questions if you don’t mind:

Note that you need to give the LLM an out, and allow it to respond appropriately if the retrieved information is not relevant to the question. So allow it to say “I could not find an answer.” See cookbook for an example of how to do this.

I scanned through the links but didn’t see this covered specifically. Will it default to a standard response if the information retrieved is not relevant, or do I have to explicitly tell it to say something like “I could not find the answer”?

However, you can even ditch the DB and keep all the text and vectors in separate in-memory arrays if you have enough memory (this works for most small RAG setups and is most likely the fastest option).

Can you expand on this a little? I’m new to working with vectors, but I thought they were pretty complicated data structures that needed a dedicated DB like Pinecone. How large/how many vectors can you store in arrays?

It’s in the notebook, when you run it. But it’s important. Here is the line in the notebook:

Use the below articles on the 2022 Winter Olympics to answer the subsequent question. If the answer cannot be found in the articles, write "I could not find an answer."

So just prompt it. You would tailor this to your application, but something general like this is required in the prompt.
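As a rough sketch (the wording and the retrieved_context placeholder are just illustrative, and you would tailor both to your application):

retrieved_context = "...text chunks returned by your vector search..."

# Give the model the retrieved context plus an explicit "out"
system_prompt = (
    "Use the information below to answer the user's question. "
    'If the answer cannot be found in the information, write "I could not find an answer."\n\n'
    + retrieved_context
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "How do I configure this package?"},
]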

Sure, I can expand. It’s a myth that you need Pinecone or some complicated database to run vector queries. But it takes a bit of understanding, and a few more lines of code, to get it running yourself. Let me explain …

A vector is just a list of numbers. It has deeper meaning than this, but you don’t need to understand the details to use these vectors. Just think of one, at a high level, as a list of numbers that contains the fingerprint of whatever text was embedded. So …

Text → Vector (fingerprint of Text)

You check whether two fingerprints match by doing some very simple math on them. You multiply the two lists together, element by element, to get a new list, then sum up all the numbers in this new list to see how correlated the two fingerprints are.

So

List A:

(x_1,x_2,x_3)

List B:

(y_1,y_2,y_3)

The new list is formed by multiplying the elements pointwise:

(x_1*y_1,x_2*y_2,x_3*y_3)

And then sum to form the correlation of the two fingerprints (a number between -1 and +1):

C = x_1*y_1+x_2*y_2+x_3*y_3

This value C is the correlation of the two fingerprints, or how related the texts are.

If you are using Ada-002 embeddings, each list will have 1536 numbers, so:

C = x_1*y_1+x_2*y_2+x_3*y_3 + ... +x_{1535}*y_{1535} + x_{1536}*y_{1536}

So it’s simple multiplies and adds. You don’t even have to do any additional math with Ada-002, because its embeddings are unit vectors (length one), so it’s just multiply and add, with no extra division needed to normalize out the lengths of the vectors.

The vectors are created in such a way that the maximum this correlation C can reach is +1, and the most negative it gets is -1. But in reality, if you are using Ada-002, your data is correlated if C > 0.9 and not correlated if C < 0.8. The area between 0.8 and 0.9 is grey and unknown. These thresholds are specific to the Ada-002 model; in theory C = -1 is the completely opposite (anti-correlated) case, so interpret the values in a model-specific way.

So with this working explanation, you can now correlate text by just multiplying and adding the corresponding vector coordinates!
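In Python with NumPy that whole calculation is one line. A toy example with 3-dimensional unit vectors (real Ada-002 vectors have 1536 numbers):

import numpy as np

a = np.array([0.6, 0.8, 0.0])   # fingerprint of text A (unit length)
b = np.array([0.8, 0.6, 0.0])   # fingerprint of text B (unit length)

# Pointwise multiply, then sum: C = 0.6*0.8 + 0.8*0.6 + 0.0*0.0 = 0.96
C = np.sum(a * b)
# Equivalently, in one call:
C = np.dot(a, b)
print(C)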

You can do this multiply and add in any programming language. The data structure can be hard, or simple, depending on what you are comfortable with.

The “hard” one I use is one that looks like this in Python.

{"Hash of Text 1": "Embedding vector of Text 1 as a numpy array",
 "Hash of Text 2": "Embedding vector of Text 2 as a numpy array",
 ...
 "Hash of Text N": "Embedding vector of Text N as a numpy array"}

This is a dictionary. The hash is formed by taking the SHA-256 hash of the text (or whatever your favorite hash is), and the embedding vector is a numpy array. Hashing is not required (see below), but it is an easy way to keep the database, the text, and the vectors in sync, because the hash serves as the common key.

At runtime, this is stored in a Python pickle, which is a binary file containing the hash/vector data.

So you load in the pickle, and then you have the data structure listed above.

You then form two arrays: one from the hashes (now a column), one from the vectors (another column, or array). Or just save these as separate arrays, maybe as two pickles, however you want to organize it.
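Here is a minimal sketch of that setup. The embed_text helper and the file name are placeholders; the stand-in embedding is just a random unit vector so the sketch runs end to end, and you would swap in your real Ada-002 call.

import hashlib
import pickle
import numpy as np

def embed_text(text):
    # Stand-in for a real embedding call (e.g. Ada-002): returns a unit-length vector
    v = np.random.randn(1536).astype(np.float32)
    return v / np.linalg.norm(v)

texts = ["first chunk of package docs", "second chunk", "third chunk"]

# Build the hash -> vector dictionary
store = {}
for text in texts:
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    store[key] = embed_text(text)

# Persist it as a pickle
with open("vectors.pkl", "wb") as f:
    pickle.dump(store, f)

# Later: load it back and split into parallel hash / vector arrays
with open("vectors.pkl", "rb") as f:
    store = pickle.load(f)

hashes = list(store.keys())
vectors = np.stack(list(store.values()))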

Then you embed the incoming text and correlate it with all the stored vectors (pointwise multiply, then sum, for the new vector against each vector in the list, in one for-loop). You pick the hashes (or indices) of the top-K correlations and then retrieve the corresponding text. I do this using the hashes, looking the text up in a database.

But you could avoid the database and keep all the text in a separate array, indexing by position, with the top indices matching the top vector correlation positions. This takes more memory, but may be easier and faster. The database is more hassle to set up and adds some latency to each query, but it uses less memory.
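Continuing the same sketch, query time is just the multiply-and-add from earlier plus an argsort. Keeping texts in a parallel list stands in for the no-DB variant, while the hashes would be your keys if you do use a DB (this assumes insertion order is preserved, which Python 3.7+ dicts guarantee):

K = 3  # how many matches to keep

query_vec = embed_text("how do I install this package?")

# Linear scan: one np.dot per stored vector
scores = np.array([np.dot(query_vec, v) for v in vectors])

# Indices of the top-K correlations, best first
top_idx = np.argsort(scores)[::-1][:K]

# No-DB variant: the text lives in a parallel array, so index by position
top_texts = [texts[i] for i in top_idx]

# DB variant: use the hashes as keys into your database instead
top_hashes = [hashes[i] for i in top_idx]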

You could process many, many embeddings (thousands, hundreds of thousands), but you eat into your precious memory. Why do you care? Well, it’s best to use all your memory for the embedding correlation and leave the text lookup to the database after the correlation.

If you have memory to spare, then yeah, put it all in memory, both text and vectors; skipping the DB is probably optimal since you are not making external lookup calls to a DB.

So most people start out in this situation and probably need no DB at all, and a linear search in Python, especially using NumPy’s np.dot(x, y), is already really fast.

Hope this helps!

@lostinsauce

As this topic has a selected solution, can it be closed?

@EricGT Yes it can be closed.

@curt.kennedy Thank you very much for the info, extremely helpful!
