It’s in the notebook, when you run it. But it’s important. Here is the line in the notebook:
Use the below articles on the 2022 Winter Olympics to answer the subsequent question. If the answer cannot be found in the articles, write "I could not find an answer."
So just prompt it. You would tailor this to your application, but something general like this is required in the prompt.
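For illustration, here is a minimal sketch of how you might assemble that kind of prompt in Python, assuming you have already retrieved the relevant article chunks as a list of strings (the function and variable names are placeholders of mine, not from the notebook):

```python
def build_prompt(question: str, articles: list[str]) -> str:
    # Instruction telling the model to answer only from the supplied articles
    intro = ('Use the below articles on the 2022 Winter Olympics to answer the '
             'subsequent question. If the answer cannot be found in the articles, '
             'write "I could not find an answer."')
    # Stack the retrieved articles, then put the question at the end
    context = "\n\n".join(f"Article:\n{a}" for a in articles)
    return f"{intro}\n\n{context}\n\nQuestion: {question}"
```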
Sure I can expand. And it’s a myth you need Pinecone or some complicated database to run vector queries. But it takes a bit of understanding, and a few more lines of code to get it running yourself. Let me explain …
A vector is just a list of numbers. It has deeper meaning than this, but you don’t need to understand the details to use these vectors. Just think of them, at a high level, as a list of numbers that contains the fingerprint of whatever text they were embedding. So …
Text → Vector (fingerprint of Text)
You check how similar one fingerprint is to another by doing some very simple math on them. You multiply the two lists together to get a new list, then sum up all the numbers in this new list to see how well the two fingerprints correlate.
So
List A:
(x_1,x_2,x_3)
List B:
(y_1,y_2,y_3)
The new list is formed by multiplying the elements point-wise:
(x_1*y_1,x_2*y_2,x_3*y_3)
And then sum to form the correlation of the two fingerprints (a number between -1 and +1):
C = x_1*y_1+x_2*y_2+x_3*y_3
This value C is the correlation of the two fingerprints, or how related the texts are.
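As a quick made-up example: if list A is (0.6, 0.8, 0) and list B is (0.8, 0.6, 0), then C = 0.6*0.8 + 0.8*0.6 + 0*0 = 0.96, so those two fingerprints are highly correlated.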
If you are using Ada-002 embeddings, each list will have 1536 numbers, so:
C = x_1*y_1+x_2*y_2+x_3*y_3 + ... +x_{1535}*y_{1535} + x_{1536}*y_{1536}
So it’s simple multiplies and adds. You don’t even have to do any additional math with Ada-002, because its embeddings are unit vectors (length one), so there is no extra division needed to normalize out the lengths of the vectors.
The vectors are created in such a way that the maximum this correlation C can reach is +1, and the most negative it can get is -1. In theory, C = 0 means totally uncorrelated and C = -1 means the texts point in opposite directions. But in practice, if you are using Ada-002, your data is correlated if C > 0.9 and not correlated if C < 0.8; the area between 0.8 and 0.9 is grey and unknown. These cutoffs are specific to Ada-002, so treat the thresholds as model specific.
So with this working explanation, you can now correlate text by just multiplying and adding the corresponding vector coordinates!
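In numpy this multiply-and-add is just a dot product. A minimal sketch, using random unit vectors as stand-ins for the embeddings you would get back from the API:

```python
import numpy as np

# Stand-ins for two Ada-002 embeddings (1536 numbers each, unit length).
# In practice these come back from the embeddings API.
a = np.random.randn(1536)
a /= np.linalg.norm(a)   # normalize to unit length, like Ada-002 vectors
b = np.random.randn(1536)
b /= np.linalg.norm(b)

# Point-wise multiply, then sum -- exactly the formula above
C = np.sum(a * b)

# Same thing in one call
print(C, np.dot(a, b))   # identical values; closer to +1 means more related text
```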
You can do this multiply and add in any programming language. The data structure can be hard, or simple, depending on what you are comfortable with.
The “hard” one I use is one that looks like this in Python.
{"Hash Of Text 1": "Embedding Vector of Text 1 as numpy array",
 "Hash Of Text 2": "Embedding Vector of Text 2 as numpy array",
 ...
 "Hash Of Text N": "Embedding Vector of Text N as numpy array"}
This is a dictionary. The hash is formed by taking the SHA-256 hash of the text (or whatever your favorite hash is), and the embedding vector is a numpy array. Hashing is not required (see below) but it is an easy way to sync database and text together, or sync database and vectors together, because the sync occurs with the hash.
At runtime, this is stored in a Python pickle, which is a binary file containing the hash/vector data.
So you load in the pickle, and then you have the data structure listed above.
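Here is a rough sketch of building, saving, and loading that structure, assuming you already have (text, embedding) pairs back from the embeddings API; the names and the stand-in data are mine, just for illustration:

```python
import hashlib
import pickle
import numpy as np

# Stand-in for the (text, embedding) pairs you got back from the embeddings API
texts_and_embeddings = [("some chunk of text", np.random.randn(1536).tolist())]

# Build the {hash: vector} dictionary
hash_to_vector = {
    hashlib.sha256(text.encode("utf-8")).hexdigest(): np.array(vec)
    for text, vec in texts_and_embeddings
}

# Save to a pickle at build time ...
with open("embeddings.pkl", "wb") as f:
    pickle.dump(hash_to_vector, f)

# ... and load it back at runtime
with open("embeddings.pkl", "rb") as f:
    hash_to_vector = pickle.load(f)
```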
You then form two arrays: one from the hashes (now a column), and one from the vectors (another column, or array). Or just save these as separate arrays, maybe two pickles; organize this however you want.
Then you embed the incoming text and correlate it with all the stored vectors (point-wise multiply, then sum, for the new vector against each vector in the list, one for-loop). You pick the hashes (or indices) of the top K correlations, and then retrieve the corresponding text. I do this using the hashes, and look up the text in a database.
But you could avoid the database and keep all the text in a separate array, indexed by position, with the top indices matching the top vector correlation positions. That takes more memory, but may be easier and faster. The database is more hassle to set up and adds some latency to the query, but you use less memory.
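Here is a rough sketch of that no-database variant, with texts and vectors in two parallel in-memory arrays so the top-K indices map straight back to the text (again, stand-in data and names, just for illustration):

```python
import numpy as np

# Parallel arrays: texts[i] goes with vectors[i]
texts = ["chunk one", "chunk two", "chunk three"]            # your stored text chunks
vectors = np.random.randn(3, 1536)                           # stand-in embeddings
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)    # unit length, like Ada-002

def top_k(query_vec, k=2):
    # One matrix-vector product does the multiply-and-add against every stored vector
    scores = vectors @ np.asarray(query_vec)
    # Indices of the k largest correlations, best first
    best = np.argsort(scores)[::-1][:k]
    return [(texts[i], float(scores[i])) for i in best]

query = np.random.randn(1536)
query /= np.linalg.norm(query)
print(top_k(query))
```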
You could process many, many embeddings (thousands, even hundreds of thousands), but you eat into your precious memory. Why do you care? Well, it’s best to use all your memory for the embedding correlation, and leave the text lookup to the database after the correlation.
If you have memory to spare, then yeah, putting it all in memory, both text and vectors, and skipping the DB is probably optimal, since you are not making external lookup calls to a DB.
So most people start out in this situation and probably need no DB at all; a linear search in Python, especially using numpy’s np.dot(x, y), is already really fast.
Hope this helps!