Embeddings are for semantic search, classification, and clustering

You can feed the result from semantic search back into a completion with a question

You can build a robust classifier (you can do this to some degree with the completion endpoint too)

You can cluster text into groups to find hidden similarities. You can feed two, three (or more) items from a cluster back into a completion and ask what they have in common, to find out what you may not see as a human.
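To make the semantic search idea concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are made up for illustration (real embeddings have many more dimensions), and `top_matches` is a hypothetical helper name, not part of any library:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_matches(query_vec, passages, k=2):
    """Return the k (score, text) pairs most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy vectors standing in for real embeddings (invented for this example).
passages = [
    ("The Hatter and the March Hare were having tea.", [0.9, 0.1, 0.0]),
    ("The Queen ordered a game of croquet.", [0.1, 0.9, 0.1]),
    ("Alice fell down the rabbit hole.", [0.0, 0.2, 0.9]),
]
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of a question about the tea party

results = top_matches(query_vec, passages, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The best-scoring passages are what you would then paste into a completion prompt together with the question.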

Excellent! I know a doctor who just started a project that requires some light meta-analysis of radiology curricula in the context of family practice residency programs. I’m thinking this might help!

Hi, if you wouldn’t mind clarifying a bit: Do I use text-similarity to perform semantic search on a document?

I don’t have the research papers yet, so to practice, I want to summarize Lewis Carroll’s corpus and be able to query it about Wonderland. If my understanding is correct, a query like “Who is at the Mad Hatter’s tea party?” would return the summarized passage describing the tea party?

I found this:

… but they appear to use a pre-formatted dataset, so I don’t know if I need to do any preprocessing of the text to make it match that format

Ah! I need to read more closely. This was staring me in the face:

It does appear to require a .csv format, though:

import pandas as pd
import numpy as np
datafile_path = "./data/fine_food_reviews_with_embeddings_1k.csv" # <- This right here
df = pd.read_csv(datafile_path)
df["ada_search"] = df.ada_search.apply(eval).apply(np.array)

Would this work?:

import pandas as pd
data = pd.read_csv("input.txt", delimiter="\t")
data.to_csv("output.csv", index=False)

Yes, it doesn't have to be a CSV file. It could be a tab-delimited file, a database, or any other data source

The examples use a CSV file for simplicity. You can also view the vectors in the file once you have done the embedding. The food review dataset is open source and, from memory, the URL has a link to a copy you can download
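As a rough illustration of the "any data source" point, this sketch stores embeddings in a tab-delimited file using only the standard library. The rows and the `text` / `embedding` column names are made up, and serializing each vector as a JSON string is one possible choice, not the format the official examples use:

```python
import csv
import io
import json

# Made-up rows of (text, embedding); real ada-002 vectors hold 1536 floats.
rows = [
    ("tea party", [0.12, -0.03, 0.57]),
    ("croquet", [-0.40, 0.22, 0.08]),
]

# Write a tab-delimited "file" (an in-memory buffer here; a real path works the same way).
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t")
writer.writerow(["text", "embedding"])
for text, vec in rows:
    writer.writerow([text, json.dumps(vec)])

# Read it back, turning each JSON string into a list of floats again.
buffer.seek(0)
reader = csv.DictReader(buffer, delimiter="\t")
loaded = [(r["text"], json.loads(r["embedding"])) for r in reader]
print(loaded)
```

Parsing with `json.loads` also avoids calling `eval` on file contents when loading the vectors.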

import openai
import numpy as np
import math
import pandas as pd
with open("AAiW/alice_in_wonderland.txt", "r") as f:
    data = f.read()
# Derive a chunk size from the word count (a rough stand-in for tokens)
chunk_size = math.ceil(len(data.split()) / 5)
# Split the text into chunks. Note: this slices by characters, not words,
# so it produces more than five chunks.
chunks = [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]
# Initialize an empty list to store the embeddings
embeddings = []
# Call the Embedding.create function for each chunk
openai.api_key = "OH-NOYOUDONT"
for chunk in chunks:
    response = openai.Embedding.create(input=chunk, model="text-embedding-ada-002")
    embeddings.append(response['data'][0]['embedding'])
# Concatenate the embeddings into one long vector
# (its length is the number of chunks times the embedding size)
final_embedding = np.concatenate(embeddings)
print(len(final_embedding))
# Helper that embeds a single piece of text
def get_embedding(text, model="text-embedding-ada-002"):
    text = text.replace("\n", " ")
    return openai.Embedding.create(input = [text], model=model)['data'][0]['embedding']

Making progress! No idea why they had to split up the documentation like they did, but it doesn’t seem to be in order
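For the chunking step in the script above, a word-based splitter may be easier to reason about. This is only a sketch: `split_into_chunks` is a hypothetical helper, the `max_words` and `overlap` values are arbitrary, and word count is just a rough proxy for the model's token count:

```python
def split_into_chunks(text, max_words=200, overlap=20):
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks

# Synthetic 450-word text so the example runs without any input file.
sample = " ".join(f"w{i}" for i in range(450))
chunks = split_into_chunks(sample, max_words=200, overlap=20)
print(len(chunks), [len(c.split()) for c in chunks])
```

For semantic search you would typically embed each chunk on its own and keep one vector per chunk, so that each can be compared against the query embedding.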

Yes, look at how the model (text-embedding-ada-002) perceives these comparative tests using the OpenAI API embeddings and a hand-coded cosine similarity function (in Ruby):

So, at least according to the model above, out of the five words / phrases above, dogs are most similar to pets, followed by dog food, then cats, and so forth, with rabies being the least similar.

Running these tests (using the OpenAI API embeddings method) is currently free or very cheap, so I have found that one of the best ways to get a grasp on these vectors is to run a bunch of comparison tests and look at the results.

Thanks for reviving this thread! The person I was helping with this has been incommunicado, but it’s great to have this info for myself and other users. Does ruby-openai not include embeddings_utils? Bc the Python version has cosine_similarity

Fair, I just wanted to check, since a lot of folks coming in aren’t necessarily familiar with linear algebra, so using an included function could prevent some headaches!

It takes less time to ask ChatGPT to write this method than to discuss it. So, like I mentioned, it seems “not important” (to me) whether the ruby-openai gem includes this method or not. Sorry, not trying to be difficult, just being factual. It’s just a few lines and it’s an easy task for ChatGPT, TBH:

def cosine_similarity(a, b)
  dot_product = a.zip(b).map { |x, y| x * y }.reduce(:+)
  length_a = Math.sqrt(a.map { |x| x ** 2 }.reduce(:+))
  length_b = Math.sqrt(b.map { |x| x ** 2 }.reduce(:+))
  dot_product / (length_a * length_b)
end
a = [1, 2, 3]
b = [4, 5, 6]
cosine_similarity(a, b) # Output: 0.9746318461970762

Since the vectors coming out of ada-002 are all unit length, you can just take the dot product (without the square roots and normalization).

In numpy, this is just

c = np.dot(q,v)

Or by hand by computing the sum of the coordinate-wise products of the two vectors (but numpy likely has a faster optimization than a hand coded version)

This will be close to +1 for correlated embeddings.

To get an angle in degrees, just take the inverse cosine of the dot product (this is only correct in the unit-vector case)

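import math
def safe_acos(d):
    # Clamp to [-1, 1] so rounding error cannot push d outside acos's domain
    if d > 1.0:
        d = 1.0
    if d < -1.0:
        d = -1.0
    # Convert the angle from radians to degrees
    return (180/math.pi)*math.acos(d)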

To optimize, just look at the dot product, not the angle, to avoid the unnecessary acos computation
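Here is a small sketch of why skipping acos is safe for ranking: on unit vectors, sorting by descending dot product gives the same order as sorting by ascending angle. The vectors are invented, and `normalize` is only there to make them unit length as ada-002 output is described above:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale v to unit length."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

def angle_degrees(a, b):
    d = max(-1.0, min(1.0, dot(a, b)))  # clamp against rounding error
    return math.degrees(math.acos(d))

query = normalize([1.0, 2.0, 2.0])
candidates = {
    "a": normalize([1.0, 2.0, 1.5]),
    "b": normalize([2.0, 0.0, 1.0]),
    "c": normalize([-1.0, 0.5, 0.0]),
}

# Largest dot product first versus smallest angle first.
by_dot = sorted(candidates, key=lambda k: dot(query, candidates[k]), reverse=True)
by_angle = sorted(candidates, key=lambda k: angle_degrees(query, candidates[k]))
print(by_dot, by_angle)
```

Both orderings agree because acos is monotonically decreasing on [-1, 1], so the angle adds no ranking information.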

Excellent point about using the dot_product and not the cosine similarity function @curt.kennedy. Thanks.

Actually, I originally wrote my test model using only the dot product, but I read somewhere that cosine similarity was the preferred method, so I went that way yesterday.

However, after testing this AM, I changed my testing method back to the dot product. (… actually, I included both methods and made the dot product the default method).

Yes, but if the magnitude of the unit vector is normalized to 1, then is the dot product more efficient?

ChatGPT

If the magnitude of the unit vector is normalized to 1, then the dot product is equivalent to cosine similarity, as both methods will be measuring the cosine of the angle between the vectors. In this case, the dot product is just as efficient as cosine similarity in terms of computational cost, as both methods involve taking the dot product of the vectors.

When comparing text vectors, it is common to represent the text as a high-dimensional sparse vector, where each dimension corresponds to a word in a vocabulary. The sparse nature of this representation makes the dot product less efficient than cosine similarity, because the dot product requires iterating over all the non-zero elements of the vectors, while cosine similarity only requires the dot product of the vectors.

If the vectors are not sparse, then the dot product and the cosine similarity will be equivalent in terms of computational cost, and using dot product is more memory efficient as it doesn’t need to store the vector’s norm.

In summary, if the vectors are normalized and not sparse, the dot product and cosine similarity are computationally equivalent. However, if the vectors are sparse and represent text data, cosine similarity is generally considered to be more efficient because it only requires the dot product of the vectors and not iterating over all the non-zero elements of the vectors.

Technically you are still doing cosine similarity with the dot product in the case of unit vectors. If you look at the formula, the cosine of the angle between vectors is the dot product divided by the product of the lengths of each vector. So in this case the length of each vector is 1, so your denominator is 1. So all you have in the numerator is the dot product. So the cosine of the angle is the dot product!
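Written out, with a and b as the two embedding vectors and θ the angle between them (symbols chosen here just for illustration), the argument is:

```latex
\cos\theta
  = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert},
\qquad
\lVert\mathbf{a}\rVert = \lVert\mathbf{b}\rVert = 1
\;\Longrightarrow\;
\cos\theta = \mathbf{a}\cdot\mathbf{b}
```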

@raymonddavey is exactly right. It doesn’t have to be a CSV file, but note that it cannot be an xlsx file, because the vector is too big for an Excel cell to hold. CSV files work really well, though.