Holy F*ING ST! My head just exploded. I’m so dense. Is it possible to perform a vector search by simply multiplying the array items? You said this to me in another thread and I totally did not see the use of the dot-product. This is reliable?

2 Likes

@bill.french Haha Bill, yes, that’s all it is. If your embeddings are unit vectors, which they all are from OpenAI, you only need to use the dot-product, which is simply: pointwise multiply the array terms and sum the result. The largest number you can get is +1, and the smallest is -1. But due to the non-isotropic nature of OpenAI’s embeddings, you will not see anything much less than +0.7. But that is another story.
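
For example, a minimal NumPy sketch of that idea (the toy vectors below are made up and normalized by hand; real OpenAI embeddings already come back unit length):

import numpy as np

# Toy vectors, normalized by hand here; real OpenAI embeddings already
# come back as unit vectors, so this step is unnecessary for them.
a = np.array([0.1, -0.4, 0.9])
b = np.array([0.2, -0.3, 0.8])
a = a / np.linalg.norm(a)
b = b / np.linalg.norm(b)

# For unit vectors, cosine similarity IS the dot product:
# pointwise multiply the terms and sum them.
similarity = np.dot(a, b)      # same as (a * b).sum()
print(similarity)              # somewhere between -1 and +1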

1 Like

Apologies to @kintela for hijacking the topic.

Okay - this is so much more simplified than I imagined. Given this …

"embedding": [
        -0.006929283495992422,
        -0.005336422007530928,
        ...
        -4.547132266452536e-05,
        -0.024047505110502243
      ],

I multiply …

embedding[0] * embedding[1] * embedding[2] * embedding[3] * embedding[4] *  ...

And that leaves me with a product that I can compare to other embeddings with the same treatment. The smaller the delta, the closer the similarity, right?

… and sum the result.

This part I’m not getting.

You would actually multiply the corresponding values of the two embeddings you are comparing, and then sum it all up

so it’d be

(first_embedding[0] * second_embedding[0]) + (first_embedding[1] * second_embedding[1]) + […]

1 Like

I see. So, in a sizeable list of comparisons, that’s a lot of crunching.

In your initial example, where you multiplied all the terms together, that is N-1 multiplications, and the actual dot-product between two vectors has N multiplications, so it’s essentially the same amount of work.

Dot-products are the backbone of AI, and it’s not because of embeddings; it’s because when you multiply two matrices together, you are taking various dot-products of all the rows and columns of the two matrices. And this matrix multiplication is the backbone of AI. The only thing following the matrix multiply is the non-linear activation function, and these are not as computationally expensive as the matrix multiplies (which are lots and lots of dot-products).

So if you like AI, then you like dot products! OK, maybe not, but that is basically all that is going on computationally behind the scenes.

Just for reference, and general appreciation of the dot-product, I will take the dot-product of the vector [1, 2, 3] with the vector [4, 5, 6]. The answer is 1*4 + 2*5 + 3*6 = 32. See, was that so hard?
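
And to tie that back to the matrix-multiplication point above, here is a toy NumPy illustration (small made-up matrices, nothing model-specific):

import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])       # 2 x 3
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])        # 3 x 2

# Every entry of A @ B is the dot product of a row of A with a column of B.
C = A @ B
assert C[0, 1] == np.dot(A[0, :], B[:, 1])   # 1*8 + 2*10 + 3*12 = 64
print(C)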

1 Like

Okay, you guys have really opened my eyes. Ergo?

let qAsked;
let qAskedE;

// define the dot product
let dot = (a, b) => a.map((x, i) => x * b[i]).reduce((m, n) => m + n);

// get the trained embedding
let qTrained  = "Does CyberLandr have its own battery?";
let qTrainedE = JSON.parse(getEmbedding_(qTrained)).data[0].embedding;

// ask the question a similar way
qAsked    = "Is there a battery in CyberLandr?";
qAskedE   = JSON.parse(getEmbedding_(qAsked)).data[0].embedding;
Logger.log(dot(qTrainedE, qAskedE)); // 0.9667737555722923

// ask the question a less similar way
qAsked    = "Does CyberLandr get its power from Cybertruck?";
qAskedE   = JSON.parse(getEmbedding_(qAsked)).data[0].embedding;
Logger.log(dot(qTrainedE, qAskedE)); // 0.9177178345509003

I think I’m about to start a love affair with dot products. Thanks to both of you!

So, @kintela, to rejoin the point of your question, we circle back to @wfhbrian’s original response…

I was hip to embeddings before I understood the mechanics of embeddings, and I knew in your case, the most practical and cost-effective way to build your solution was probably with embeddings.

2 Likes

@cliff.rosen Thanks for the clarification questions and thank you @bill.french for the detailed response.

Just to clarify: CustomGPT does not build new LLMs. We take your training data and use embeddings to pass context to the ChatGPT API. This is very similar to how the new Plugins functionality works. If you read the Plugins reference code that OpenAI put up, we do all that (and some more). So it’s basically available to customers in a no-code platform. Our customers come and upload their data and/or websites and build bots from them (behind the scenes, we use embeddings and completions).

1 Like

It sure is. Works like a charm. (And very nicely explained by @curt.kennedy)

Though as you start dealing with large datasets, there are other issues that crop up when calculating the context (like duplicates). And then once you’ve eliminated the duplicates, you will look at the resulting context and say “Huh? Can I make this context better for my use case?” – so that kicks off a “post vector search” optimization initiative.

2 Likes

Ergo, this is where mere mortals realize - “I should have used CustomGPT so my company could be using this three weeks ago.”

Hey guys, following up on this discussion.

I’m actually the maintainer of an open-source API that solves this problem of connecting your data to ChatGPT/GPT-4 or building plugins: GitHub - different-ai/embedbase: A dead-simple API to build LLM-powered apps

Regarding duplicates,

embedbase never computes an embedding twice, thanks to some tricks, saving everyone a lot of money

you can also do semantic search across multiple datasets

feel free to give it a try by running it yourself or trying the free hosted version :slight_smile:

3 Likes

To avoid duplicates, I keep track of what I have embedded previously with the HASH of the previously embedded text, and use this as an index into my embedding database.

For example, “mouse” has a Sha3-256 hash of

import hashlib

X = "mouse"
H = hashlib.sha3_256(X.encode())
print(f"HASH: {H.hexdigest()}")
# HASH: 6ca66ca0a1713279fbdc1d0f5568d033d3593004251c3f564186a4e4e284cdeb

Then whenever I embed anything else, I compute the hash and check whether it is already in my embedding database. If it is, I don’t have to embed it; I just pull the previous embedding vector. You won’t have duplicates if you do this!
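
Roughly, the lookup flow is something like this (the embedding_db dict and the get_embedding callable are placeholders for whatever database and embedding client you actually use):

import hashlib

embedding_db = {}   # hash of text -> embedding vector (stand-in for the real database)

def embed_once(text, get_embedding):
    """Embed `text` only if this exact string has not been embedded before."""
    key = hashlib.sha3_256(text.encode()).hexdigest()
    if key in embedding_db:
        return embedding_db[key]        # duplicate: reuse the stored vector
    vector = get_embedding(text)        # e.g. a call out to the embeddings API
    embedding_db[key] = vector
    return vector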

Note that GPT is case-sensitive, so “Mouse” is different from “mouse”, and luckily this results in a separate hash too:

import hashlib

X = "Mouse"
H = hashlib.sha3_256(X.encode())
print(f"HASH: {H.hexdigest()}")
# HASH: 4c2e2fe9ae1d56701bea18593b67dc59d862106f959a132c640352780b5d0339

You can hash lower-cased text too, but realize GPT “sees” “Mouse” as different from “mouse”.

Note: Sha3-256 is probably overkill, but that’s what I use these days.

Oh, and to be clear, this is only on the database/lookup side. In my case, for search, I scan the database to create an in-memory data structure: a Python dict whose keys are the hashes and whose values are the NumPy versions of the embedding vectors. This is then saved as a pickle to S3 and loaded into memory when I am ready to search. So you will periodically update this file as your embedded data (knowledge) changes over time.
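
In rough outline, that structure looks something like this (the keys and tiny vectors are placeholders, and the actual S3 upload/download step is left out):

import pickle
import numpy as np

# keys: SHA3-256 hash of the embedded text; values: the embedding as a NumPy array
# (tiny made-up vectors here, just to show the shape of the structure)
search_index = {
    "hash_of_text_1": np.array([-0.0069, -0.0053, 0.0121]),
    "hash_of_text_2": np.array([0.0042, -0.0240, 0.0007]),
}

# Persist the whole index as a pickle; in practice the file would be pushed
# to S3 and pulled back into memory right before searching.
with open("embedding_index.pkl", "wb") as f:
    pickle.dump(search_index, f)

with open("embedding_index.pkl", "rb") as f:
    search_index = pickle.load(f)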

1 Like

I’ve begun to realize that to do AI projects really well, with excellent efficiency, and in a financially practical way, everything we learned through the ages of computing is suddenly more relevant than ever before.

2 Likes

@louis030195 Love the open source project – nicely done!

The duplicate removal method you mentioned seems to remove duplicate “exact match” strings, right? This scenario almost never happens - because due to the chunking, the strings will be off a little and not become exact duplicates. You can try it out with web content pages and you will see what I mean. What’s needed is a “semantic duplicate” removal.

We tried that as the first approach (the md5 hash) – it worked only in the case of exact match duplicates (see above). The problem is: When chunking web pages, the chunk will always be off by 1-2 characters and then you get duplicates like “Copyright 2023 : Curt Kennedy”. So a semantic duplicate removal is needed. But this approach with the hash works great as a quick spot fix.

1 Like

The best I could do in this situation is lower-case the text and remove all leading and trailing whitespace before you embed and hash it. Otherwise, you are going to have to implement a bunch of fine-grained rules to re-format the internal contents of the text string for consistency. If you see a common pattern, like " :", replace it with ":" (remove the leading space); you can do this in addition to lower-cased hashing to reduce dupes even further. Regex is your friend!
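
Something along these lines, as a sketch (the specific regex rules are just examples of the kind of normalization described above):

import hashlib
import re

def clean_for_hash(text: str) -> str:
    """Normalize a chunk before hashing: lower-case, trim whitespace,
    and collapse a couple of common patterns (extend the rules as you find them)."""
    text = text.strip().lower()
    text = re.sub(r"\s+:", ":", text)   # "Copyright 2023 : Curt" -> "copyright 2023: curt"
    text = re.sub(r"\s+", " ", text)    # collapse runs of whitespace
    return text

key = hashlib.sha3_256(clean_for_hash("  Copyright 2023 : Curt Kennedy ").encode()).hexdigest()
print(key)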

2 Likes

Nice, thanks for sharing! I actually worked on duplicate issues in the past and used this heuristic

def string_similarity(
    str1: str, str2: str, substring_length: int = 2, case_sensitive: bool = False
) -> float:
    """
    Calculate similarity between two strings using
    https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient
    Computing time O(n)
    :param str1: First string to match
    :param str2: Second string to match
    :param substring_length: Optional. Length of substring to be used in calculating similarity. Default 2.
    :param case_sensitive: Optional. Whether you want to consider case in string matching. Default false;
    :return: Number between 0 and 1, with 0 being a low match score.
    """
    if not case_sensitive:
        str1 = str1.lower()
        str2 = str2.lower()

    if len(str1) < substring_length or len(str2) < substring_length:
        return 0

    m = {}
    for i in range(len(str1) - (substring_length - 1)):
        substr1 = str1[i : substring_length + i]
        m[substr1] = m.get(substr1, 0) + 1

    match = 0
    for j in range(len(str2) - (substring_length - 1)):
        substr2 = str2[j : substring_length + j]
        count = m.get(substr2, 0)

        if count > 0:
            match += 1
            m[substr2] = count - 1

    return (match * 2) / (len(str1) + len(str2) - ((substring_length - 1) * 2))
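
For example, assuming the function above is in scope:

print(string_similarity("Copyright 2023 : Curt Kennedy",
                        "Copyright 2023: Curt Kennedy"))   # well above 0.9 -- near-duplicate
print(string_similarity("mouse", "keyboard"))              # 0.0 -- no shared bigrams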

The problem with “semantic similarity” is that computing similarity using embeddings in order to avoid re-computing embeddings seems wrong :smiley:

I could easily implement duplicate filtering using heuristics though

I suppose a middle solution would be a very fast model specialized for similarity check that runs locally or in a microservice

Interesting choice. I am not too familiar with this method, but I see that its strengths are short strings (compared to Levenshtein distance).

Is this not what doing a similarity check with embeddings is? Thinking about it more, it seems slightly counter-productive to compare a potential new string with every single other string to determine if it’s a duplicate when vector databases do all of this very efficiently. Admittedly, I have never pushed the limits of processing power, but I have never had an issue with duplicates, except from small strings, which are usually semantically worthless anyway when there’s no context attached. As @curt.kennedy has mentioned, I focus more on “cleaning” the string before processing it, as there are typos and simply a hundred ways to say the same thing.

Even using text-embedding-ada-002 isn’t too bad. A string such as:

Hello my name is Frank and I demand services. My company sells “devil shoes”. The shoes without a sole! Instead of simply buying a shoe that works out the box, people need to buy a sole that deteriorates (because it is made with mushrooms!) and requires a monthly subscription. How can you benefit me

only costs $0.000026

I imagine it really depends on what you plan on doing with the cached information.

Another thing you can try, though it isn’t optimal: if you find yourself in a situation where you embed lots of similar things that you can’t seem to clean up, take the embedding and check the dot-product (cosine similarity) against all your previous embeddings. If it is further than 0.0001 away from all of them (or whatever threshold you pick), consider it new information and add it to the embedding database; otherwise, consider it similar and discard it.

Like I said, not as efficient as Clean → Hash → Lookup → Decide, but extreme consequences sometimes warrant extreme preventative measures. I would consider this a last-ditch effort.
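
A rough sketch of that last-ditch check (assuming unit-length embeddings kept in a plain list; the names and threshold are placeholders):

import numpy as np

def is_new_information(candidate, stored_vectors, threshold=0.0001):
    """Treat `candidate` as a duplicate if any stored embedding is within
    `threshold` of it, i.e. the dot product exceeds 1 - threshold."""
    for vec in stored_vectors:
        if np.dot(candidate, vec) > 1.0 - threshold:
            return False    # too close to something already in the database
    return True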

1 Like

Yes, agreed.

In terms of efficiency, usually running a complete database cycle on every request is not the way to go.
Perhaps an occasional clean-up tool?

1 Like

@RonaldGRuckus That would work too: just a cron job running every X hours or days on the new inputs, and a deep scan every month or two to revisit the whole database.