Can this API be used to query internal data?

It sure can. Works like a charm. (And very nicely explained by @curt.kennedy )

Though as you start dealing with large datasets, there are other issues that crop up when calculating the context (like duplicates). And then once you’ve eliminated the duplicates, you will look at the resulting context and say “Huh? Can I make this context better for my use case?” – so that kicks off a “post vector search” optimization initiative.


Ergo, this is where mere mortals realize - “I should have used CustomGPT so my company could be using this three weeks ago.”

Hey guys, following up on this discussion.

I’m actually the maintainer of an open-source API that solves this problem of connecting your data to ChatGPT/GPT-4 or building plugins: GitHub - different-ai/embedbase: A dead-simple API to build LLM-powered apps

Regarding duplicates:

Embedbase never computes an embedding twice (through some tricks), saving everyone a lot of money.

You can also do semantic search across multiple datasets.

Feel free to give it a try by running it yourself or trying the free hosted version :slight_smile:


To avoid duplicates, I keep track of what I have embedded previously using the hash of the previously embedded text, and use this hash as an index into my embedding database.

For example, “mouse” has a Sha3-256 hash of

import hashlib

X = "mouse"
H = hashlib.sha3_256(X.encode())
print(f"HASH: {H.hexdigest()}")
# HASH: 6ca66ca0a1713279fbdc1d0f5568d033d3593004251c3f564186a4e4e284cdeb

Then whenever I embed anything else, I compute the hash and check whether it is already in my embedding database. If it is, I don’t have to embed it again; I just pull the previous embedding vector. You won’t have duplicates if you do this!
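A minimal sketch of that hash-then-lookup flow (the `embed` stub and the `embedding_db` dict are hypothetical placeholders for your real embedding call and your real store):

```python
import hashlib

# Hypothetical in-memory embedding store: hash of text -> embedding vector.
embedding_db = {}

def embed(text):
    # Placeholder for a real embedding call (e.g. an embeddings API).
    return [0.0] * 1536

def get_embedding(text):
    """Return the embedding for `text`, computing it only once per unique string."""
    key = hashlib.sha3_256(text.encode()).hexdigest()
    if key not in embedding_db:
        embedding_db[key] = embed(text)  # only embed on a cache miss
    return embedding_db[key]
```

Repeated calls with the same string hit the cache; a differently-cased string hashes to a new key and gets its own entry.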

Note that GPT is case sensitive, so “Mouse” is different than “mouse”, and luckily this results in a separate hash too:

import hashlib

X = "Mouse"
H = hashlib.sha3_256(X.encode())
print(f"HASH: {H.hexdigest()}")
# HASH: 4c2e2fe9ae1d56701bea18593b67dc59d862106f959a132c640352780b5d0339

You can go with lower-case hashes too, but realize GPT “sees” that “Mouse” is different than “mouse”.

Note: Sha3-256 is probably overkill, but that’s what I use these days.

Oh, and to be clear, this is only on the database/lookup side. In my case, for search, I scan the database to create an in-memory data structure: a Python dict whose keys are the hashes and whose values are the numpy versions of the embedding vectors. This dict is then saved as a pickle to S3 and loaded into memory when I am ready to search. So you will periodically update this file as your embedded data (knowledge) changes over time.
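A rough sketch of building and pickling that in-memory structure (the `rows` data is made up, and the S3 upload/download step is elided to keep the example self-contained):

```python
import hashlib
import pickle

import numpy as np

# Hypothetical rows scanned from the embedding database: (text, vector) pairs.
rows = [("mouse", [0.1, 0.2]), ("cat", [0.3, 0.4])]

# Build the in-memory search structure: hash of the text -> numpy vector.
index = {
    hashlib.sha3_256(text.encode()).hexdigest(): np.array(vec)
    for text, vec in rows
}

# Serialize with pickle; in practice this blob would be uploaded to S3
# and re-downloaded whenever the embedded knowledge changes.
blob = pickle.dumps(index)

# Later, at search time, load it back into memory and search against it.
loaded = pickle.loads(blob)
```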


I’ve begun to realize that to do AI projects really well, with excellent efficiency, and financially practical – everything we learned through the ages of computing is suddenly more relevant than ever before.


@louis030195 Love the open source project – nicely done!

The duplicate removal method you mentioned seems to remove duplicate “exact match” strings, right? This scenario almost never happens - because due to the chunking, the strings will be off a little and not become exact duplicates. You can try it out with web content pages and you will see what I mean. What’s needed is a “semantic duplicate” removal.

We tried that as the first approach (the md5 hash) – it worked only in the case of exact match duplicates (see above). The problem is: When chunking web pages, the chunk will always be off by 1-2 characters and then you get duplicates like “Copyright 2023 : Curt Kennedy”. So a semantic duplicate removal is needed. But this approach with the hash works great as a quick spot fix.


The best I could do in this situation is use the lower-cased hash and remove all leading and trailing whitespace before you embed and hash. Otherwise, you are going to have to implement a bunch of fine-grained rules to re-format the internal contents of the text string for consistency. If you see a common pattern, like " :", replace it with ":" (removing the leading space); you can do this in addition to lower-cased hashes to reduce dupes even further. Regex is your friend!
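As a sketch of that cleaning step (the `normalize` and `dedupe_key` helper names and the specific regex rules are my own illustration, not an established recipe):

```python
import hashlib
import re

def normalize(text):
    """Clean a string before hashing/embedding: lower-case, trim outer
    whitespace, and fix a couple of common formatting quirks."""
    text = text.strip().lower()
    text = re.sub(r"\s+:", ":", text)  # " :" -> ":" (drop the leading space)
    text = re.sub(r"\s+", " ", text)   # collapse runs of whitespace
    return text

def dedupe_key(text):
    """Hash of the normalized text, used as the dedup lookup key."""
    return hashlib.sha3_256(normalize(text).encode()).hexdigest()
```

With this, "  Mouse " and "mouse" collapse to the same key, so near-identical chunks that differ only in case or stray whitespace no longer produce duplicates.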


Nice, thanks for sharing! I actually worked on duplicate issues in the past and used this heuristic:

def string_similarity(
    str1: str, str2: str, substring_length: int = 2, case_sensitive: bool = False
) -> float:
    """Calculate similarity between two strings by counting shared
    substrings of length `substring_length` (a Dice-coefficient-style
    measure over character n-grams).
    Computing time: O(n).

    :param str1: First string to match
    :param str2: Second string to match
    :param substring_length: Optional. Length of substring to be used in calculating similarity. Default 2.
    :param case_sensitive: Optional. Whether you want to consider case in string matching. Default False.
    :return: Number between 0 and 1, with 0 being a low match score.
    """
    if not case_sensitive:
        str1 = str1.lower()
        str2 = str2.lower()

    if len(str1) < substring_length or len(str2) < substring_length:
        return 0

    # Count occurrences of each substring of str1.
    m = {}
    for i in range(len(str1) - (substring_length - 1)):
        substr1 = str1[i : substring_length + i]
        m[substr1] = m.get(substr1, 0) + 1

    # Count substrings of str2 that also appear in str1, consuming each match.
    match = 0
    for j in range(len(str2) - (substring_length - 1)):
        substr2 = str2[j : substring_length + j]
        count = m.get(substr2, 0)

        if count > 0:
            match += 1
            m[substr2] = count - 1

    return (match * 2) / (len(str1) + len(str2) - ((substring_length - 1) * 2))

# Example: string_similarity("mouse", "house") -> 0.75 (3 shared bigrams out of 4 each)

The problem with “semantic similarity” is that computing similarity using embeddings in order to avoid re-computing embeddings seems wrong :smiley:

I could easily implement duplicate filtering using heuristics though

I suppose a middle solution would be a very fast model specialized for similarity check that runs locally or in a microservice

Interesting choice. I am not too familiar with this method, but I see that its strength is small strings (compared to Levenshtein distance).

Is this not what doing a similarity check with embeddings is? Thinking about it more, it seems slightly counter-productive to compare a potential new string with every single other string to determine if it’s a duplicate, when vector databases do all of this very efficiently. I have never pushed the limits of processing power, but I have never had an issue with duplicates, except from small strings, which are usually semantically worthless anyway when there’s no context attached. As @curt.kennedy has mentioned, I focus more on “cleaning” the string before processing it, as there are typos and simply a hundred ways to say the same thing.

Even using text-embedding-ada-002 isn’t too bad. A string such as:

Hello my name is Frank and I demand services. My company sells “devil shoes”. The shoes without a sole! Instead of simply buying a shoe that works out the box, people need to buy a sole that deteriorates (because it is made with mushrooms!) and requires a monthly subscription. How can you benefit me

only costs $0.000026

I imagine it really depends on what you plan on doing with the cached information.

Another thing you can try, and it isn’t optimal: if you find yourself embedding lots of similar things that you can’t seem to clean up, take the embedding and check the dot-product (cosine similarity) against all your previous embeddings. If it is farther than 0.0001 (or whatever threshold) from all of them, you consider it new information and add it to the embedding database; otherwise you consider it similar and discard it.

Like I said, not as efficient as Clean → Hash → Lookup → Decide, but extreme consequences sometimes warrant extreme preventative measures. I would consider this a last-ditch effort.
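A minimal sketch of that threshold check, assuming unit-length vectors so the dot product equals cosine similarity and 1 minus the dot product is the distance being compared to the threshold (the `is_new` helper name is my own):

```python
import numpy as np

def is_new(embedding, stored, threshold=0.0001):
    """Return True if `embedding` is farther (in cosine distance) than
    `threshold` from every vector in `stored`, i.e. counts as new info.

    Assumes all vectors are unit length, so dot product == cosine similarity.
    """
    if len(stored) == 0:
        return True
    sims = np.asarray(stored) @ embedding  # dot product vs. every stored row
    return float(np.max(sims)) < 1.0 - threshold
```

In practice `stored` would be the matrix of all previously kept embeddings; anything that fails `is_new` gets discarded as a near-duplicate.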


Yes, agreed.

In terms of efficiency, usually running a complete database cycle on every request is not the way to go.
Perhaps an occasional clean-up tool?


@RonaldGRuckus That would work too, just a cron job running every X hours or days on the new inputs. And a deep scan every month or two to revisit the whole database.

Seeing how much garbage was collected and removed in one big cleanup would be so satisfying as well. My favorite part of database & data-transfer optimization is reducing the load and counting up how much (as of now, pennies) I am saving.


Hi @wfhbrian

I am really overwhelmed by the amount of information I have received about this; I don’t know if I will be able to assimilate it all.

I am not an expert in AI, nor in machine learning. I have done a little R language course for data analysis but I am not proficient in statistics or related algorithms.

Right now what I am using in my company is SaaS (Search as a Service) through Azure: in a storage account I load all the documents that I want to query, which in a first test amount to about 180,000 files spread across thousands of folders and subfolders. Once they are stored, you create an index and an indexer, and you can then query all that information without further ado.

Azure also has tools to create chatbots using its Language Studio service, but that doesn’t work for me because you have to add the files by hand, one by one, and that task would be unthinkable.

I’ve been using ChatGPT for months in my daily life, and I have seen its potential: the usefulness of being able to query all that information in a much more human way, using natural language, and above all being able to make queries that return not just the documents that include the searched term, but an actual answer to a given question. I understand that ChatGPT could be a solution as you propose, but now my big question is how to set all this up.

First of all, I understand that I need to learn Python, since I come from the .NET world. In addition, I don’t know very well what the first step, creating the embeddings, consists of. From what I have been able to read, I think the most feasible solution is the one that @curt.kennedy describes, so I will try to study everything more thoroughly so that I at least know what to ask.

Thanks for the help

I think this will soon change. Microsoft is on a tear to reinvent all of its services in ways that conceal the AI DevOps that we all struggle with. The right step (for you) might be no step at all. You have a sizeable footprint and a perfect test bed for the new Microsoft AI beta - perhaps you should petition them to join their beta program.

The index is probably an inverted Elastic-like approach. It may make sense to simply change (or augment) the indexing approach to use embeddings; then build only the ChatGPT UI to blend similarity hits with GPT. This ChatGPT client project comes to mind for the UX as does this one. This approach may represent a shorter gap between what you now have and where you want to be.

Hi @wfhbrian
I have been busy with the delivery of a project and have not been able to follow this topic. Now I am returning to it, and after reading documentation on vector embeddings, vector databases, similarity search, and some other concepts, I seem to remember that you offered to have a talk in which you could advise me on this matter. My English is very poor, so if you like we can continue discussing this consultancy privately by mail.

Hi @curt.kennedy
First of all thanks for your help.

These days I am reading a lot of documentation, especially from the Pinecone website, about embeddings, and everything points to the first thing I have to do being to generate these vector embeddings from the text that I want to be queryable. If I have understood you correctly, must I first convert the content of the different Word, Excel, and PDF files into plain text? That is, should I generate a .txt file for each source file? And then, what tools can I use to generate the vectors?