How many embedding vectors are we talking here?
I just ran a benchmark on AWS Lambda (Python 3.11, on both ARM64 and x86_64): searching 400,000 embeddings takes less than 1 second.
ARM64 is now the faster of the two. To get NumPy for ARM64, just add the AWS-provided layer AWSSDKPandas-Python311-Arm64.
The main limitation is memory (see printouts below).
On my local machine, I can search ~1 million vectors x 1536 dims in less than 1 second.
But here are the public AWS cloud benchmarks if you want to go with Lambda.
Python 3.11, x86_64:
MIPS naive: Elapsed time 0.9262909889221191
MIPS extremes: Elapsed time 0.9536223411560059
MIPS vectorized: Elapsed time 5.194347858428955
MIPS top: Elapsed time 0.9222877025604248
400000 x 1536 (# vectors, # dims per vector)
Memory Size: 10240 MB, Max Memory Used: 9526 MB
Python 3.11, ARM64:
MIPS naive: Elapsed time 0.8463380336761475
MIPS extremes: Elapsed time 0.8570494651794434
MIPS vectorized: Elapsed time 4.841116905212402
MIPS top: Elapsed time 0.841890811920166
400000 x 1536 (# vectors, # dims per vector)
Memory Size: 10240 MB, Max Memory Used: 9532 MB
Believe it or not, a for-loop over an enumerate iterator turned out faster than the vectorized NumPy operation. Here is the MIPS top code (plus a supporting function) I wrote that beats direct vectorization in NumPy.
import numpy as np

def sortPerm(x, reverseFlag=False):
    # Sort x, returning both the sorted values and the sorting permutation.
    if reverseFlag:
        idx = np.argsort(-x)  # descending
    else:
        idx = np.argsort(x)   # ascending
    return x[idx], idx

def mips_top(q, vecs, Nextremes):
    # Maintain the Nextremes largest inner products seen so far.
    TopN = np.full(Nextremes, float('-inf'))
    TopIdx = np.full(Nextremes, -1)
    mTopN = float('-inf')  # smallest score currently kept
    for i, v in enumerate(vecs):
        c = np.dot(q, v)
        if c > mTopN:
            TopN = np.append(TopN, c)      # append the new value
            TopIdx = np.append(TopIdx, i)  # append the new index
            TopN, Perm = sortPerm(TopN, True)  # largest to smallest
            TopIdx = TopIdx[Perm]
            TopN = TopN[:Nextremes]
            TopIdx = TopIdx[:Nextremes]
            mTopN = np.min(TopN)  # new admission threshold
    return {"TopIdx": TopIdx, "TopN": TopN}
MIPS stands for “Maximum Inner Product Search”, which is what we are doing here for unit embedding vectors.
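Since the embedding vectors are unit-norm, maximizing the inner product is the same as maximizing cosine similarity (cosine = dot product divided by the two norms, and both norms are 1). A quick self-contained check, using random vectors rather than real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
q, v = rng.standard_normal((2, 1536))
q /= np.linalg.norm(q)  # normalize to unit length
v /= np.linalg.norm(v)

cosine = np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))
assert np.isclose(np.dot(q, v), cosine)  # identical for unit vectors
```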