How is vector search able to match exact keywords (even randomly generated words that have no meaning)?

I’m building a POC for my LLM-based project, and for that I’m using a vector database for document retrieval (IR).

Recently, I came across a few blogs from some of the most famous vector databases that suggest using hybrid search (vector search + keyword search) for better IR. According to them, this mainly helps with domain-specific keywords.
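
For context, here is a minimal sketch of what "hybrid" can mean in practice: run both searches and merge the two ranked lists. It uses reciprocal rank fusion, which is one common way to merge rankings and not necessarily what those blogs implement. The document IDs and both result lists are made up for illustration, and `k=60` is an assumed default.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical result lists for the query "I want to return Zinsace".
vector_hits = ["zinsape", "zinsace", "zisada"]  # made-up semantic ranking
keyword_hits = ["zinsace"]                      # made-up exact-term ranking

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

In this made-up case the keyword list pulls the exact match to the top even though the vector list ranked a near-identical name first.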

So before implementing hybrid search, I ran some tests and, surprisingly, found that all those blogs are wrong: with vector search alone, I’m able to match domain-specific keywords from the query.

My Testing

  • Generated some keywords that have no meaning and, moreover, don’t exist
  • I’m using ChromaDB as the vector database, which uses hnswlib for ANN

Sample Documents

{
    "document_name": "Return Policy",
    "Category": "Fashion",
    "Product Name": "Zinsace",
    "Policy": "Customers can return the product within 14 days of purchase if it is unworn, with all tags attached and in its original condition. A refund will be provided in the original form of payment. However, customized or personalized Zinsace products are non-returnable."
},
{
    "document_name": "Return Policy",
    "Category": "Electronics",
    "Product Name": "Zisava",
    "Policy": "Customers can return the product within 30 days of purchase if it is unopened and in its original packaging. A refund will be issued in the original form of payment, excluding any shipping fees. However, Zisava products that have been used or show signs of damage are non-returnable."
},
{
    "document_name": "Return Policy",
    "Category": "Fashion",
    "Product Name": "Zinsape",
    "Policy": "Customers can return the product within 14 days of purchase if it is unworn, with all tags attached and in its original condition. A refund will be provided in the original form of payment. However, customized or personalized Zinsape products are non-returnable."
},
{
    "document_name": "Return Policy",
    "Category": "Electronics",
    "Product Name": "Zisada",
    "Policy": "Customers can return the product within 30 days of purchase if it is unopened and in its original packaging. A refund will be issued in the original form of payment, excluding any shipping fees. However, Zisada products that have been used or show signs of damage are non-returnable."
}

Script to Index & Search

import uuid

import chromadb
from chromadb.config import Settings
from chromadb.utils import embedding_functions

from hybrid.dummy_data import DUMMY_DATA

# Local client that persists its data under ./hybrid
client = chromadb.Client(Settings(
    chroma_db_impl="duckdb+parquet",
    persist_directory="./hybrid"
))

# Embedding functions under test; pass one of them to the collection below
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="XXXX",
    model_name="text-embedding-ada-002"
)

st_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name='all-mpnet-base-v2')

# Default SentenceTransformer model (all-MiniLM-L6-v2)
# st_ef_mini = embedding_functions.SentenceTransformerEmbeddingFunction()

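# Only the policy text is embedded; the other fields are stored as metadata
texts = [doc['Policy'] for doc in DUMMY_DATA]

metadatas = [{k: v for k, v in d.items() if k != 'Policy'} for d in DUMMY_DATA]

# One collection per embedding model, using L2 distance in the HNSW index
collection = client.get_or_create_collection(name="mpnet", metadata={'hnsw:space': 'l2'},
                                             embedding_function=st_ef)
# Random IDs: re-running the script adds the same documents again
ids = [str(uuid.uuid4()) for _ in texts]

collection.add(
    documents=texts,
    metadatas=metadatas,
    ids=ids
)

# Query with a made-up product name and print the matched documents
res = collection.query(
    query_texts=["I want to return Zinsace"],
    n_results=10
)

print(res.get('documents'))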
  • Output
    [['Customers can return the product within 14 days of purchase if it is unworn, with all tags attached and in its original condition. A refund will be provided in the original form of payment. However, customized or personalized **Zinsace** products are non-returnable.', 'Customers can return the product within 30 days of purchase if it is unopened and in its original packaging. A refund will be issued in the original form of payment, excluding any shipping fees. However, **Zisada** products that have been used or show signs of damage are non-returnable.', 'Customers can return the product within 30 days of purchase if it is unopened and in its original packaging. A refund will be issued in the original form of payment, excluding any shipping fees. However, **Zisava** products that have been used or show signs of damage are non-returnable.']]

Output Analysis

  • I’ve used 3 models for embeddings
    • text-embedding-ada-002
    • all-mpnet-base-v2
    • all-MiniLM-L6-v2
  • I indexed some documents related to refund policy, with product names that are random (they have no meaning and don’t exist)
  • When I tried the queries "I want to return Zinsace" or "I want to buy Zinsace", the first result returned was correct with all 3 embedding models, so vector search is able to do an exact keyword match
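
To check this for every product name rather than a single query, a small helper like the one below could be used. This is only a sketch: `top1_hit_rate` and `stand_in_search` are hypothetical names, the documents are shortened versions of the samples above, and the stand-in search is a plain word-overlap ranking used as a placeholder, not a vector search.

```python
def top1_hit_rate(search, cases):
    """Fraction of (query, expected_term) cases whose first result contains the term."""
    hits = 0
    for query, expected in cases:
        results = search(query)
        if results and expected in results[0]:
            hits += 1
    return hits / len(cases)

# Shortened versions of the sample policies (assumption: only the last sentence matters here).
DOCS = [
    "Customized or personalized Zinsace products are non-returnable.",
    "Zisava products that have been used are non-returnable.",
    "Customized or personalized Zinsape products are non-returnable.",
    "Zisada products that have been used are non-returnable.",
]

def stand_in_search(query):
    """Placeholder ranking by shared lowercase words; NOT a vector search."""
    q = set(query.lower().split())

    def overlap(doc):
        return len(q & set(doc.lower().rstrip(".").split()))

    return sorted(DOCS, key=overlap, reverse=True)

names = ["Zinsace", "Zisava", "Zinsape", "Zisada"]
cases = [(f"I want to return {name}", name) for name in names]
cases += [(f"I want to buy {name}", name) for name in names]
print(top1_hit_rate(stand_in_search, cases))
```

With the collection from the script, `search` could be something like `lambda q: collection.query(query_texts=[q], n_results=1)['documents'][0]`.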

This left me confused about how these models are able to generate embeddings that can also do exact keyword matches, even for words that the models have never seen before.

If vector search is able to do keyword matching, why do all the vector database vendors suggest using hybrid search? Haven’t they tested properly, or are they biased?


A lot of it depends on the tokenizer for the embedding model - some tokenizers split unknown words into multiple tokens. Afaik the models you mentioned don’t do this, but don’t quote me on that.
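
To show what "splitting an unknown word into multiple tokens" could look like, here is a toy sketch. The vocabulary is entirely hypothetical and is not that of any model mentioned in this thread; real tokenizers learn their vocabularies from data. The point is only that a made-up word still breaks down into known pieces, and two made-up words can share some of them.

```python
# Hypothetical subword vocabulary (assumption, for illustration only).
VOCAB = {"zin", "zis", "sa", "ce", "pe", "va", "da", "a"}

def split_subwords(word, vocab=VOCAB):
    """Greedy longest-match split, falling back to single characters."""
    word, pieces, i = word.lower(), [], 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

def jaccard(a, b):
    """Share of distinct pieces that two words have in common."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

query_pieces = split_subwords("Zinsace")
for name in ["Zinsace", "Zinsape", "Zisava", "Zisada"]:
    pieces = split_subwords(name)
    print(name, pieces, jaccard(query_pieces, pieces))
```

Under this toy vocabulary, "Zinsace" shares all of its pieces with itself, two of three with "Zinsape", and none with "Zisava" or "Zisada".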

EDIT: I just tried your example with Milvus and got the same results. Interesting - I’ll dig deeper into it.
