How similar are the vectors for a word/phrase and the vector for text that includes that word/phrase?

I’m using the ada-002 embedding model to create vectors for text. This is part of a RAG system with a corpus of documents, where the domain and corpus model are stored in a knowledge graph. One thing I want to do is find occurrences of entities from the graph in the text of the corpus. I divide the documents into small chunks based on section headers and create a vector for each section. I also create a vector for each entity in the graph.

Essentially, I want to do what people call Named Entity Recognition (NER). E.g., if a section of text describes problems with product Prod1, I want to capture that in the graph. I’m currently using a simple NLP library for this, but I was thinking that nearest-neighbor search — finding sections of text whose vectors are near entity vectors from the graph — might be a better way. It’s not working.

I’m not sure whether I’m doing something wrong with the API or (what I think is more likely) misunderstanding how vectors model semantics. I.e., it occurred to me that the vector created for several paragraphs discussing a specific problem with Prod1 may actually be very far from the vector for just the name “Prod1”, so using vectors this way may not be how to do NER at all. I just wanted to check whether that is correct, because I’m new to the API and it’s quite possible I’m doing something wrong.
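For concreteness, the similarity check in question can be sketched as below; the commented-out `embed` calls are placeholders for a real embeddings API, and the short vectors are made-up values just so the snippet runs:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# In practice these would come from the embeddings model, e.g.:
# entity_vec = embed("Prod1")            # hypothetical API call
# section_vec = embed(section_text)      # hypothetical API call
entity_vec = [0.9, 0.1, 0.0]
section_vec = [0.3, 0.8, 0.5]

# The question is whether this score is high enough to treat the
# section as "about" the entity -- for a bare name vs. several
# paragraphs, it often is not.
print(round(cosine_similarity(entity_vec, section_vec), 3))
```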

Here’s a key to enlightenment: you don’t have to embed only the text chunk to be retrieved.

You can have multiple layers of embeddings that all resolve to the same chunks: keyword embeddings, metadata embeddings, question embeddings — where the retrieved result is still the same chunk, provided to the AI as knowledge. You can even embed simulated user-style questions that return that chunk.
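A minimal sketch of that layered index: several derived texts (the chunk itself, a keyword, a metadata string, a simulated question) each get their own vector, but every entry points back at the same chunk id. The `embed` function here is a toy character-frequency stand-in for a real embeddings model — only the index structure matters:

```python
import math

def embed(text):
    """Toy stand-in for a real embeddings model: a letter-frequency
    vector. A real system would call the embeddings API instead."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Several embeddings, one retrieval target: each entry maps a derived
# text's vector back to the same chunk. Chunk ids are illustrative.
index = [
    (embed("Customers report intermittent failures in Prod1"), "section-4.2"),
    (embed("Prod1"), "section-4.2"),                              # keyword layer
    (embed("product: Prod1, type: problem report"), "section-4.2"),  # metadata layer
    (embed("What problems does Prod1 have?"), "section-4.2"),     # simulated question
    (embed("Installation guide for Prod2"), "section-1.1"),
]

def search(query, k=3):
    """Return the chunk ids of the k nearest index entries."""
    q = embed(query)
    scored = sorted(index, key=lambda e: cosine(q, e[0]), reverse=True)
    return [cid for _, cid in scored[:k]]

print(search("Prod1"))
```

Because the bare keyword “Prod1” has its own index entry, a short entity-style query now lands on the chunk even when the chunk’s own long-text vector would not have matched well.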

This makes passive search more straightforward: a query can match on the chunk text, a keyword, or a simulated question. Since several layers can return the same chunk, you then de-duplicate any overlapping results.
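De-duplication here is just keeping the first (highest-ranked) occurrence of each chunk id; a minimal sketch, with illustrative chunk ids:

```python
def dedupe(ranked_chunk_ids):
    """Keep the first occurrence of each chunk id, preserving rank order."""
    seen = set()
    unique = []
    for cid in ranked_chunk_ids:
        if cid not in seen:
            seen.add(cid)
            unique.append(cid)
    return unique

# Three index layers returned the same chunk; only distinct chunks
# survive, best rank first.
print(dedupe(["section-4.2", "section-4.2", "section-1.1", "section-4.2"]))
```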

If your searches are more akin to database queries, though, you might want to simply give the AI a keyword search function — without telling the AI that, behind the scenes, the top-k results are filtered by an embeddings-based similarity threshold using context.
