Hello all
I am trying to create semantic search functionality for our internal company documents. I am very new to embeddings and have absolutely no idea about vector representations, so I apologize in advance if my question is stupid.
I am following the Search-Ask approach from the official OpenAI docs (Question_answering_using_embeddings), but rather than the 2022 Olympics data, I am using our company's documents. I am currently on the search step:
- Collect all documents:
- Using text-embedding-ada-002, I created embeddings for each document and stored them in a vector database along with their textual content. I also prepended some initial text to each content. The overall structure of the documents is the following (all documents describe internal policies):
“Title: Policy 01
Section: 1
Content:
The purpose of this policy is …”
Title and Section are the initial texts I added. I added the title because sometimes the policy numbers are not mentioned directly in the document. The section is for chunking: if a document is chunked into 2 embeddings, there will be "Section: 1" and "Section: 2" with the same title.
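For context, my chunk-building step looks roughly like this (a minimal sketch; `build_chunks` and the example texts are my own illustrative names, not from any library):

```python
def build_chunks(title, section_texts):
    """Prepend a Title/Section header to each chunk of a policy document."""
    chunks = []
    for i, content in enumerate(section_texts, start=1):
        chunks.append(f"Title: {title}\nSection: {i}\nContent:\n{content}")
    return chunks

# A policy split into two chunks gets the same title but different sections.
chunks = build_chunks("Policy 01", [
    "The purpose of this policy is ...",
    "Employees must ...",
])
# Each chunk is then embedded with text-embedding-ada-002 and stored in the
# vector database together with its text.
```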
Now, when a user asks a question, I create an embedding for the question and pass it to my database to find relevant documents using my database's cosine similarity function. This is where my problem started:
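The search step itself is essentially the following (a sketch with toy 3-d vectors standing in for real 1536-d ada-002 embeddings; I assume my database's similarity function is equivalent to this cosine similarity):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings; in reality these come from text-embedding-ada-002.
doc_embeddings = {
    "Policy 69 / Section 1": [0.9, 0.1, 0.0],
    "Policy 12 / Section 1": [0.1, 0.8, 0.3],
}
query_embedding = [0.85, 0.15, 0.05]  # embedding of the user's question

# Return the stored chunk whose embedding is most similar to the query.
best = max(doc_embeddings,
           key=lambda k: cosine_similarity(query_embedding, doc_embeddings[k]))
```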
Suppose there is an internal policy called Policy 69. This policy talks about work-from-home arrangements.
If I ask, "What policy describes work from home arrangements?", it correctly returned the Policy 69 document.
However, when I ask, "What is policy69?", it returned a completely wrong document.
I tried using OpenAI's cosine similarity function from the Python package but got the same result, so the similarity computation does not seem to be the problem here. I suspect there is something wrong with my understanding of embeddings.
So my questions are:
- Why is my second query giving an incorrect result?
- Is there a way to fix it, perhaps on the embeddings side, or do I have to do additional steps? This second type of query is one of the critical features of the system.
Thanks!