Problem with a Q&A system based on a custom knowledge base

Hello
I'm currently developing a Q&A system with Pinecone.
Clients can upload PDF files, which are embedded and stored in Pinecone.
In my case, however, several of the documents are similar in content.
For example, one document is Google's financial statement and another is Microsoft's.
When a question is asked, the system retrieves the data from Pinecone that is most similar to the query.
But for a query such as “Who is director of Google?”, the system retrieves data from the Microsoft document because the Google document does not have enough matching content.
As a result, the answer is wrong.
I think I need to tune the embedding size to fix this problem.
I hope you can suggest a better approach.
Thanks.

You can either use hybrid search, which adds a keyword-based search such as BM25 alongside the embeddings, or use a filter. Why do you want to “tune the embedding size”?

Thanks for your quick reply.
Could you explain in more detail?

Sure,

Using something like BM25 alongside semantic embeddings helps differentiate between “Microsoft Financials” and “Google Financials” by focusing on the keywords (Microsoft vs Google), which I believe is what you are looking for.
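To make the keyword idea concrete, here is a minimal sketch of BM25 scoring combined with a dense similarity score. It is plain Python rather than Pinecone's own hybrid search API; the two example chunks, the dense scores, and the `alpha` weighting are made up for illustration, and `bm25_scores` / `hybrid_scores` are hypothetical helper names.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with the BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            # Rare terms (e.g. a company name) get a higher IDF weight.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def hybrid_scores(dense, sparse, alpha=0.5):
    """Blend dense and keyword scores; alpha=1.0 means dense only."""
    top = max(sparse) or 1.0  # avoid dividing by zero if nothing matched
    return [alpha * d + (1 - alpha) * (s / top) for d, s in zip(dense, sparse)]


# Two near-identical chunks that differ only in the company name.
docs = [
    "Microsoft financial statement: the director of Microsoft signed the annual report.",
    "Google financial statement: the director of Google signed the annual report.",
]
query = "Who is director of Google?"

# Made-up dense similarities in which the Microsoft chunk narrowly wins.
dense = [0.82, 0.80]

sparse = bm25_scores(query, docs)
hybrid = hybrid_scores(dense, sparse)
best = max(range(len(docs)), key=lambda i: hybrid[i])
print(docs[best])
```

Both chunks match "director" and "of", but only the Google chunk matches "google", so the keyword score pulls it ahead even though the dense score alone slightly preferred the Microsoft chunk.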

You can also use the keywords as a filter, which speeds up your results (at the risk of losing important documents), or combine them with the similarity search itself.
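The filter approach could look roughly like the sketch below. It assumes each chunk was tagged with a `company` metadata field when the PDF was uploaded; the in-memory `chunks` list, the toy vectors, and the `detect_company` helper are all hypothetical stand-ins. With Pinecone you would pass the condition as a metadata filter on the query instead of filtering a Python list, so check the Pinecone docs for the exact syntax.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical stored chunks: toy 3-d vectors plus a company tag.
chunks = [
    {"id": "ms-1", "company": "Microsoft", "vector": [0.9, 0.1, 0.3]},
    {"id": "goog-1", "company": "Google", "vector": [0.7, 0.4, 0.2]},
    {"id": "goog-2", "company": "Google", "vector": [0.2, 0.9, 0.1]},
]

KNOWN_COMPANIES = ["Google", "Microsoft"]


def detect_company(question):
    """Naive keyword match; returns None if no known company is named."""
    lowered = question.lower()
    for name in KNOWN_COMPANIES:
        if name.lower() in lowered:
            return name
    return None


def search(question, query_vector, top_k=2):
    """Filter by company first (if one is named), then rank by similarity."""
    company = detect_company(question)
    candidates = [
        c for c in chunks if company is None or c["company"] == company
    ]
    ranked = sorted(
        candidates,
        key=lambda c: cosine(query_vector, c["vector"]),
        reverse=True,
    )
    return [c["id"] for c in ranked[:top_k]]


# The query vector is closest to the Microsoft chunk, but the filter
# keeps only Google chunks because the question names Google.
results = search("Who is director of Google?", [0.9, 0.1, 0.3])
print(results)
```

The trade-off mentioned above shows up here too: if a relevant chunk was tagged with the wrong company, or not tagged at all, the filter drops it before similarity is ever computed.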

Pinecone offers both of these options (each is covered in the article above).

Thanks.
You gave me a golden idea.
Thank you very much.

Could you answer one more question?
What is the principle behind sparse vector search?