A problem in a Q&A system based on a custom knowledge base

harryporter319193 · August 3, 2023, 7:26pm

Hello,
I’m developing a Q&A system with Pinecone.
Clients can upload PDF files, which are embedded and stored in Pinecone.
But in my case, several documents are similar in content.
For example, one doc is a financial statement for Google and another is one for Microsoft.
When a question is asked, the system retrieves the data most similar to the query from Pinecone.
But for a query like “Who is director of Google?”, the system retrieves data from the Microsoft doc because the Google doc does not have enough relevant content.
Then the answer is wrong.
I think I need to tune the embedding size to fix this problem.
I hope you can suggest a better approach.
Thanks.

RonaldGRuckus · August 3, 2023, 7:53pm

You can either use hybrid search, adding a keyword-based method like BM25, or use a filter. Why do you want to “tune the embedding size”?

harryporter319193 · August 3, 2023, 8:08pm

Thanks for your quick reply.
Could you explain in more detail?

RonaldGRuckus · August 3, 2023, 8:12pm

Sure,

Using something like BM25 alongside semantic embeddings helps differentiate between “Microsoft Financials” and “Google Financials” by focusing on the keywords (Microsoft vs Google), which I believe is what you are looking for.
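To make that concrete, here is a rough sketch of hybrid scoring in plain Python. It is not Pinecone's implementation: the BM25 function is a small hand-rolled version of the usual Okapi formula, the dense similarities are made-up numbers standing in for real embedding scores, and `hybrid_scores` with its `alpha` weight is a hypothetical helper for illustration only.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and split on non-word characters so "Google?" becomes "google"
    return re.findall(r"\w+", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc against the query with the Okapi BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    # Document frequency: in how many docs each term appears
    df = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def hybrid_scores(dense, sparse, alpha=0.5):
    """Weighted sum: alpha weights the dense (semantic) score,
    1 - alpha weights the keyword score scaled to the range 0..1."""
    top = max(sparse) or 1.0
    return [alpha * d + (1 - alpha) * s / top for d, s in zip(dense, sparse)]

docs = [
    "Google financial statement: board of directors",
    "Microsoft financial statement: board of directors",
]
# Assumed cosine similarities: the Microsoft chunk scores slightly higher
dense = [0.80, 0.82]
sparse = bm25_scores("Who is director of Google?", docs)
combined = hybrid_scores(dense, sparse, alpha=0.5)
best = max(range(len(docs)), key=lambda i: combined[i])
print(docs[best])
```

With semantic scores alone the Microsoft chunk would win narrowly; the keyword match on “Google” is what tips the combined ranking toward the Google chunk. How much weight to give each side is something you would tune on your own data.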

You can also use the keywords as a filter, which speeds up your results (at the risk of losing important documents), or combine them with your similarity search.
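A minimal sketch of the filter idea, again in plain Python rather than against the Pinecone API: each chunk carries a `company` metadata field (an assumed field name you would set at upload time), and the search only ranks chunks whose metadata matches. The vectors are toy values, and `search` is a hypothetical helper.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, chunks, company=None, top_k=1):
    """Rank chunks by cosine similarity, optionally restricted
    to chunks whose 'company' metadata equals the given value."""
    candidates = [c for c in chunks if company is None or c["company"] == company]
    ranked = sorted(candidates, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return ranked[:top_k]

# Toy data: the Microsoft chunk happens to sit closest to the query vector
chunks = [
    {"id": "google-1", "company": "google", "vector": [0.9, 0.1, 0.3]},
    {"id": "microsoft-1", "company": "microsoft", "vector": [0.9, 0.2, 0.3]},
]
query_vec = [0.9, 0.2, 0.3]

unfiltered = search(query_vec, chunks)
filtered = search(query_vec, chunks, company="google")
print(unfiltered[0]["id"], filtered[0]["id"])
```

The filter guarantees that only Google chunks can come back, but it depends on detecting the company name in the query correctly. If that step is wrong or the metadata is missing, relevant chunks are dropped before ranking, which is the risk mentioned above.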

Pinecone offers both of these options (each option in the article above)

harryporter319193 · August 3, 2023, 8:57pm

Thanks.
You gave me a golden idea.
Thank you very much.

harryporter319193 · August 4, 2023, 8:27am

Can you answer my question?
What is the principle of sparse vector search?
