Question answering using embeddings-based search demonstrates a technique for finding answers to specific questions using embeddings, without the need for labeled data. However, if I already have a large labeled question-answering dataset, how can I leverage it to further enhance performance?
Embed only the question, then return the answer as the result during the lookup process.
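A minimal sketch of that lookup, assuming the OpenAI Python SDK (>=1.0) and numpy; the model name, data, and helper names are illustrative, not from the thread:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()
EMBED_MODEL = "text-embedding-3-small"  # assumption; any embedding model works

# Labeled QA pairs: only the questions get embedded, the answers are just stored.
qa_pairs = [
    {"question": "What is the filing deadline?", "answer": "Within 30 days of the event."},
    {"question": "Who must sign the report?", "answer": "An authorized compliance officer."},
]

def embed(texts):
    resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
    return np.array([d.embedding for d in resp.data])

question_vectors = embed([p["question"] for p in qa_pairs])

def lookup(user_question):
    q = embed([user_question])[0]
    # Cosine similarity of the new question against every stored question.
    sims = question_vectors @ q / (
        np.linalg.norm(question_vectors, axis=1) * np.linalg.norm(q)
    )
    # Return the answer paired with the closest stored question.
    return qa_pairs[int(np.argmax(sims))]["answer"]

print(lookup("When do I have to file?"))
```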
That seems to be what is suggested in Question answering using embeddings-based search. How is the labeled data used?
What are the labels? The answer is likely to be very context-dependent.
I have a lot of QA data with questions and answers already. Based on that, how can I improve Question answering using embeddings-based search?
Embed both.
For context, I was working on a regulatory knowledge base. So, I chunked and embedded all the regulatory docs. I asked a question that should have been answered in a particular doc, but somehow the OpenAI model just couldn't get it right. So I added the question to the doc itself and re-embedded. Now the doc comes up when that question (and similar questions) are asked.
Based on that, I actually created a Q&A dataset, like yours, with the answers embedded together with the questions. In fact, I wrote code to prompt the LLM to create 10 questions for each document and append them to that document before it is embedded.
Needless to say, this has improved search results on those datasets.
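A rough sketch of that augmentation step, assuming the OpenAI Python SDK; the model names, prompt wording, and function names are mine, not the poster's exact code:

```python
from openai import OpenAI

client = OpenAI()
CHAT_MODEL = "gpt-4o-mini"              # assumption
EMBED_MODEL = "text-embedding-3-small"  # assumption

def augment_chunk(chunk_text, n_questions=10):
    # Ask the model for questions that this chunk answers.
    prompt = (
        f"Write {n_questions} questions that the following passage answers, "
        "one per line, with no numbering:\n\n" + chunk_text
    )
    resp = client.chat.completions.create(
        model=CHAT_MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    questions = resp.choices[0].message.content.strip()
    # Append the generated questions to the chunk before embedding, so that
    # question-shaped user queries land closer to this chunk in vector space.
    return chunk_text + "\n\n" + questions

def embed_chunk(augmented_text):
    resp = client.embeddings.create(model=EMBED_MODEL, input=[augmented_text])
    return resp.data[0].embedding

chunk = "Licensees must retain transaction records for five years..."
vector = embed_chunk(augment_chunk(chunk))
```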
Good idea!
Two follow-up questions:
- Do you rewrite the doc like: question 1. question 2 … question 10. original doc?
- Is it possible to do even better by teaching the LLM the answer to each question instead of the doc itself? Some questions may be too hard for the LLM to understand.
I append the questions to the doc, like this:
Regulatory doc chunk
q1
q2
q3
- Is it possible to do even better by teaching the LLM the answer to each question instead of the doc itself? Some questions may be too hard for the LLM to understand.
Now we're back to fine-tuning. I have asked the same question myself: can search responses be improved by combining embeddings and fine-tuning? Basically, submitting queries to a model fine-tuned on the same context data.
I have yet to receive an answer to that.
What about appending <question, answer> pairs instead of questions only?
To be clear, I am responding to this from the standpoint of a chat completion query process in which documents are retrieved from a vector store and passed to the model as context.
I am referring specifically to the docs that are retrieved from the vector store as "context". To each of these docs, I add the question(s) that the doc answers.
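A minimal sketch of that query flow, assuming the retrieval step already returns the augmented chunks (doc text plus appended questions) from the vector store; the model name and function names here are illustrative, not a real API:

```python
from openai import OpenAI

client = OpenAI()

def answer(user_question, retrieved_docs):
    # retrieved_docs: top-ranked augmented chunks returned by the vector store
    context = "\n\n---\n\n".join(retrieved_docs)
    messages = [
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {user_question}"},
    ]
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption; any chat model works
        messages=messages,
    )
    return resp.choices[0].message.content
```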
Under these circumstances, there is no reason to add question-answer pairs to the document, assuming that the document itself answers each and every question appended to it.
In my scenario (conversationally asking questions about site content), this makes little sense because I am assuming the document itself answers the question(s).
However, in your scenario, it might make perfect sense, particularly if you wish to augment the context of the document with more information.
This is definitely one of those “depends upon use case” cases.