Question answering using embeddings-based search demonstrates a technique for finding answers to specific questions using embeddings, without the need for labeled data. However, if I already have a large labeled question-answering dataset, how can I leverage it to further enhance performance?
Embed only the question, then return the answer as the result during the lookup process.
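A minimal sketch of that lookup, assuming the OpenAI Python SDK (>=1.0) and numpy; the model name, data, and helper names are illustrative, not from the thread:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()
EMBED_MODEL = "text-embedding-3-small"  # assumption; any embedding model works

# Labeled QA pairs: only the questions get embedded, the answers are just stored.
qa_pairs = [
    {"question": "What is the filing deadline?", "answer": "Within 30 days of the event."},
    {"question": "Who must sign the report?", "answer": "An authorized compliance officer."},
]

def embed(texts):
    resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
    return np.array([d.embedding for d in resp.data])

question_vectors = embed([p["question"] for p in qa_pairs])

def lookup(user_question):
    q = embed([user_question])[0]
    # Cosine similarity of the new question against every stored question.
    sims = question_vectors @ q / (
        np.linalg.norm(question_vectors, axis=1) * np.linalg.norm(q)
    )
    # Return the answer paired with the closest stored question.
    return qa_pairs[int(np.argmax(sims))]["answer"]

print(lookup("When do I have to file?"))
```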
That seems to be what is suggested in Question answering using embeddings-based search. How is the labeled data used?
What are the labels? The answer is likely to be very context-dependent.
I have a lot of QA data with questions and answers already. Based on that, how can I improve Question answering using embeddings-based search?
Embed both.
For context, I was working on a regulatory knowledge base. So, I chunked and embedded all the regulatory docs. I asked a question that should have been answered in a particular doc, but somehow the OpenAI model just couldn't get it right. So I added the question to the doc itself and re-embedded. Now the doc comes up when that question (and similar questions) are asked.
Based on that, I actually created a Q&A dataset, like yours, with the answers embedded together with the questions. In fact, I wrote code to prompt the LLM to create 10 questions for each document and append them to that document before it is embedded.
Needless to say, this has improved search results on those datasets.
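A rough sketch of that augmentation step, assuming the OpenAI Python SDK; the model names, prompt wording, and function names are mine, not the poster's exact code:

```python
from openai import OpenAI

client = OpenAI()
CHAT_MODEL = "gpt-4o-mini"              # assumption
EMBED_MODEL = "text-embedding-3-small"  # assumption

def augment_chunk(chunk_text, n_questions=10):
    # Ask the model for questions that this chunk answers.
    prompt = (
        f"Write {n_questions} questions that the following passage answers, "
        "one per line, with no numbering:\n\n" + chunk_text
    )
    resp = client.chat.completions.create(
        model=CHAT_MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    questions = resp.choices[0].message.content.strip()
    # Append the generated questions to the chunk before embedding, so that
    # question-shaped user queries land closer to this chunk in vector space.
    return chunk_text + "\n\n" + questions

def embed_chunk(augmented_text):
    resp = client.embeddings.create(model=EMBED_MODEL, input=[augmented_text])
    return resp.data[0].embedding

chunk = "Licensees must retain transaction records for five years..."
vector = embed_chunk(augment_chunk(chunk))
```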
Good idea!
Two follow-up questions:
- Do you rewrite the doc like: question 1. question 2 … question 10. original doc?
- Is it possible to do even better by teaching the LLM the answer to each question instead of the doc itself? Some questions may be too hard for the LLM to understand.
I append the questions to the doc, like this:
Regulatory doc chunk
q1
q2
q3
- Is it possible to do even better by teaching the LLM the answer to each question instead of the doc itself? Some questions may be too hard for the LLM to understand.
Now we're back to fine-tuning. I have asked the same question myself: can search responses be improved by combining embeddings and fine-tuning? Basically, submitting queries to a model fine-tuned on the same context data.
I have yet to receive an answer to that.
What about appending <question, answer> pairs instead of questions only?
To be clear, I am responding to this from the standpoint of a chat completion query process in which documents are retrieved from a vector store and passed to the model as context.
I am referring specifically to the docs that are retrieved from the vector store as "context". To each of these docs, I add the question(s) that the doc answers.
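A minimal sketch of that query flow, assuming the retrieval step already returns the augmented chunks (doc text plus appended questions) from the vector store; the model name and function names here are illustrative, not a real API:

```python
from openai import OpenAI

client = OpenAI()

def answer(user_question, retrieved_docs):
    # retrieved_docs: top-ranked augmented chunks returned by the vector store
    context = "\n\n---\n\n".join(retrieved_docs)
    messages = [
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {user_question}"},
    ]
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption; any chat model works
        messages=messages,
    )
    return resp.choices[0].message.content
```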
Under these circumstances, there is no reason to add question-answer pairs to the document, assuming that the document itself answers each and every question appended to it.
In my scenario (conversationally asking questions about site content), this makes little sense because I am assuming the document itself answers the question(s).
However, in your scenario, it might make perfect sense, particularly if you wish to augment the context of the document with more information.
This is definitely one of those “depends upon use case” cases.