How to best prepare a FAQ document for embeddings

I’m trying to embed a FAQ document, and I read about this on the OpenAI Platform, specifically the Amazon review example here: https://github.com/openai/openai-cookbook/blob/8e6e058c6a1eefdcc7a69b4c38c39e19495716b4/examples/Obtain_dataset.ipynb.
As you can see, only the title and text columns are used for the embedding.

My FAQ document also has multiple columns of metadata. The columns are title (subject), summary, category, question, and answer. What is a good approach to embedding? The most relevant columns are of course question and answer. Should I omit the other columns?

What I would try first would be to build two embedding values per question: one for just the question, and one for the concatenation of all the text columns.

The reason this might work well is that questions all have a similar shape, while answers have a slightly different shape, so the question part of the FAQ might be a better match on its own than the entire answer, BUT if the user’s question isn’t one of the pre-canned questions, the full text might still match well enough.

This means that two different embedding vectors map to the same document, so you might get duplicate hits in the lookup and need to de-duplicate before presenting results to the user.

What do you mean? In the example they use a combined column:

df_faq["combined"] = (
    "Title: " + df_faq.subject.str.strip() + "Category: " + df_faq.subject.str.strip() + "; Issue/Question: "+ df_faq.question.str.strip() + "; Solution: " +
df_faq.Solution.str.strip())

How can I compute 2 embeddings from this?

First, that “combined” prompt won’t work well, because it jams the word “Category” right onto the end of whatever the subject string is. If the subject ends with “foo”, then the string “fooCategory:” will appear in the combined text.
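
For illustration, a version with explicit separators might look like this (a sketch only; it assumes your category and answer columns are actually named category and answer, since your snippet reuses subject and references Solution):

df_faq["combined"] = (
    "Title: " + df_faq.subject.str.strip()
    + "; Category: " + df_faq.category.str.strip()   # assumed column name
    + "; Issue/Question: " + df_faq.question.str.strip()
    + "; Solution: " + df_faq.answer.str.strip()     # assumed column name
)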

I’m assuming that you have something that calculates embeddings from strings, i.e., it implements the function [Document] -> [EmbeddingVector].

I assume you also have something that stores embedding vectors under some key (or index), i.e., it implements the function EmbeddingVector -> Index (or, more likely, retrieves the N closest with EmbeddingVector -> [(Index, Score)]).

You then build a document retrieval system by building a separate table of Index -> Document on the side.
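
To make those assumed pieces concrete, in Python they might look like this (a sketch only; every name here is hypothetical, chosen to match the pseudocode below):

from dataclasses import dataclass
from typing import Protocol

@dataclass
class Document:
    title: str
    subject: str
    category: str
    question: str
    answer: str

class VectorIndex(Protocol):
    def add(self, embedding: list[float]) -> int: ...  # store a vector, return its Index
    def search(self, embedding: list[float], n: int) -> list[tuple[int, float]]: ...  # N closest as (Index, Score)

class DocumentStore(Protocol):
    def addDocument(self, index: int, document: Document) -> None: ...
    def getDocument(self, index: int) -> Document: ...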

There’s nothing saying the Document you feed in must be the exact string that you return as the Document from the index. There’s also nothing saying you must have only one Index per physical Document. Thus, to build two embeddings for a document, you’d do something like:

# Embedding 1: the question alone
embedding1 = computeEmbedding(document.question)
index = vectorIndex.add(embedding1)
database.addDocument(index, document)

# Embedding 2: the concatenation of all the other text columns
embedding2 = computeEmbedding(document.title + "\n" + document.subject + "\n" + document.category + "\n" + document.answer)
index = vectorIndex.add(embedding2)
database.addDocument(index, document)  # both indices map back to the same document
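
At query time you’d do the reverse: embed the incoming question, retrieve the N closest indices, and de-duplicate by document, since both embeddings of a FAQ entry point at the same one. A sketch, using the hypothetical search/getDocument interfaces from above:

queryEmbedding = computeEmbedding(userQuestion)

seen = set()
documents = []
for index, score in vectorIndex.search(queryEmbedding, n=10):
    document = database.getDocument(index)
    if document.question not in seen:  # two indices can map to the same FAQ entry
        seen.add(document.question)
        documents.append(document)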

Of course, exactly how you sort this out, depends on the specifics of your application.


OK, I see. Thank you, I’ll think about it. My chatbot seems to work; it uses the embeddings. Is there a way to add more variation? I tried the temperature option, but mostly it just uses the exact embedded text. Perhaps via the prompt? My prompt is from the example: Answer the question based on the context below, and if the question can’t be answered based on the context, say "I don’t know"\n\nContext: {context}\n\n---\n\nQuestion: {question}\nAnswer:

Typically, you will use the N best matches, rather than just one match, for the context.
(This of course consumes more tokens.)
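
Building the context from the top N might look like this (a sketch; the separator and N=3 are arbitrary choices, and documents is the de-duplicated list from the earlier sketch):

context = "\n\n---\n\n".join(doc.answer for doc in documents[:3])
prompt = (
    "Answer the question based on the context below, and if the question "
    "can't be answered based on the context, say \"I don't know\"\n\n"
    f"Context: {context}\n\n---\n\nQuestion: {userQuestion}\nAnswer:"
)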

“More variation” in a document retrieval system could be considered a bug, not a desired feature 🙂
If you really want more variation, try generating a random word and prepending it to the question. Something like:

fnord = randomWord()  # any source of random words, e.g. random.choice(wordList)

question = f"Case {fnord}; Question: {questionText}"