How to best prepare a FAQ document for embeddings

I’m trying to embed a FAQ document, and I read about this on the OpenAI Platform, specifically the Amazon review example here: https://github.com/openai/openai-cookbook/blob/8e6e058c6a1eefdcc7a69b4c38c39e19495716b4/examples/Obtain_dataset.ipynb.
As you can see, only the title and text columns are used for the embedding.

My FAQ document also has multiple columns of metadata. The columns are title (subject), summary, category, question, and answer. What is a good approach to embedding? The most relevant columns are of course question and answer. Should I omit the other columns?

What I would try first would be to build two embedding values per question: one for just the question, and one for the concatenation of all the text columns.

The reason this might work well is that questions all have a similar shape, while answers have a slightly different shape, so the question part of the FAQ might be a better match on its own than the entire answer, BUT if the user’s question isn’t one of the pre-canned questions, the full text might still match well enough.

This means that two different embedding vectors map to the same document, so you might get duplicate hits in the lookup and need to de-duplicate before presenting results to the user.

What do you mean? In the example they use a combined column:

df_faq["combined"] = (
    "Title: " + df_faq.subject.str.strip() + "Category: " + df_faq.subject.str.strip() + "; Issue/Question: "+ df_faq.question.str.strip() + "; Solution: " +
df_faq.Solution.str.strip())

How can I compute 2 embeddings from this?

First, that “combined” prompt won’t work well, because it jams the word “Category” right onto the end of whatever the subject string is. If the subject ends with “foo”, then the string “fooCategory:” will appear in the combined text.
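
For illustration, a version with explicit separators might look like this (a sketch only; it assumes your category and answer columns are actually named category and answer, since your snippet reuses subject and references Solution):

df_faq["combined"] = (
    "Title: " + df_faq.subject.str.strip()
    + "; Category: " + df_faq.category.str.strip()   # assumed column name
    + "; Issue/Question: " + df_faq.question.str.strip()
    + "; Solution: " + df_faq.answer.str.strip()     # assumed column name
)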

I’m assuming that you have something that calculates embeddings from strings, i.e., it implements the function [Document] -> [EmbeddingVector].

I assume you also have something that stores embedding vectors under some key (or index), i.e., it implements the function EmbeddingVector -> Index (or, more likely, retrieves the N closest with EmbeddingVector -> [(Index, Score)]).

You then build a document retrieval system by building a separate table of Index -> Document on the side.
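
To make those assumed pieces concrete, in Python they might look like this (a sketch only; every name here is hypothetical, chosen to match the pseudocode below):

from dataclasses import dataclass
from typing import Protocol

@dataclass
class Document:
    title: str
    subject: str
    category: str
    question: str
    answer: str

class VectorIndex(Protocol):
    def add(self, embedding: list[float]) -> int: ...  # store a vector, return its Index
    def search(self, embedding: list[float], n: int) -> list[tuple[int, float]]: ...  # N closest as (Index, Score)

class DocumentStore(Protocol):
    def addDocument(self, index: int, document: Document) -> None: ...
    def getDocument(self, index: int) -> Document: ...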

There’s nothing saying the Document you feed in must be the exact string that you return as the Document from the index. There’s also nothing saying you must have only one Index per physical Document. Thus, to build two embeddings for a document, you’d do something like:

# Embedding 1: the question alone
embedding1 = computeEmbedding(document.question)
index = vectorIndex.add(embedding1)
database.addDocument(index, document)

# Embedding 2: the concatenation of all the other text columns
embedding2 = computeEmbedding(document.title + "\n" + document.subject + "\n" + document.category + "\n" + document.answer)
index = vectorIndex.add(embedding2)
database.addDocument(index, document)  # both indices map back to the same document
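
At query time you’d do the reverse: embed the incoming question, retrieve the N closest indices, and de-duplicate by document, since both embeddings of a FAQ entry point at the same one. A sketch, using the hypothetical search/getDocument interfaces from above:

queryEmbedding = computeEmbedding(userQuestion)

seen = set()
documents = []
for index, score in vectorIndex.search(queryEmbedding, n=10):
    document = database.getDocument(index)
    if document.question not in seen:  # two indices can map to the same FAQ entry
        seen.add(document.question)
        documents.append(document)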

Of course, exactly how you sort this out, depends on the specifics of your application.


OK, I see. Thank you, I’ll think about it. My chatbot seems to work; it uses the embeddings. Is there a way to add more variation? I tried the temperature option, but mostly it just uses the exact embedded text. Perhaps via the prompt? My prompt is from the example: Answer the question based on the context below, and if the question can’t be answered based on the context, say "I don’t know"\n\nContext: {context}\n\n---\n\nQuestion: {question}\nAnswer:

Typically, you will use the N best matches, rather than just one match, for the context.
(This of course consumes more tokens.)
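
Building the context from the top N might look like this (a sketch; the separator and N=3 are arbitrary choices, and documents is the de-duplicated list from the earlier sketch):

context = "\n\n---\n\n".join(doc.answer for doc in documents[:3])
prompt = (
    "Answer the question based on the context below, and if the question "
    "can't be answered based on the context, say \"I don't know\"\n\n"
    f"Context: {context}\n\n---\n\nQuestion: {userQuestion}\nAnswer:"
)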

“More variation” in a document retrieval system could be considered a bug, not a desired feature 🙂
If you really want more variation, try generating a random word and prepending it to the question. Something like:

fnord = randomWord()  # any source of random words, e.g. random.choice(wordList)

question = f"Case {fnord}; Question: {questionText}"