How to create FAQ on internal company data?

There are many ways to do this. I approached this very simplistically using embeddings.

Vectorize the Questions

The term is a bit fuzzy, but it means taking the text of each known question, passing it to the LLM, and asking it to return the vector for that blob of text. This is known as an embedding vector. It’s like a little mathematical fingerprint that identifies the text inside the LLM.
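To make the idea concrete, here is a minimal sketch of turning text into a fixed-length vector. The post does not name a model or API, so `toy_embed` is a hypothetical stand-in (a hashed bag-of-words) rather than a real LLM embedding; in practice you would swap it for a call to whichever embedding model you use.

```python
import hashlib
import math
import re


def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Hypothetical stand-in for an embedding model.

    Hashes each word into one of `dim` slots, counts occurrences,
    then scales the vector to unit length.
    """
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # A stable hash, so the same word always lands in the same slot.
        digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


# Example question text, made up for illustration.
question_vector = toy_embed("How do I reset my VPN password?")
```

Unlike this toy, a real embedding model places texts with similar meaning close together even when they share no words, which is what makes the approach useful for an FAQ.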

Store the Embedding Vector

This is the simple part: given the embedding vector, save it in a database that makes it easy to recall. We’ll refer to this as the vector database. But don’t just save the mathematical fingerprint; save the text of the question and the answer to that question (which you already know). Also save other valuable information about the answer, such as a list of keywords about the question and the answer, perhaps the date it was created, and who created the answer. This metadata about the answer may come in handy.
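As a rough sketch of what one stored record might look like, the example below keeps the vector together with the question, answer, and metadata. The `FaqRecord` class, its field names, and the sample values are all assumptions for illustration, and a plain Python list stands in for the vector database.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class FaqRecord:
    """One known question with its answer, vector, and metadata (hypothetical schema)."""
    question: str
    answer: str
    vector: list[float]
    keywords: list[str] = field(default_factory=list)
    created_on: date = field(default_factory=date.today)
    author: str = ""


# An in-memory list standing in for a real vector database.
vector_db: list[FaqRecord] = []

vector_db.append(
    FaqRecord(
        question="How do I reset my VPN password?",
        answer="Open the self-service portal and choose 'Reset password'.",
        vector=[0.12, 0.80, 0.58],  # placeholder values, not a real embedding
        keywords=["vpn", "password"],
        created_on=date(2023, 5, 1),
        author="it-helpdesk",
    )
)
```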

The UX - Answer Questions

This is the solution part of the process. It’s where a user asks a question but has no idea what the answer is or even how to ask it. Their question is entered into a UI of some sort, and the solution takes the question as typed and gets an embedding vector for it in the same manner you used to vectorize your known questions. With the embedding vector from the user’s question, query the vector database to see which of the known questions score closest to the user’s question vector. The query result from the vector database includes a similarity score, which can be used to isolate the top five hits (for example).
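The lookup described above can be sketched as a brute-force search: score every stored vector against the user's vector and keep the top few. Cosine similarity is used here as one common choice of score; the function names, the dict layout, and the tiny three-dimensional vectors are assumptions for illustration, and a dedicated vector database would do this step for you.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_matches(query_vector: list[float], records: list[dict], k: int = 5):
    """Return up to k (score, record) pairs, highest score first."""
    scored = [(cosine_similarity(query_vector, r["vector"]), r) for r in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up records with hand-written vectors, for illustration only.
records = [
    {"question": "How do I reset my VPN password?", "vector": [1.0, 0.0, 0.0]},
    {"question": "Where is the holiday calendar?", "vector": [0.0, 1.0, 0.0]},
    {"question": "How do I change my VPN login?", "vector": [0.9, 0.1, 0.0]},
]

results = top_matches([1.0, 0.0, 0.0], records, k=2)
```

Each entry in `results` pairs a score with the full record, so the answer text stored alongside the vector is available without a second lookup.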

The Recommended Answer

Among the closest matches, the one with the highest score is probably the best match, and because we planned ahead, that matching item contains the answer content and the original question text used to generate the matching vector. You now have all the content required to provide the user with the best answer to that user’s question. This may not be enough to make the experience ideal. You may also need to extract keywords from the user’s question and use them to filter the highest-scoring matches.
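One way the keyword filter might look is sketched below. The stopword list, the function names, and the sample matches are assumptions; a real system might ask an LLM to extract the keywords instead of using this simple word split. If no match shares a keyword with the question, the sketch falls back to the unfiltered list rather than returning nothing.

```python
import re

# A tiny, hypothetical stopword list; extend it for real use.
STOPWORDS = {"a", "an", "the", "how", "do", "i", "my", "is", "to", "what",
             "for", "of", "in", "on", "can", "me", "we"}


def extract_keywords(text: str) -> set[str]:
    """Lowercase the text, split it into words, and drop stopwords."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def filter_by_keywords(user_question: str, matches: list[dict]) -> list[dict]:
    """Keep matches that share at least one keyword with the question.

    `matches` is assumed to be sorted by score already, highest first.
    """
    wanted = extract_keywords(user_question)
    kept = [m for m in matches if wanted & set(m["keywords"])]
    return kept or matches  # fall back to the unfiltered list


# Made-up top matches, as a vector database query might return them.
matches = [
    {"score": 0.91, "question": "How do I reset my email password?",
     "keywords": ["email", "mailbox"]},
    {"score": 0.88, "question": "How do I reset my VPN password?",
     "keywords": ["vpn", "password"]},
]

filtered = filter_by_keywords("How do I change my VPN password?", matches)
best = filtered[0]
```

Here the highest-scoring match is about email, but the keyword filter promotes the VPN entry because it is the only one that shares keywords with the user's question.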

Embedding Advantages

Embeddings have many advantages, including a setup that is easier than fine-tuning a model. More importantly, this approach is very easy to update or make more precise by simply re-vectorizing with new information. It is also financially practical because embeddings are about 1/75th as costly as other GPT inference processes. Building infrastructure to create and manage AI solutions based on embeddings is also relatively simple because you’re managing the solution like you would any content management or data management process.

Embedding Disadvantages

Vectors are not the easiest elements to manage or match in a query process. While it is possible to perform the cosine-similarity math in Python data structures and even relational databases, these approaches are specialized and slightly more complex than simply querying for keys in an index. As such, I recommend a little reading about Pinecone or Weaviate, databases designed to simplify vector storage and queries. Another disadvantage of embeddings is inference precision. I’m not an expert in this field of AI per se, but tuning an embedding system to deliver high-confidence results for users requires some effort. This is why I recommend at least considering other GPT services to extract keywords, summaries, and even entities, both to enhance the data set used for creating the embeddings and to filter the best answers from the vector database.
