Embeddings produce the highest vector comparison score when two documents or texts are most similar to each other.
Here is a typical embeddings scenario for chatbot knowledge augmentation:
- Break the documentation into chunks. Use some logic to keep the pieces consistent, such as grouping multiple paragraphs, splitting on distinct sections, or adding some overlap between larger pieces.
- Run embeddings on each chunk of documentation data, and store the returned vector along with the data.
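The two steps above can be sketched in a few lines. Note that `embed` here is a toy bag-of-words stand-in for a real embeddings API call, and the vocabulary, chunk size, and overlap values are illustrative assumptions, not recommendations:

```python
def embed(text):
    # Toy embedding: word counts over a tiny fixed vocabulary.
    # A real system would call an embeddings model here instead.
    vocab = ["api", "key", "rate", "limit", "model", "token"]
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def chunk(text, size=200, overlap=50):
    # Split text into character windows with some overlap between pieces.
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        piece = text[start:start + size]
        if piece:
            chunks.append(piece)
        if start + size >= len(text):
            break
    return chunks

doc = "..."  # your documentation text
# Store each chunk's vector alongside the chunk itself.
store = [{"text": c, "vector": embed(c)} for c in chunk(doc)]
```

Storing the text next to its vector matters: at retrieval time you search by vector but insert the original text into the conversation.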
Then, at question time:
- Take the user's question as input (or better, include a few turns of recent conversation as well) and run embeddings to obtain a vector.
- Run a similarity match between the question's embeddings vector and all the documentation vectors.
- Put the top results into the AI’s conversation history before the most recent question. An inserted prefix such as “Here is database retrieval to help AI answer user’s question:” can help the AI understand what the text is for.
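The retrieval steps above might look like this sketch. The vectors are plain lists here (a real system would get them from an embeddings model), and the `store` format and message roles are assumptions for illustration:

```python
import math

def cosine(a, b):
    # Cosine similarity: the usual comparison metric for embeddings vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_chunks(question_vec, store, k=3):
    # store: list of {"text": ..., "vector": ...} entries.
    ranked = sorted(store, key=lambda c: cosine(question_vec, c["vector"]),
                    reverse=True)
    return ranked[:k]

def build_messages(history, question, retrieved):
    # Splice retrieved chunks in before the most recent user question.
    context = "Here is database retrieval to help AI answer user's question:\n"
    context += "\n---\n".join(c["text"] for c in retrieved)
    return history + [{"role": "system", "content": context},
                      {"role": "user", "content": question}]
```

For a large documentation set you would use a vector database rather than sorting every vector per query, but the logic is the same.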
Problems:
- Since the question a user typed doesn’t look much like a chunk of documentation, the reported similarity score may not be as high as one would wish.
- Questions aren’t similar to answers.
Innovative mitigations, using an intermediate form of the data:
- Have the AI create some typical questions about each chunk, and build an embeddings vector that includes those questions too.
- Have the AI create a preliminary answer from what it already knows, and run the semantic search on the question plus that trial answer.
- Have the AI rewrite or summarize the chunks and the key points within them, to improve matching.
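The second mitigation above can be sketched as follows. The `draft_answer` function stands in for a chat-model call, and the hardcoded vectors stand in for embeddings calls; blending by simple weighted average is one reasonable choice, not the only one:

```python
def draft_answer(question):
    # Placeholder for a model call like "answer briefly from what you know".
    # Hardcoded here so the sketch runs on its own.
    return "To rotate an API key, open the dashboard and revoke the old key."

def blend(vec_a, vec_b, weight=0.5):
    # Weighted average of the question vector and the draft-answer vector.
    # The combined vector looks more like documentation than the bare question.
    return [weight * a + (1 - weight) * b for a, b in zip(vec_a, vec_b)]

q_vec = [1.0, 0.0]                 # embed(question), toy values
a_vec = [0.0, 1.0]                 # embed(draft_answer(question)), toy values
search_vec = blend(q_vec, a_vec)   # use this vector for the similarity match
```

The intuition: a drafted answer, even a wrong one, reads like documentation prose, so its vector lands closer to the right chunks than the question's vector does.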
“Tags” aren’t really a built-in feature, but you can try your own techniques that extract categories, meanings, and topics to further improve the similarity scores.
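One way to try that idea: extract simple topic keywords from a chunk and append them to the text before embedding, so a question and a chunk share more vocabulary. The stopword list and the keyword rule here are illustrative assumptions, and a model-generated topic list would likely do better:

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "is", "in", "for"}

def extract_topics(text, n=5):
    # Crude keyword extraction: most frequent non-stopword words.
    words = [w.strip(".,").lower() for w in text.split()]
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 3)
    return [w for w, _ in counts.most_common(n)]

def enrich(chunk_text):
    # Append a topics line to the chunk before embedding it.
    topics = extract_topics(chunk_text)
    return chunk_text + "\nTopics: " + ", ".join(topics)
```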