How do you tag data correctly?

Embeddings produce the highest vector-comparison score when two documents or passages of text are most similar to each other.

Here is a typical embeddings scenario for chatbot knowledge augmentation:

  • Break documentation into chunks. Use some logic to keep the parts consistent, such as grouping multiple paragraphs, splitting on distinct sections, or adding some overlap between larger pieces.
  • Run embeddings on each chunk of documentation data, and store the returned vector along with the data.
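
A minimal indexing sketch of those two steps, assuming the OpenAI Python client; the model name, chunk sizes, the `docs.txt` path, and the in-memory list used as a "vector store" are all just placeholder choices:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chunk_text(text: str, max_chars: int = 1500, overlap: int = 200) -> list[str]:
    """Split documentation into overlapping chunks of roughly max_chars characters."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap
    return chunks

def embed(text: str) -> list[float]:
    """Return the embedding vector for one piece of text."""
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return response.data[0].embedding

# Index: store each chunk alongside its vector.
documentation = open("docs.txt").read()
index = [{"text": chunk, "vector": embed(chunk)} for chunk in chunk_text(documentation)]
```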

Then:

  • Take the user's question as input, or better, include a few recent turns of the conversation as well, and run embeddings to obtain a vector.
  • Run a similarity match between the question's embedding vector and all of the documentation vectors.
  • Put the top results into the AI's conversation history before the most recent question. An inserted prefix such as "Here is database retrieval to help AI answer user's question:" can help the AI understand what it is looking at.
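
The retrieval side can then be a simple cosine-similarity scan over the stored vectors. This sketch reuses the `embed` helper and `index` list from above; the `top_k` value, the example question, and the exact prefix wording are only illustrative:

```python
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(question: str, index: list[dict], top_k: int = 3) -> list[str]:
    """Rank all documentation chunks against the question embedding and return the best matches."""
    query_vector = embed(question)
    scored = sorted(index,
                    key=lambda item: cosine_similarity(query_vector, item["vector"]),
                    reverse=True)
    return [item["text"] for item in scored[:top_k]]

# Insert the top results into the conversation before the most recent user question.
question = "How do I rotate my API key?"
context = "\n\n".join(retrieve(question, index))
messages = [
    {"role": "system", "content": "You are a helpful support assistant."},
    {"role": "system", "content": "Here is database retrieval to help AI answer user's question:\n" + context},
    {"role": "user", "content": question},
]
```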

Problems:

  • Since a question the user typed doesn't look much like a chunk of documentation, the reported similarity score may not be as high as one would like.
  • Questions aren't very similar to answers.

Innovative mitigations, using an intermediate form of the data:

  • Have the AI create some typical questions about each chunk, and create an embeddings vector that includes those questions too.
  • Have an AI create a preliminary answer from what it already knows, and do a semantic search that also includes that trial answer.
  • Have the AI rewrite or summarize the chunks and the key points within them, to improve the matching.
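
A rough sketch of the first two mitigations, again reusing `client`, `embed`, `index`, and `cosine_similarity` from the earlier sketches; the prompts and the model name are placeholders, not anything official:

```python
def generate_questions(chunk: str, n: int = 3) -> str:
    """Ask the model for typical questions this chunk could answer."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Write {n} short questions a user might ask that this text answers:\n\n{chunk}"}],
    )
    return response.choices[0].message.content

# Mitigation 1: embed each chunk together with questions it could answer.
enriched_index = [
    {"text": item["text"],
     "vector": embed(item["text"] + "\n\n" + generate_questions(item["text"]))}
    for item in index
]

# Mitigation 2: embed a trial answer alongside the raw question before searching.
def retrieve_with_trial_answer(question: str, index: list[dict], top_k: int = 3) -> list[str]:
    trial = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Give a brief best-guess answer:\n\n{question}"}],
    ).choices[0].message.content
    query_vector = embed(question + "\n\n" + trial)
    scored = sorted(index,
                    key=lambda item: cosine_similarity(query_vector, item["vector"]),
                    reverse=True)
    return [item["text"] for item in scored[:top_k]]
```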

“Tags” aren’t really a thing, but you can try your own techniques that extract categories, meanings, and topics to again improve the similarity scores.
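
For example, a home-grown “tagging” pass could prepend extracted topic keywords to each chunk before it is embedded. This is only a hypothetical sketch built on the earlier helpers, not a library feature:

```python
def extract_topics(chunk: str) -> str:
    """Ask the model for a short comma-separated list of topic keywords for this chunk."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": "List 3-5 topic keywords, comma separated, for this text:\n\n" + chunk}],
    )
    return response.choices[0].message.content

# Prepend the topics so they influence the chunk's embedding.
tagged_index = [
    {"text": item["text"],
     "vector": embed("Topics: " + extract_topics(item["text"]) + "\n\n" + item["text"])}
    for item in index
]
```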
