How do you tag data correctly?

Embeddings produce the highest vector-comparison score when two documents or passages of text are most similar to each other.

Here is a typical embeddings scenario for chatbot knowledge augmentation:

  • Break documentation into chunks. Use some logic to keep the parts consistent, such as grouping multiple paragraphs, splitting on distinct sections, or adding some overlap between larger pieces.
  • Run embeddings on each chunk of documentation data, and store the returned vector along with the data.
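
A minimal indexing sketch of those two steps, assuming the OpenAI Python client; the model name, chunk sizes, the `docs.txt` path, and the in-memory list used as a "vector store" are all just placeholder choices:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chunk_text(text: str, max_chars: int = 1500, overlap: int = 200) -> list[str]:
    """Split documentation into overlapping chunks of roughly max_chars characters."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap
    return chunks

def embed(text: str) -> list[float]:
    """Return the embedding vector for one piece of text."""
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return response.data[0].embedding

# Index: store each chunk alongside its vector.
documentation = open("docs.txt").read()
index = [{"text": chunk, "vector": embed(chunk)} for chunk in chunk_text(documentation)]
```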

Then:

  • Take the user's question as input, or better, include a few recent turns of the conversation as well, and run embeddings to obtain a vector.
  • Run a similarity match between the question's embedding vector and all of the documentation vectors.
  • Put the top results into the AI's conversation history before the most recent question. An inserted prefix such as "Here is database retrieval to help AI answer user's question:" can help the AI understand what it is looking at.
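
The retrieval side can then be a simple cosine-similarity scan over the stored vectors. This sketch reuses the `embed` helper and `index` list from above; the `top_k` value, the example question, and the exact prefix wording are only illustrative:

```python
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(question: str, index: list[dict], top_k: int = 3) -> list[str]:
    """Rank all documentation chunks against the question embedding and return the best matches."""
    query_vector = embed(question)
    scored = sorted(index,
                    key=lambda item: cosine_similarity(query_vector, item["vector"]),
                    reverse=True)
    return [item["text"] for item in scored[:top_k]]

# Insert the top results into the conversation before the most recent user question.
question = "How do I rotate my API key?"
context = "\n\n".join(retrieve(question, index))
messages = [
    {"role": "system", "content": "You are a helpful support assistant."},
    {"role": "system", "content": "Here is database retrieval to help AI answer user's question:\n" + context},
    {"role": "user", "content": question},
]
```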

Problems:

  • Since a question the user typed doesn't look much like a chunk of documentation, the reported similarity score may not be as high as one would like.
  • Questions aren't very similar to answers.

Innovative mitigations, using an intermediate form of the data:

  • Have the AI create some typical questions about each chunk, and create an embeddings vector that includes those questions too.
  • Have an AI create a preliminary answer from what it already knows, and do a semantic search that also includes that trial answer.
  • Have the AI rewrite or summarize the chunks and the key points within them, to improve the matching.
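
A rough sketch of the first two mitigations, again reusing `client`, `embed`, `index`, and `cosine_similarity` from the earlier sketches; the prompts and the model name are placeholders, not anything official:

```python
def generate_questions(chunk: str, n: int = 3) -> str:
    """Ask the model for typical questions this chunk could answer."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Write {n} short questions a user might ask that this text answers:\n\n{chunk}"}],
    )
    return response.choices[0].message.content

# Mitigation 1: embed each chunk together with questions it could answer.
enriched_index = [
    {"text": item["text"],
     "vector": embed(item["text"] + "\n\n" + generate_questions(item["text"]))}
    for item in index
]

# Mitigation 2: embed a trial answer alongside the raw question before searching.
def retrieve_with_trial_answer(question: str, index: list[dict], top_k: int = 3) -> list[str]:
    trial = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Give a brief best-guess answer:\n\n{question}"}],
    ).choices[0].message.content
    query_vector = embed(question + "\n\n" + trial)
    scored = sorted(index,
                    key=lambda item: cosine_similarity(query_vector, item["vector"]),
                    reverse=True)
    return [item["text"] for item in scored[:top_k]]
```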

“Tags” aren’t really a thing, but you can try your own techniques that extract categories, meanings, and topics to again improve the similarity scores.
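
For example, a home-grown “tagging” pass could prepend extracted topic keywords to each chunk before it is embedded. This is only a hypothetical sketch built on the earlier helpers, not a library feature:

```python
def extract_topics(chunk: str) -> str:
    """Ask the model for a short comma-separated list of topic keywords for this chunk."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": "List 3-5 topic keywords, comma separated, for this text:\n\n" + chunk}],
    )
    return response.choices[0].message.content

# Prepend the topics so they influence the chunk's embedding.
tagged_index = [
    {"text": item["text"],
     "vector": embed("Topics: " + extract_topics(item["text"]) + "\n\n" + item["text"])}
    for item in index
]
```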
