How do you tag data correctly?

I’m trying to understand how to tag my data correctly in my embeddings database.

My goal is to make an insurance document interactive. I have managed to create a prompt and simplify the text enough for the model to actually work with it, because legal wording is painful, but now I’m facing response issues where it grabs random parts that don’t apply and applies them anyway.

Am I meant to create a vector of the tag itself, or of the text that the tag will point the GPT model to?


Embeddings will have the highest vector comparison score when there is the highest similarity between the two documents or texts being compared.

Here is a typical embeddings scenario for chatbot knowledge augmentation:

At indexing time:

  • Break documentation into chunks. Use some logic to keep the parts consistent, such as multiple paragraphs, distinct sections, or some overlap between larger pieces.
  • Run embeddings on each chunk of documentation data, and store the returned vector along with the chunk text.

At query time:

  • Take the user question as input, or better, a few turns of recent conversation too, and run embeddings to obtain a vector.
  • Run a similarity match between the question vector and all documentation vectors.
  • Put the top results into the AI’s conversation history before the most recent question. An inserted prefix such as “Here is database retrieval to help the AI answer the user’s question:” can help the AI understand.
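The query-time steps above can be sketched with plain cosine similarity. The two-dimensional toy vectors here stand in for real embeddings returned by an embeddings API:

```python
import math

def cosine(a, b):
    # cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_chunks(question_vec, chunk_store, k=3):
    # rank stored (vector, text) pairs against the question vector
    ranked = sorted(chunk_store, key=lambda c: cosine(question_vec, c[0]), reverse=True)
    return [text for _, text in ranked[:k]]

# toy store: in practice each vector comes from embedding the chunk text
store = [
    ([0.9, 0.1], "Cancellation: cooling-off period terms ..."),
    ([0.1, 0.9], "Baggage: coverage limits and exclusions ..."),
]
context = top_chunks([0.8, 0.2], store, k=1)  # the cancellation chunk ranks first
```

A real system would use a vector database index rather than a full scan, but the ranking idea is the same.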


The weakness of this naive approach:

  • Since a question the user typed doesn’t look much like a chunk of documentation, the reported similarity score may not be as high as one wishes.
  • Questions aren’t similar to answers.

Some innovative mitigations, using an intermediate form of the data:

  • Have AI create some typical questions about each chunk, and create an embeddings vector that includes the questions too.
  • Have an AI create a preliminary answer using what it knows, and do a semantic search that also includes the trial answer.
  • Have AI rewrite or summarize the chunks and key points within, to improve the matching.
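As a sketch of the first mitigation, the text sent to the embeddings endpoint could be the chunk plus its generated questions; `embedding_input` is a hypothetical helper, not a library function:

```python
def embedding_input(chunk_text, generated_questions):
    # combine a chunk with AI-generated questions so that question-style
    # user queries score higher against the chunk's vector
    question_block = "\n".join(f"- {q}" for q in generated_questions)
    return f"{chunk_text}\n\nQuestions this section answers:\n{question_block}"

text_to_embed = embedding_input(
    "Section 4: Cancellation within the cooling-off period ...",
    ["Can I cancel my policy?", "Is there a cooling-off period?"],
)
```

You store the vector of `text_to_embed` but still give the AI the original chunk text at answer time.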

“Tags” aren’t really a thing, but you can try your own techniques that extract categories and meanings and topics to again improve the similarity scores.


Oh, I get it now. Because I have several parts that relate to each other. For example:

“Cancellation” could mean three different parts of the policy, each with its own nuances. Could I include those in one combined chunk, also create separate specific chunks for each of those parts with their nuances, and then have a filter that removes any duplicate blocks of text before sending them to the model?

I answered my own question as I wrote it; never mind.
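For what it’s worth, the duplicate filter described there can be an order-preserving dedupe over the retrieved blocks; a minimal sketch:

```python
def dedupe_chunks(chunks):
    # drop repeated text blocks while preserving retrieval order
    seen = set()
    unique = []
    for chunk in chunks:
        key = chunk.strip()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique

retrieved = [
    "Cancellation: general terms",
    "Cancellation: refund nuances",
    "Cancellation: general terms",  # duplicate pulled in via the combined chunk
]
context = dedupe_chunks(retrieved)  # two unique blocks remain
```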

Nobody made a rule that you couldn’t have two different types of embeddings, or multiple AI queries to refine the data, and then add their similarity scores together. It just takes more money and more time to answer.

For example:

  • Have the AI take the recent conversation and produce a new question that is fully formed and contextual (avoiding fragments like “what about cancellations?”), then compare it to a database of example questions whose answer lies within each chunk.

while also:

  • Have the AI extract the topic and keywords from the recent conversation, then compare them to a database of topics and keywords for each chunk.

Meanwhile, the conversation management system also tracks contextual turns of conversation and eliminates obsolete topics from consideration.
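Adding the similarity scores from the two searches can be a simple weighted sum; the weights and scores below are illustrative, not tuned values:

```python
def fused_score(question_sim, keyword_sim, w_question=0.7, w_keyword=0.3):
    # combine the scores of the question-similarity search and the
    # topic/keyword search into one ranking value
    return w_question * question_sim + w_keyword * keyword_sim

# per-chunk scores coming back from the two separate searches
candidates = {
    "chunk_cancellation": fused_score(0.82, 0.61),
    "chunk_baggage": fused_score(0.40, 0.75),
}
best_chunk = max(candidates, key=candidates.get)  # "chunk_cancellation"
```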

Imagination and practicality are the limit.


Would I be better off creating a vector for each question individually and attaching it to the text it directly relates to? Or is lumping 25 individual questions together just as effective?

Lumping mixed data together would reduce the quality of matches for specific topics, and the data that you must then provide to the AI also has to be bigger. However, there is a point where the context of the text is lost if you reduce chunks too far.

It depends on the type of data. Imagine this for your CEO’s biography:

chunk_1: Mailhouse was wearing a Detroit Red Wings hockey sweater, and Reeves (an avid hockey fan and a keen player of the sport) asked if Mailhouse needed a goalie. As the two men formed a friendship, they began jamming together, and were joined by Gregg Miller as the original lead guitarist and singer in 1992.

chunk_2: Reeves was born in Beirut, Lebanon, on September 2, 1964, the son of Patricia (née Taylor), a costume designer and performer, and Samuel Nowlin Reeves Jr. His mother is English, originating from Essex.[10] His American father is from Hawaii, and is of Native Hawaiian, Chinese, English, Irish, and Portuguese descent.[5][11][12] His paternal grandmother is Chinese Hawaiian.[13] His mother was working in Beirut when she met his father,

chunk_3: He plays bass guitar for the band Dogstar and pursued other endeavours such as writing and philanthropy.

Semantic matching via an embeddings engine would give precise yet poor results on these chunks. “What celebrity was in the band Dogstar?” doesn’t return useful information, while “what Red Wings players are musicians?” or “costume designers in Hawaii” return similarity matches that are less than useful.

Data augmentation can thus be useful, both as additional data when creating the embedding and as extra data provided to the AI. The improvement would be substantial if each chunk had metadata such as “Keanu Reeves Biography part 18 - As a musician part 3 - keywords: dogstar, bass, bandmates”.
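A minimal sketch of that augmentation, assuming the metadata header is simply prepended to the chunk text before the embedding call:

```python
def with_metadata(chunk_text, doc_title, section, keywords):
    # prefix a chunk with a metadata header so topical queries match
    # even when the chunk text itself is oblique
    header = f"{doc_title} - {section} - keywords: {', '.join(keywords)}"
    return f"{header}\n{chunk_text}"

augmented = with_metadata(
    "He plays bass guitar for the band Dogstar ...",
    "Keanu Reeves Biography part 18",
    "As a musician part 3",
    ["dogstar", "bass", "bandmates"],
)
```

The same augmented text can also be what you hand to the AI, so it knows where the snippet came from.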


I had to get GPT to simplify what I’m saying here:

I have a script, and the idea is that table 1 will hold the text with its vector, while table 2 will hold the questions with their vectors, but the text column of table 2 is the same text as in table 1. I hope that makes sense. I was going to put all the questions into one vector, insert that vector into table 2 column 1, and put the text into column 2 of both tables. But from what you are saying, it is better to vectorize each question individually. This means table 2 column 2 will repeat the same text several times, but it will achieve a better outcome overall?

Scenario 1: Matching Text in Both Tables

  1. User Query: “Does the policy cover lost baggage?”
  2. Vector Search 1 - Policy Text: Finds a matching policy text snippet.
  3. Vector Search 2 - Relevant Questions: Finds a relevant question that matches the query.
  4. Result: Since both searches found matching text, the matched policy text snippet is directly used to generate a response.
  5. Response: The response is generated based on the matched policy text snippet and sent to the user.

Scenario 2: No Matching Text in Either Table

  1. User Query: “Does the policy cover lost baggage?”
  2. Vector Search 1 - Policy Text: Doesn’t find a matching policy text snippet.
  3. Vector Search 2 - Relevant Questions: Doesn’t find a relevant question that matches the query.
  4. Results: Since there are no direct matches, the two searches yield different partial results.
  5. Sending to ChatGPT Models: The results from both searches are sent to separate ChatGPT models.
  6. Responses Generation: ChatGPT models generate responses based on the separate results.
  7. Combining Responses: The generated responses from both models are combined into a single response to provide a comprehensive answer.
  8. Sending to Final ChatGPT Model: The combined response is sent to a final ChatGPT model along with instructions.
  9. Final Response Generation: The final ChatGPT model generates a response based on the combined response and instructions.
  10. Response: The response generated by the final ChatGPT model is sent to the user.

In both cases, the script aims to provide a relevant and coherent answer to the user’s query, whether by using directly matching text or by combining responses from different sources when direct matches are not found.
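The two scenarios amount to a threshold check over the two vector searches; a minimal sketch of the routing, where `MATCH_THRESHOLD` and the return shape are assumptions, not part of any particular library:

```python
MATCH_THRESHOLD = 0.8  # illustrative cutoff, not a tuned value

def route_query(policy_hit, question_hit):
    # each hit is a (similarity_score, text) pair from one vector search;
    # a strong match in both tables means the snippet is used directly,
    # otherwise both partial results go on to be combined downstream
    policy_score, policy_text = policy_hit
    question_score, question_text = question_hit
    if policy_score >= MATCH_THRESHOLD and question_score >= MATCH_THRESHOLD:
        return {"mode": "direct", "context": [policy_text]}
    return {"mode": "combine", "context": [policy_text, question_text]}

decision = route_query((0.91, "Baggage clause ..."), (0.88, "Q: lost baggage?"))
```

In the "combine" case, the context list is what the later model calls would receive for merging.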

Edit: I’m using Singlestore rather than a dedicated vector DB. I’m not a programmer, just learning as I go, and ChromaDB keeps throwing errors every time I use it.
