Preparing the dataset for embeddings

I’m developing a question-and-answer bot to help customers. I am using embeddings from the text-embedding-ada-002 model and storing them in Pinecone. I will prepare a dataset of frequently asked questions. What should I pay attention to when preparing this dataset, and how should I index it?

  1. Break the document into chunks (Embeddings have token limits)
  2. Create the embedding with OpenAI
  3. Store data in vector database
  4. Create an application to query data
  5. Create embeddings for queries in real-time
  6. Display response data
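
The six steps above can be sketched end to end. This is a minimal, self-contained illustration: the hash-based `embed` function is a stand-in for a call to the OpenAI embeddings endpoint with text-embedding-ada-002, and the in-memory `index` list is a stand-in for a Pinecone index (`upsert`/`query` mirror its operations but are hypothetical simplifications, not the Pinecone client API):

```python
import hashlib
import math

def embed(text: str, dims: int = 8) -> list[float]:
    # Stand-in for the real embeddings API call; it just hashes the text
    # into a small normalized vector so the example runs offline.
    digest = hashlib.sha256(text.lower().encode()).digest()
    vec = [b / 255 for b in digest[:dims]]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

index = []  # stand-in for a Pinecone index

def upsert(doc_id: str, text: str) -> None:
    # Step 2 + 3: create the embedding and store it with its id and text.
    index.append((doc_id, embed(text), text))

def query(question: str, top_k: int = 1):
    # Step 5 + 6: embed the query in real time and rank stored vectors
    # by dot product (equals cosine similarity for normalized vectors).
    q = embed(question)
    scored = [(sum(a * b for a, b in zip(q, v)), doc_id, text)
              for doc_id, v, text in index]
    return sorted(scored, reverse=True)[:top_k]

upsert("faq-1", "How do I reset my password?")
upsert("faq-2", "How do I create a new e-mail account?")
results = query("How do I reset my password?")
```

With a real embedding model, semantically similar (not just identical) questions would also rank highly; the control flow stays the same.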

The most challenging part is preparing the data. My advice is to pay attention to the model's token limit when creating chunks.
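
A minimal chunking sketch for that step, assuming a whitespace word count as a rough proxy for tokens (a real pipeline would count actual tokens, e.g. with tiktoken, since ada-002 accepts roughly 8K tokens per input):

```python
def chunk_text(text: str, max_tokens: int = 100, overlap: int = 20) -> list[str]:
    # Split into overlapping windows so sentences cut at a chunk boundary
    # still appear intact in the neighboring chunk.
    words = text.split()  # crude proxy for real tokenization
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
    return chunks

chunks = chunk_text("word " * 250, max_tokens=100, overlap=20)
```

For short, self-contained FAQ entries one chunk per Q/A pair is usually enough; chunking matters more for long documents.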


Actually, my question was about preparing the data. What can I do to index the data properly and prepare it well?

Think of sensible ways to split question and reply, and think about cases where the user asks two questions of the Service Desk and gets two replies, to avoid giving false answers. For example, if it is common to request both account set-up and creation of a new e-mail account in one ticket, you might get mixed results when a user later asks only the first question.
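
One naive way to split such combined tickets is to treat each question-mark-terminated sentence as its own question before embedding; `split_questions` here is a hypothetical helper and the regex is only a heuristic sketch:

```python
import re

def split_questions(ticket: str) -> list[str]:
    # Split after each '?' followed by whitespace, keeping only
    # the parts that actually end in a question mark.
    parts = re.split(r"(?<=\?)\s+", ticket.strip())
    return [p.strip() for p in parts if p.strip().endswith("?")]

qs = split_questions(
    "Can you set up my account? Also, can you create a new e-mail address?"
)
```

Each resulting question can then be embedded and answered separately instead of producing one blended result.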

I’d also like to recommend the following video, where the process is shown and explained. It really helped me out a lot, so it might help you too: 5. OpenAI Embeddings API - Searching Financial Documents - YouTube


Actually, I have a template in mind like this: [title, category, content]. I will use the title and category when indexing. Can you think of anything to add to this?

Oh, actually, one further thing came to mind: try to filter out all the words that are exchanged as a pleasantry before the actual question is posed. So if someone asks, “Hello Servicedesk, can you help me with the following issue:”, try to take that out, because if someone uses a similar phrase when making a request with your tool, you might get a higher cosine distance when this should not be the case!

Edit: Lower cosine distance
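
A minimal sketch of that filtering step, assuming the pleasantries follow a few common greeting patterns (`strip_pleasantries` and the regex are illustrative, not exhaustive):

```python
import re

# Matches a leading greeting like "Hello Servicedesk," optionally followed
# by boilerplate such as "can you help me with the following issue:".
GREETING = re.compile(
    r"^(hello|hi|dear)[^:,.!]*[:,.!]?\s*"
    r"(can you help me with( the following issue)?[:,]?\s*)?",
    re.IGNORECASE,
)

def strip_pleasantries(text: str) -> str:
    # Remove one leading greeting phrase before creating the embedding.
    return GREETING.sub("", text.strip(), count=1)

cleaned = strip_pleasantries(
    "Hello Servicedesk, can you help me with the following issue: my VPN drops."
)
```

Applying the same cleaning to both the indexed FAQ text and the incoming user query keeps the comparison consistent, so shared boilerplate no longer skews the similarity scores.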

I’m already using a structure like this: "Answer the question as accurately as possible in the context below:
Context: <embedding_content>". Isn’t that doing the same thing you mean, or did I get it wrong?

Ah okay, if you only have the content, that’s great. What I meant was regarding the training data and how to prep it :slight_smile:

I meant that in a service desk request, some people might say, “Hello, I have an issue and would like your help”; take that part out of the training data, as it is not relevant to the answer.

Oh, I get it.
The data will generally be clean because I’m creating it myself. Actually, that’s why I asked what I should pay attention to when creating the dataset: for example, should I keep the content of the embeddings long or short, and how should I index it?

I also started a chatbot for a FAQ document.

My prompt: Answer the question based on the context below, and if the question can’t be answered based on the context, say "I don’t know."\n\nContext: {context}\n\n---\n\nQuestion: {question}\nAnswer:

context is where I insert my embeddings.

My embedding input is a combined string: Title: … + Category: … + Question: … + Answer: …
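
Putting the two together, building the combined embedding string and filling the prompt template might look like this (the `faq` record and its field values are made-up sample data):

```python
faq = {
    "title": "Password reset",
    "category": "Accounts",
    "question": "How do I reset my password?",
    "answer": "Use the 'Forgot password' link on the login page.",
}

# Combined string that gets embedded: Title + Category + Question + Answer.
embedding_input = (
    f"Title: {faq['title']} "
    f"Category: {faq['category']} "
    f"Question: {faq['question']} "
    f"Answer: {faq['answer']}"
)

PROMPT = (
    "Answer the question based on the context below, and if the question "
    "can't be answered based on the context, say \"I don't know.\"\n\n"
    "Context: {context}\n\n---\n\nQuestion: {question}\nAnswer:"
)

# At query time, the retrieved FAQ text is pasted into the context slot.
prompt = PROMPT.format(context=embedding_input,
                       question="How can I change my password?")
```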

I tried all available models, but only davinci (the most expensive) is able to answer “I don’t know” if the question is totally unrelated.

Thus it’s working, but I’m not very satisfied. The answers are exact matches of the FAQ entries. Maybe there is room for improvement? I have also only tried a few questions, mostly from the training data, not real-world examples of customer questions to the helpdesk.