Preparing the dataset for embeddings

I’m developing a question-and-answer bot to help customers. I’m using embeddings from the text-embedding-ada-002 model and storing them in a Pinecone database. I will prepare a dataset of frequently asked questions. What should I pay attention to when preparing this dataset, and how should I index it?


  1. Break the document into chunks (Embeddings have token limits)
  2. Create the embedding with OpenAI
  3. Store data in vector database
  4. Create an application to query data
  5. Create embeddings for queries in real-time
  6. Display response data
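The six steps above can be sketched end to end. This is a minimal illustration, not production code: the OpenAI embeddings call and the Pinecone index are replaced by stand-ins (a deterministic hash-based `embed` function and an in-memory `VectorStore` class, both invented here) so the flow is visible without API keys:

```python
import hashlib
import math

def embed(text: str, dim: int = 8) -> list[float]:
    # Stand-in for the OpenAI embeddings call: a deterministic
    # hash-based vector, NOT a real semantic embedding.
    h = hashlib.sha256(text.lower().encode()).digest()
    return [b / 255 for b in h[:dim]]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class VectorStore:
    """In-memory stand-in for a Pinecone index."""
    def __init__(self):
        self.vectors = {}  # id -> (embedding, metadata)

    def upsert(self, id_: str, embedding: list[float], metadata: dict):
        self.vectors[id_] = (embedding, metadata)

    def query(self, embedding: list[float], top_k: int = 1):
        scored = [
            (cosine_similarity(embedding, vec), meta)
            for vec, meta in self.vectors.values()
        ]
        scored.sort(key=lambda s: s[0], reverse=True)
        return scored[:top_k]

# Steps 1-3: chunk the FAQ, embed each chunk, store it.
store = VectorStore()
faqs = [
    ("faq-1", "How do I reset my password?"),
    ("faq-2", "How do I create a new e-mail account?"),
]
for id_, text in faqs:
    store.upsert(id_, embed(text), {"content": text})

# Steps 4-6: embed the user query in real time and display the top hit.
results = store.query(embed("How do I reset my password?"), top_k=1)
print(results[0][1]["content"])
```

In the real pipeline, `embed` would call the embeddings endpoint with text-embedding-ada-002 and `VectorStore` would be a Pinecone index; the chunk/embed/upsert/query flow stays the same.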

The most challenging part is preparing the data; my advice is to pay attention to token limits when creating chunks.
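On token limits: text-embedding-ada-002 accepts up to 8,191 tokens per input. Here's a rough chunking sketch that uses word count as a token proxy; in practice you would count real tokens with the `tiktoken` library, but the splitting logic is the same:

```python
def chunk_text(text: str, max_tokens: int = 500) -> list[str]:
    """Split text into chunks under a token budget.

    Word count is used as a rough token proxy here; with
    text-embedding-ada-002 you would count real tokens via
    tiktoken and stay well under its 8,191-token input limit.
    """
    words = text.split()
    chunks = []
    for i in range(0, len(words), max_tokens):
        chunks.append(" ".join(words[i:i + max_tokens]))
    return chunks

doc = ("word " * 1200).strip()
parts = chunk_text(doc, max_tokens=500)
print(len(parts))  # 3 chunks: 500 + 500 + 200 words
```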


Actually, my question was about preparing the data. What can I do to index the data properly and prepare it well?

Think of sensible ways to split Question and Reply, and think about cases where a user asks two questions in one Service Desk ticket and gets two replies; splitting those apart helps you avoid giving false answers. For example, if requests for account set-up and for creation of a new e-mail account commonly appear together, you might get mixed results when a user asks only the first question.
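One simple way to act on that advice, assuming you already know which answer belongs to which question in a ticket, is to split each multi-question ticket into one record per question/answer pair before embedding (this helper is illustrative, not from any library):

```python
def split_ticket(questions: list[str], answers: list[str]) -> list[dict]:
    """Turn a multi-question service-desk ticket into one record
    per question/answer pair, so each pair gets its own embedding
    and a single-question query can't pull in an unrelated answer."""
    if len(questions) != len(answers):
        raise ValueError("each question needs exactly one answer")
    return [{"question": q, "answer": a} for q, a in zip(questions, answers)]

records = split_ticket(
    ["How do I set up my account?", "How do I create a new e-mail account?"],
    ["Go to Settings > Accounts ...", "Open the mail admin panel ..."],
)
print(len(records))  # 2 separate records to embed
```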

I’d also like to recommend the following video, where the process is shown and explained. It really helped me out a lot, so it might help you too: 5. OpenAI Embeddings API - Searching Financial Documents - YouTube


Actually, I have a template in mind like this: [title, category, content]. I will use the title and category when indexing. Can you think of anything to add to this?
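One way to carry that [title, category, content] template into the index is to embed the content (optionally with the title prepended for extra signal) and keep title and category as metadata, since Pinecone supports metadata filters such as `{"category": {"$eq": ...}}` at query time. The record shape below is an assumption about how you might lay this out, not a required format:

```python
def build_record(id_: str, title: str, category: str, content: str) -> dict:
    """Shape a FAQ entry for upserting into a vector DB: the content
    (with the title prepended) is what gets embedded; title and
    category ride along as metadata so queries can be filtered
    (e.g. with Pinecone's {"$eq": ...} metadata filters)."""
    return {
        "id": id_,
        "text_to_embed": f"{title}\n{content}",
        "metadata": {"title": title, "category": category, "content": content},
    }

rec = build_record(
    "faq-42",
    "Password reset",
    "accounts",
    "Use the 'Forgot password' link on the login page.",
)
print(rec["metadata"]["category"])  # accounts
```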

Oh, actually, one further thing came to my mind: try to filter out all the words that are exchanged as a pleasantry before the actual question is posed. So if someone asks, “Hello Servicedesk, can you help me with the following issue:”, try to take this part out, because if someone uses a similar phrase when making a request with your tool, you might get a lower cosine distance when this should not be the case!
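Stripping pleasantries can be as simple as a few regular expressions run over each text before embedding. The patterns below are illustrative; you would extend the list with whatever greetings actually appear in your tickets:

```python
import re

# Illustrative greeting patterns; extend with the pleasantries
# that actually appear in your tickets.
GREETING_PATTERNS = [
    r"^\s*(hello|hi|hey|dear)\b[^,.!:]*[,.!:]\s*",
    r"^\s*can you help me with the following issue:?\s*",
    r"^\s*i have an issue and would like your help[,.]?\s*",
]

def strip_pleasantries(text: str) -> str:
    """Remove greeting boilerplate before embedding, so shared
    pleasantries don't make unrelated texts look artificially similar."""
    cleaned = text
    changed = True
    while changed:  # keep stripping until no pattern matches the start
        changed = False
        for pat in GREETING_PATTERNS:
            new = re.sub(pat, "", cleaned, flags=re.IGNORECASE)
            if new != cleaned:
                cleaned, changed = new, True
    return cleaned.strip()

print(strip_pleasantries(
    "Hello Servicedesk, can you help me with the following issue: "
    "my e-mail account is locked"
))  # my e-mail account is locked
```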

I’m already using a structure like this: “Answer the question as accurately as possible in the context below:
Context: <embedding_content>”. Isn’t that doing the same thing as what you’re describing, or did I get it wrong?
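For what it's worth, that prompt structure can be assembled from the retrieved chunks with a small helper. The joining format (a `---` separator and the trailing `Question:`/`Answer:` lines) is an assumption layered on top of the template quoted above:

```python
def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Fill the prompt template from the post: answer from the
    retrieved context, with the chunks joined into one block."""
    context = "\n---\n".join(retrieved_chunks)
    return (
        "Answer the question as accurately as possible "
        "in the context below:\n"
        f"Context: {context}\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_prompt(
    "How do I reset my password?",
    ["Use the 'Forgot password' link on the login page."],
)
print(prompt.splitlines()[1])  # the Context line
```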

Ah, okay, if you only have the content, that’s great. What I meant was regarding the training data and how to prep it :slight_smile:

I meant that if you have a service-desk request, some people might say “Hello, I have an issue and would like your help”; take this part out of the training data, as it would not be relevant to the answer.

Oh, I get it.
The data will be generally clean because I’m creating it myself. That’s actually why I asked what I should pay attention to when creating the dataset: for example, should I keep the embedded content long or short, and how should I index it?