Preparing the dataset for embeddings

klcogluberk · March 21, 2023, 9:55am

I’m developing a question-and-answer bot to help customers. I am using embeddings from the text-embedding-ada-002 model and storing them in Pinecone. I will prepare a dataset of frequently asked questions. What should I pay attention to when preparing this dataset, and how should I index it?

dominiconorton · March 21, 2023, 10:09am

Process:

Break the document into chunks (Embeddings have token limits)
Create the embedding with OpenAI
Store data in vector database
Create an application to query data
Create embeddings for queries in real-time
Display response data

The most challenging part is preparing the data. My advice is to pay attention to token limits when creating chunks.
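
As a rough sketch of the chunking step: the function below splits text into overlapping chunks by whitespace-separated words. The word count is only a stand-in for the model's real token count (an assumption; a proper tokenizer would be more accurate), and the names `chunk_words`, `max_words`, and `overlap` are hypothetical.

```python
def chunk_words(text, max_words=200, overlap=20):
    """Split text into chunks of at most max_words words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    boundary still appears whole in one of the chunks.
    NOTE: words are a rough proxy for tokens, not an exact count.
    """
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")

    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once this chunk has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

For a short FAQ entry this usually returns a single chunk, which is probably what you want: one question and its answer per embedding.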

klcogluberk · March 21, 2023, 10:23am

Actually, my question was about preparing the data. What can I do to index the data properly and prepare it well?

linus · March 21, 2023, 11:18am

Think of sensible ways to split questions and replies. Consider cases where a user asks the service desk two questions and gets two replies, and split those so the bot does not give false answers. For example, if customers commonly request account set-up and the creation of a new e-mail account in the same message, you might get mixed results when a user asks only about the first.

I’d also like to recommend the following video, which shows and explains the process. It really helped me a lot, so it might help you too: 5. OpenAI Embeddings API - Searching Financial Documents - YouTube

klcogluberk · March 21, 2023, 11:28am

Actually, I have a template in mind like this: [title, category, content]. I will use the title and category when indexing. Can you think of anything to add to this?
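
One possible way to lay out a [title, category, content] entry before embedding it is sketched below. This is an illustration under assumptions, not a required format: the helper name `make_record`, the choice to embed the title together with the content, and the hashed id are all hypothetical. Keeping title and category as metadata would let you filter by category at query time if your vector database supports metadata filters.

```python
import hashlib


def make_record(title, category, content):
    """Build one index record from a [title, category, content] entry."""
    # Text that would be sent to the embedding model (assumed layout).
    text = f"{title}\n{content}"
    # Deterministic id, so re-indexing the same entry overwrites the old one.
    key = f"{category}|{title}".encode("utf-8")
    record_id = hashlib.sha1(key).hexdigest()[:16]
    return {
        "id": record_id,
        "text": text,
        "metadata": {"title": title, "category": category, "content": content},
    }


record = make_record("Reset password", "Account", "Open Settings and choose Reset.")
```

The embedding vector itself is left out here; it would be computed from `record["text"]` and stored alongside the id and metadata.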

linus · March 21, 2023, 4:31pm

Oh, actually one further thing came to my mind: try to filter out all the pleasantries exchanged before the actual question is posed. So if someone asks “Hello Servicedesk, can you help me with the following issue:”, try to take this out, because if someone uses a similar phrase when making a request with your tool, you might get a higher cosine distance when this should not be the case!

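Edit: That should read “lower cosine distance”. The shared greeting makes two otherwise unrelated requests look more similar than they are.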
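
A minimal sketch of this clean-up step, assuming a small hand-written list of greeting patterns (the patterns and the names `BOILERPLATE` and `strip_pleasantries` are hypothetical and would need tuning to your own data). The `cosine_distance` helper is included only to show what is being compared: 1 minus the cosine similarity, so 0 means the two vectors point the same way.

```python
import math
import re

# Hypothetical patterns for opening pleasantries; extend for your own data.
BOILERPLATE = [
    r"^\s*(hello|hi|dear)\b[^,.!:]*[,.!:]\s*",
    r"^\s*can you help me with the following issue\s*[:.]?\s*",
]


def strip_pleasantries(text):
    """Remove leading greeting phrases so only the actual request remains."""
    cleaned = text
    for pattern in BOILERPLATE:
        cleaned = re.sub(pattern, "", cleaned, flags=re.IGNORECASE)
    return cleaned.strip()


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


request = "Hello Servicedesk, can you help me with the following issue: I cannot log in."
cleaned_request = strip_pleasantries(request)
```

Applying the same clean-up to both the stored questions and the incoming query keeps the comparison focused on the actual problem text.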

klcogluberk · March 22, 2023, 11:52am

I’m already using a structure like this: “Answer the question as accurately as possible in the context below:
Context: <embedding_content>”. Isn’t that doing the same thing you are describing, or did I get it wrong?

linus · March 22, 2023, 12:26pm

Ah okay, if you only have the content, that’s great. What I meant was regarding the training data and how to prep it.

I meant that if you have a service desk request, some people might say “Hello, I have an issue and would like your help”. Take this part out of the training data, as it would not be relevant for the answer.

klcogluberk · March 22, 2023, 12:58pm

Oh, I get it.
The data will generally be clean because I’m creating it myself. Actually, that’s why I asked what I should pay attention to when creating the dataset. For example, should I keep the content of the embeddings long or short, and how should I index it?

Tetramatrix · August 18, 2023, 8:07am

I also started a chatbot for a FAQ document.

My prompt: Answer the question based on the context below, and if the question can’t be answered based on the context, say "I don’t know."\n\nContext: {context}\n\n---\n\nQuestion: {question}\nAnswer:

context is where I insert my embeddings.

My embedding is a combined string: Title:…+Category:…+Question:…+Answer:…
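
A sketch of how such a combined string and the prompt could be assembled. Several details are assumptions: the fields are joined with newlines (the post does not say which separator is used), the separator in the prompt is taken to be `---`, and the names `faq_to_text` and `build_prompt` are hypothetical.

```python
PROMPT_TEMPLATE = (
    "Answer the question based on the context below, and if the question "
    "can't be answered based on the context, say \"I don't know.\"\n\n"
    "Context: {context}\n\n---\n\nQuestion: {question}\nAnswer:"
)

FIELDS = ("Title", "Category", "Question", "Answer")


def faq_to_text(entry):
    """Combine one FAQ entry into the single string that gets embedded."""
    return "\n".join(f"{field}: {entry[field.lower()]}" for field in FIELDS)


def build_prompt(retrieved_entries, question):
    """Fill the template with the text of the retrieved FAQ entries."""
    context = "\n\n".join(faq_to_text(entry) for entry in retrieved_entries)
    return PROMPT_TEMPLATE.format(context=context, question=question)


entry = {
    "title": "Reset password",
    "category": "Account",
    "question": "How do I reset my password?",
    "answer": "Open Settings and choose Reset password.",
}
prompt = build_prompt([entry], "I forgot my password, what now?")
```

Note that the context holds the text of the retrieved entries, not the embedding vectors; the vectors are only used to decide which entries to retrieve.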

I tried all available models, but only davinci (the most expensive) is able to answer “I don’t know” if the question is totally unrelated.

Thus it’s working, but I’m not very satisfied. The answers are exact matches of the FAQ. But maybe there is room for improvement? I have also only tried a few questions, mostly from the training data, not real-world examples of customer questions to the helpdesk.
