The length of the embedding contents

How long should the embedding contents be? How do I decide on this length? I’m preparing the dataset for embeddings. Do you have any suggestions about preparing the dataset and its contents?

2 Likes

Optimize for the smallest size without losing context.

Don’t just create chunks; each chunk should include context that might otherwise be lost.

You’ll want to plan on testing your strategy and adjusting the inputs based on results. So if you are embedding a large corpus, you might want to start with a smaller sample for testing.
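For example, a rough token-based chunking sketch (the tokenizer name and chunk size below are just assumptions to illustrate the idea, not recommendations):

import tiktoken

# Hypothetical target size; tune it against your own test sample.
MAX_TOKENS_PER_CHUNK = 400

def chunk_text(text, max_tokens=MAX_TOKENS_PER_CHUNK):
    """Split text into chunks of at most max_tokens tokens each."""
    enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by text-embedding-ada-002
    tokens = enc.encode(text)
    return [
        enc.decode(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]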

3 Likes

I don’t quite understand what you mean here. Can you explain a little more, please?

It can be complicated. I plan on publishing more about embedding strategies on my blog. But it’s a little more than I can share right now.

1 Like

Thanks. Based on your recommendation, I will proceed as follows: I will divide the text into sections and keep each one short (for example, a minimum of 40 tokens) so as not to lose context.

As you’re parsing, you’ll want to keep track of context that can be added to each chunk. Good luck!

@wfhbrian is absolutely right here. The problem with chunking is that each chunk loses context that might be relevant for retrieving it later. This context can be classified as global context (context that refers to the whole document the chunk belongs to) or local context (context that sits in the previous/next chunks but is still relevant to the current one). Let me give you an actual example:

  • Imagine that you have the whole transcript of a YouTube video that you want to embed so you can ask questions about it and produce accurate answers. Let’s say that the transcript can be broken up into 10 chunks of 4,000 characters each, and that the name of the video is “How to use OpenAI’s API: my top-10 strategies”. Now assume you split the transcript in the naive way and get 10 chunks with no additional context. A user then asks the following question: “What is the third strategy mentioned in the YouTube video ‘How to use OpenAI’s API: my top-10 strategies’? How does it relate to the two previous ones?”

Let’s say that the answer to this question lies in your third chunk, in a fragment such as “OK, folks, now let’s go to the third one! One thing you need to consider is BLA BLA BLA...”, and in the first and second chunks (as the first and second strategies are thoroughly discussed there).

It is likely that your semantic search will not retrieve these chunks to answer the question and, therefore, you will not be able to respond accurately. This is because of two different problems:

  • Your chunks lack the global context of the document they belong to. Since you only embedded a portion of the transcript for each of them and did not include any document metadata, they do not know, in particular, that they belong to the YouTube video “How to use OpenAI’s API: my top-10 strategies”. So the semantic search will not pick up any similarity with the video title.

  • Your chunks lack the local context of the previous (and next) chunks. Your third chunk contains absolutely zero information about the previous chunks, so even if it were retrieved, it wouldn’t know how to compare the third strategy with the two previous ones.

There are a lot of different strategies to address these problems, and it’s a very relevant (and hard) research topic. Here is something that works very well for me as a starting point:

  • Consider prepending some meaningful global metadata to every chunk that you embed. Something like "Document's title: How to use OpenAI's API: my top-10 strategies.mp4. Document's author: <Youtuber_name>, etc..." will do the job. This way, you’re giving global context to all your embedded chunks.
  • Consider adding local context regarding the content of the previous chunks. For example, you can create a rolling-window summary of all the previous content and propagate it through the chunks, so every chunk starts with some additional info about the document so far. This is the tricky part, as the summary needs to be short enough while still retaining meaningful info about all the previous fragments. A rough sketch of both ideas follows this list.
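As a minimal sketch (the header format and the summarize helper are placeholders, not a prescribed implementation):

def build_contextual_chunks(chunks, doc_title, doc_author, summarize):
    """Prepend global metadata and a rolling summary of the previous content to each chunk.

    `summarize` is a placeholder for whatever summarization you use
    (e.g. a short completion call); it should keep the summary brief.
    """
    contextual_chunks = []
    rolling_summary = ""
    for chunk in chunks:
        header = (
            f"Document's title: {doc_title}. "
            f"Document's author: {doc_author}. "
            f"Summary of the previous content: {rolling_summary or 'N/A'}.\n"
        )
        contextual_chunks.append(header + chunk)
        # Update the rolling summary so the next chunk knows what came before it.
        rolling_summary = summarize(rolling_summary + "\n" + chunk)
    return contextual_chunks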

Hope that helps!! :slight_smile:

22 Likes

I designed my dataset like this: [‘title’, ‘subtitle’, ‘content’]. I use title and subtitle to index the embeddings, and I store content as metadata to retrieve the contents of the embedding. I use the embeddings to answer customers’ frequently asked questions. What else do you think could be added as metadata? I am grateful for your answer.

I’m afraid I don’t follow you @klcogluberk. Content as metadata? Content is the actual content of each chunk, isn’t it?

By metadata I mean any other contextual information that is not the content of the chunk that might still be relevant to retrieve this chunk. And this totally depends on your use case. We’d need more details on the use case to provide you more specific advice on this regard :slight_smile:

How are you structuring the metadata? Are you nesting any objects or lists? Or are you just simply placing the answer inside? Are there categories in your knowledge base? Or separate products / services?

You can use single-stage filtering to improve your results if the metadata is organized well. It’s not essential, and it may not even be useful for your case, but at the very least it can give you an idea of how to structure your metadata correctly in case you decide you want to start filtering the results.

You could also consider using sparse embeddings for stronger search results if there are keywords that you’d like to focus on. They work side by side with your current dense embeddings.

I’d say this is more valuable for product information, though. Maybe there’s a use case for you as well? Off the top of my head, it could become essential if there are similar answers across different categories, in which case prioritizing keywords would be huge.
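If it helps, here’s a minimal sketch of blending the two signals (the hand-rolled bag-of-words vector and the 0.7/0.3 weighting are purely illustrative; a real sparse model such as BM25 would replace sparse_vector):

import numpy as np
from collections import Counter

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def sparse_vector(text, vocabulary):
    # Very simple bag-of-words vector over a fixed keyword vocabulary.
    counts = Counter(text.lower().split())
    return np.array([counts[word] for word in vocabulary], dtype=float)

def hybrid_score(query_dense, doc_dense, query_sparse, doc_sparse, alpha=0.7):
    # Weighted blend of dense (semantic) and sparse (keyword) similarity.
    return alpha * cosine(query_dense, doc_dense) + (1 - alpha) * cosine(query_sparse, doc_sparse)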

3 Likes

I would like to give you a complete summary of the use case.

1- I created my dataset myself with the company’s frequently asked questions.

2- My dataset consists of 3 columns: [‘title’, ‘subtitle’, ‘content’]
title: the general topic (e.g. Consultant Training)
subtitle: the sub-topic (e.g. course fee)
content: an information document about the subject (e.g. “Consultant trainings start in the summer. Trainings are free, bla bla bla.”)

3-I use title and subtitle to index embedding, and I store the content as meta_data to get to the content of the embed. I use the content as the text of the prompt context.

A simple use case:
User question: “Is counseling training paid?”

The user question is sent to the vector database, and the embedding with the highest similarity score is retrieved (similarity is calculated using cosine similarity).

The user question is then sent with the following prompt to the ChatCompletion API:
“Answer the question as accurately as possible according to the context below.
Context: <embedding[‘content’]>
Q: <user question>
A:”

Here the context text is the content in my dataset.
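In code, that flow looks roughly like this (the vectordb client and its query method are placeholders for whatever your vector database exposes, and the chat model name is just an example):

import openai

def answer_question(question, vectordb):
    # Embed the user question with the same model used for the dataset.
    q_embedding = openai.Embedding.create(
        model="text-embedding-ada-002", input=question
    )["data"][0]["embedding"]

    # Retrieve the best-matching record by cosine similarity (placeholder API).
    best_match = vectordb.query(vector=q_embedding, top_k=1)
    context = best_match["metadata"]  # the stored 'content' column

    prompt = (
        "Answer the question as accurately as possible according to the context below.\n"
        f"Context: {context}\n"
        f"Q: {question}\nA:"
    )
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo", messages=[{"role": "user", "content": prompt}]
    )
    return completion["choices"][0]["message"]["content"]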

What I don’t understand is point 3. What do you mean by “I use title and subtitle to index embedding, and I store the content as meta_data to get to the content of the embed. I use the content as the text of the prompt context.”?

What does your actual call to the embedding endpoint look like? Can you share it with us?

If I were you, I’d definitely incorporate “title” and “subtitle” as global context of each chunk prior to embedding it. I’d incorporate some other global metadata as well (timestamps? the author of each document? a short summary of the whole doc, or at least key entities extracted via NER?).

The local context is a little bit more tricky, so I’d probably leave it for later :slight_smile:.
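For the key-entities idea, a small sketch with spaCy (the model name and the cap on the number of entities are just examples):

import spacy

nlp = spacy.load("en_core_web_sm")  # any NER-capable model works here

def key_entities(document_text, max_entities=10):
    # Extract a short, de-duplicated list of named entities to use as global metadata.
    doc = nlp(document_text)
    seen = []
    for ent in doc.ents:
        if ent.text not in seen:
            seen.append(ent.text)
    return ", ".join(seen[:max_entities])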

I believe they are embedding only the title and subtitle.

@klcogluberk You should be embedding the title, subtitle, and the content all together for semantic relevance.

The topic, sub-topic, and tags could also be embedded as sparse embeddings using a bag-of-words model, which is close to what you’re already trying to do, and that’s perfectly fine. As I mentioned, it makes sense to apply different weights to different keywords. You can also use these as filters if you want.

Oh gotcha. Yeah, in this case you want to embed the content as well, at least :slight_smile:.

I agree, sparse embeddings make sense in this context, but I’d go for a simpler approach first and just embed all the relevant content, including the chunk’s text + the chunk’s metadata. You can always get more sophisticated from there: weight different embeddings, include filters, or even play with different vector spaces and projections, compute different distances and weight them into a single similarity metric, etc. But yeah, sparse embeddings + filters from the beginning makes perfect sense to me as well!

For sure. Perfecting a single method first is much more efficient than being halfway into multiple methods.

1 Like

My call to the embedding endpoint, and how I store the result in the vector DB:

import openai

response = openai.Embedding.create(model="text-embedding-ada-002", input=dataset["content"])
vector["id"] = dataset["title"] + dataset["subtitle"]
vector["value"] = response["data"][0]["embedding"]
vector["metadata"] = dataset["content"]

vectordb.insert(vector=[vector["id"], vector["value"], vector["metadata"]])

It seems that you’re only embedding the content of each chunk. As stated before, you should definitely consider embedding other stuff, such as your global metadata (title, subtitle and others)

Actually, it makes sense to me to embed everything, but it was the example in this OpenAI cookbook that confused me: they didn’t include the indexed columns in the embedding. If you take a quick look, you’ll see.

Do you think this approach of including all the metadata in the embedded text when creating the embeddings is right:
text = df["title"] + " " + df["subtitle"] + " " + df["content"]
openai.Embedding.create(input=text, model="text-embedding-ada-002")

It’s better, but try to give some context about the metadata that you’re using. Something like:

text = f'Document title: {df["title"]}. Document subtitle: {df["subtitle"]}. Chunk content: {df["content"]}.'
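Applied per row and combined with the embedding call, a rough sketch could look like this (assuming df is the pandas DataFrame from your snippet):

import openai

def embed_row(row):
    # Give the model some context about what each field is before embedding.
    text = (
        f"Document title: {row['title']}. "
        f"Document subtitle: {row['subtitle']}. "
        f"Chunk content: {row['content']}."
    )
    response = openai.Embedding.create(input=text, model="text-embedding-ada-002")
    return response["data"][0]["embedding"]

df["embedding"] = df.apply(embed_row, axis=1)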

3 Likes