The length of the embedding contents

I’m afraid I don’t follow you @klcogluberk. Content as metadata? Content is the actual content of each chunk, isn’t it?

By metadata I mean any other contextual information that is not the content of the chunk that might still be relevant to retrieve this chunk. And this totally depends on your use case. We’d need more details on the use case to provide you more specific advice on this regard :slight_smile:

How are you structuring the metadata? Are you nesting any objects or lists? Or are you just simply placing the answer inside? Are there categories in your knowledge base? Or separate products / services?

You can use single-stage filtering to improve your results if the metadata is organized well. It’s not essential. It may not even be useful for your case. At the very least it can give you an idea on how to structure your metadata correctly in the event that you decide that you want to start filtering the results.

You could consider using sparse embeddings as well for stronger search results if there are keywords that you’d like to focus on. They work side by side with your current dense embeddings.

I’d say it’s more valuable for product information, though. Maybe there’s a use case for you as well? Off the top of my head, it could become essential if there are similar answers across different categories, where prioritizing keywords would be huge.


I would like to give you a complete summary of the use case.

1- I created my dataset myself with the company’s frequently asked questions.

2- My dataset consists of 3 columns: ['title', 'subtitle', 'content']
title: General topic (e.g. Consultant Training)
subtitle: Sub-topic (e.g. course fee)
content: Information document about the subject (e.g. Consultant trainings start in the summer. Trainings are free, bla bla bla).

3-I use title and subtitle to index embedding, and I store the content as meta_data to get to the content of the embed. I use the content as the text of the prompt context.

A simple use case
user question: Is counseling training paid?

The user question is sent to the vector database, and the embedding with the highest score is retrieved (the similarity score is calculated with cosine similarity).

The user question is sent with the prompt to the ChatCompletion API as follows.
“Answer the question as accurately as possible according to the context below.
Context: <embedding[‘content’]>
Q: A:”

Here the context text is the content in my dataset.
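Roughly, the whole flow in code looks like this (vectordb.query here is just a placeholder for my vector database’s search call, and gpt-3.5-turbo stands in for whichever chat model is used):

import openai

# embed the user question with the same model used for the chunks
question = "Is counseling training paid?"
q_embedding = openai.Embedding.create(
    model="text-embedding-ada-002",
    input=question,
)["data"][0]["embedding"]

# retrieve the best-matching chunk (highest cosine similarity)
best_match = vectordb.query(vector=q_embedding, top_k=1)[0]
context = best_match["metadata"]  # the stored 'content' field

# build the prompt and send it to the ChatCompletion API
prompt = (
    "Answer the question as accurately as possible according to the context below.\n"
    f"Context: {context}\n"
    f"Q: {question}\nA:"
)
answer = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)["choices"][0]["message"]["content"]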

What I don’t understand is point 3. What do you mean by “I use title and subtitle to index embedding, and I store the content as meta_data to get to the content of the embed. I use the content as the text of the prompt context.”?

How is your actual call to the embedding endpoint? Can you share it with us?

If I were you, I’d definitely incorporate “title” and “subtitle” as global context of each chunk before embedding it. I’d incorporate some other global metadata as well (timestamps? The author of each document? A short summary of the whole doc, or at least key entities extracted via NER?)

The local context is a little bit more tricky, so I’d probably leave it for later :slight_smile:.

I believe they are embedding only the title and subtitle.

@klcogluberk You should be embedding the title, subtitle, and the content all together for semantic relevance.

The topic, sub-topic, and tags could also be embedded as sparse embeddings using a bag-of-words model, which is roughly what you’re already trying to do, and that’s perfectly fine. As I mentioned, it makes sense to apply different weights to different keywords. You could also use these fields as filters if you wanted.
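A minimal sketch of what I mean by sparse embeddings, using scikit-learn’s TF-IDF as the bag-of-words model (rows and the query string are just illustrative, not your actual data):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# build sparse keyword vectors from the topic / sub-topic fields
docs = [row["title"] + " " + row["subtitle"] for row in rows]
vectorizer = TfidfVectorizer()
sparse_vectors = vectorizer.fit_transform(docs)  # one sparse vector per chunk

# score a query against the sparse vectors (cosine similarity on TF-IDF)
query_vec = vectorizer.transform(["consultant training course fee"])
keyword_scores = cosine_similarity(query_vec, sparse_vectors)[0]

# these scores can then be blended with the dense-embedding scores,
# e.g. 0.3 * keyword_score + 0.7 * dense_score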

Oh gotcha. Yeah, in this case you want to embed the content as well, at least :slight_smile:.

I agree, sparse embeddings make sense in this context, but I’d go for a simpler approach first and just embed all the relevant content, including the chunk’s text + the chunk’s metadata. You can always get more sophisticated from there: weight different embeddings, include filters, or even play with different vector spaces and projections, compute different distances and weight them into a single similarity metric, etc. But yeah, sparse embeddings + filters from the beginning makes perfect sense to me as well!

For sure. Perfecting a single method first is much more efficient than being halfway into multiple methods.


My embedding call, and how I store the vectors in the vector DB, looks like this:
response = openai.Embedding.create(model="text-embedding-ada-002", input=dataset["content"])
vector["id"] = dataset["title"] + dataset["subtitle"]
vector["value"] = response["data"][0]["embedding"]
vector["metadata"] = dataset["content"]

vectordb.insert(vector=[vector["id"], vector["value"], vector["metadata"]])

It seems that you’re only embedding the content of each chunk. As stated before, you should definitely consider embedding other stuff, such as your global metadata (title, subtitle and others)

Actually, it makes sense to me to embed everything, but it was the example in this OpenAI cookbook that confused me. They didn’t include the indexed columns in the embedding. If you take a quick look, you’ll see.

Do you think this approach is right, including all the metadata in the embedding when creating it:
text = df["title"] + " " + df["subtitle"] + " " + df["content"]
openai.Embedding.create(input=text, model="text-embedding-ada-002")

It’s better, but try to give some context about the metadata that you’re using. Something like:

text = f'Document title: {df["title"]}. Document subtitle: {df["subtitle"]}. Chunk content: {df["content"]}.'
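One small note: if df is a pandas DataFrame, that f-string needs to be built row by row, something like this:

for _, row in df.iterrows():
    text = f'Document title: {row["title"]}. Document subtitle: {row["subtitle"]}. Chunk content: {row["content"]}.'
    embedding = openai.Embedding.create(input=text, model="text-embedding-ada-002")["data"][0]["embedding"]
    # store the embedding in the vector DB together with the row's content as metadata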


Thanks, I understand better now. Actually, we have to provide context when we create the embeddings in order for them to be meaningful. I guess that allows us to get more accurate results when matching embeddings against the user question. I think I can use embeddings better now. Do you think I should try to keep the embedded text short, or should I keep it long?

Happy to help. It’s a very good question.
Quick rule of thumb: 4000 characters.
Real solution: it fully depends on your use case, your documents, and the questions that your app faces. I’m usually happy with the results I get with a two-step semantic search, which I explained in this link. Hope it helps!
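Just to illustrate that rule of thumb (a naive sketch; real chunking should cut at paragraph or sentence boundaries rather than at a fixed character count):

MAX_CHARS = 4000  # rule-of-thumb ceiling per chunk

def split_into_chunks(text, max_chars=MAX_CHARS):
    # naive fixed-size split; adjust the boundaries to your documents
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]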


I’ve been thinking of embedding a bunch of books, mostly non-technical advice books, or philosophy books. This would be a side-project, and not work related for me.

Based on the conversations in this thread and others, here is my embedding strategy that I’ve come up with:

  • Embed every 3 paragraphs, sliding one paragraph at a time (so ~66% overlap). Or maybe make all chunks disjoint to make things easier later.
  • Each idea is contained in at most 3 paragraphs (~500 tokens)
  • Each embedding has metadata on starting paragraph number, ending paragraph number (used later to de-overlap and coherentize)
  • Could also contain metadata on Chapter / Author / Page, etc., but really need TITLE so as not to mix books if I need to coherently stick adjacent chunks together. If I go with disjoint non-overlapping chunks, this doesn’t matter so much.
  • I would not mix the metadata in the embedding, have it as separate data and retrieve it for the prompt to examine if necessary, because of the thought:
  • Don’t contaminate the embedding with the metadata; only embed ideas and content, and keep the metadata separate in the DB. I don’t plan on querying on the author/title, which is the main reason. It’s fun to see what pops up, and the metadata will be available in the prompt, since I can return the adjacent metadata, but it won’t be directly embedded.

So here’s my next thought: since GPT-4 has at minimum an 8k context, I was wondering if I should embed more at once, maybe 6 paragraphs at 33% overlap?
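A rough sketch of the sliding window I have in mind (window size and stride are just the numbers above, nothing final):

def sliding_chunks(paragraphs, window=3, stride=1):
    # overlapping windows of paragraphs, plus the paragraph-range metadata I want to keep
    chunks = []
    last_start = max(len(paragraphs) - window, 0)
    for start in range(0, last_start + 1, stride):
        end = min(start + window, len(paragraphs))
        chunks.append({
            "text": "\n\n".join(paragraphs[start:end]),
            "start_paragraph": start,
            "end_paragraph": end - 1,
            # title / chapter / author stay in separate metadata fields,
            # not mixed into the embedded text
        })
    return chunks

For the 6-paragraphs-at-33%-overlap variant, that would be sliding_chunks(paragraphs, window=6, stride=4).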

It’s going to be trial and error.

Then I am going to hook this up to my personal assistant SMS network that I’ve built, so I can use it anywhere in the world from my cell phone.


This all looks good to me. I am genuinely interested as well. My biggest concern is the large overlap.

I have been trying to find a way to dynamically decide when to cut off content for embeddings - which is tough because the content needs to be embedded before the computer can perform analytics on it.

I recently was experimenting with embedding items per paragraph and then applying clustering algorithms to compare how far apart the items are from each other. I haven’t spent much time on it, and what I did was all manual, so its accuracy remains mostly unconfirmed.
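The rough shape of it, simplified down to comparing consecutive paragraph embeddings rather than a full clustering pass (paragraph_embeddings and the 0.80 threshold are just placeholders):

import numpy as np

def cosine_sim(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# compare each paragraph embedding to the next one; a drop in similarity
# suggests a topic shift, i.e. a candidate place to cut the chunk
breaks = [
    i + 1
    for i in range(len(paragraph_embeddings) - 1)
    if cosine_sim(paragraph_embeddings[i], paragraph_embeddings[i + 1]) < 0.80
]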

@RonaldGRuckus

The large overlap is mainly driven by not wanting to chop an idea in half, but admittedly, it’s probably excessive. If I go to bigger chunks, this is less likely, and I can get away with less overlap, but now I have larger prompts and less localization for the embedding to characterize. So there is certainly a trade in overlap and chunk size.

But laziness is also a factor: what if I just did huge disjoint chunks, called it good, and didn’t overthink it? I will probably start there, especially since I plan on using GPT-4 for this, and it has the larger context window.

I thought about using overlaps but never did it

Our initial embedding size was 10 to 15 percent of the total context for the model. We did this because we wanted to combine multiple contexts in a single query.

It has worked well. We did use a bigger percentage for a specific case. It gave better results but required us to iterate through contexts.

We didn’t embed the metadata; we kept it linked to the embedding record via a pointer.

If you go for 10 to 15 percent, then pick out the top 5 or 6 hits. Then you can sort them into the order they appeared in the original text. Sometimes this results in adjacent blocks being put back together, which solved the overlap issue for us, so we didn’t revisit it.
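Roughly like this (the hit list and its score/position/text fields are just illustrative names, not a specific database’s API):

# hits: results from the vector search, each carrying the chunk's position
# in the original document alongside its text and similarity score
top_hits = sorted(hits, key=lambda h: h["score"], reverse=True)[:6]

# re-sort the selected chunks into the order they appeared in the source text;
# adjacent blocks naturally join back up and read as one passage
ordered = sorted(top_hits, key=lambda h: h["position"])
context = "\n\n".join(h["text"] for h in ordered)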


@raymonddavey

I like your idea of embedding as a percent of the expected input prompt. So if I want 5 different contexts, I just embed at 20% of my max input tokens.

BTW, how are you estimating tokens? Do you use tiktoken or back-of-the-envelope (BOE) estimates? I saw you did your own C# tokenizer.

Right now, I tend to use BOE estimates at the character level, since the Byte-Pair encoder seems to chunk characters together.
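For reference, counting with tiktoken only takes a couple of lines (cl100k_base is the encoding used by ada-002 and the chat models):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Consultant trainings start in the summer."
token_count = len(enc.encode(text))

# back-of-the-envelope alternative: roughly 4 characters per token for English text
boe_estimate = len(text) / 4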

I ended up using my own tokenizer. But to be honest I used the wrong one initially and it didn’t make much difference

So in the end, you could say I did a back-of-the-envelope measure.

However, when I did the actual embedding, I stored the token count it returned and used that later when building prompts.

So sometimes I got 5 or 6 contexts in the final prompt, depending on their confirmed length after embedding.
