The length of the embedding contents

Oh gotcha. Yeah, in this case you want to embed the content as well, at least :slight_smile:.

I agree, sparse embeddings make sense in this context, but I’d go for a simpler approach first and just embed all the relevant content, including the chunk’s text + the chunk’s metadata. You can always get more sophisticated from there: weight different embeddings, include filters, or even play with different vector spaces and projections, compute different distances and weight them into a single similarity metric, etc. But yeah, sparse embeddings + filters from the beginning makes perfect sense to me as well!
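
As an illustration of that last idea, here’s a minimal sketch of weighting two cosine similarities (a content embedding vs. a metadata embedding) into one score; the argument names and the 0.7/0.3 weights are just placeholders, not a recommendation:

import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def combined_score(query_vec, content_vec, metadata_vec, w_content=0.7, w_metadata=0.3):
    # Weight the content match more heavily than the metadata match (weights are arbitrary)
    return (w_content * cosine_similarity(query_vec, content_vec)
            + w_metadata * cosine_similarity(query_vec, metadata_vec))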

For sure. Perfecting a single method first is much more efficient than being halfway into multiple methods.

This is my embedding call, and I store the vectors in the vector DB like this:
response = openai.Embedding.create(model="text-embedding-ada-002", input=dataset["content"])
vector = {}
vector["id"] = dataset["title"] + dataset["subtitle"]
vector["value"] = response["data"][0]["embedding"]
vector["metadata"] = dataset["content"]

vectordb.insert(vector=[vector["id"], vector["value"], vector["metadata"]])

It seems that you’re only embedding the content of each chunk. As stated before, you should definitely consider embedding other things as well, such as your global metadata (title, subtitle and others).

Actually, it makes sense to me to embed everything, but it was the example in this OpenAI cookbook that confused me: they didn’t include the indexed columns in the embedding. If you take a quick look, you’ll see.

Do you think this approach, where I include all the metadata when creating the embeddings, is right:
text = df["title"] + " " + df["subtitle"] + " " + df["content"]
openai.Embedding.create(input=text, model="text-embedding-ada-002")

It’s better, but try to give some context about the metadata that you’re using. Something like:

text = f'Document title: {df["title"]}. Document subtitle: {df["subtitle"]}. Chunk content: {df["content"]}.'

Thanks, I understand better now. Actually, we have to provide context when we create the embeddings in order for them to be meaningful. I guess this allows us to get more accurate results when matching embeddings against the user’s question. I think I can use embeddings better now. Do you think I should try to keep the embedded text short, or should I try to keep it long?

Happy to help. It’s a very good question.
Quick rule of thumb: 4,000 characters.
Real solution: it fully depends on your use case, your documents, and the questions that your app faces. I’m usually happy with the results that I get with a two-step semantic search, which I explained in this link. Hope it helps!

I’ve been thinking of embedding a bunch of books, mostly non-technical advice books, or philosophy books. This would be a side-project, and not work related for me.

Based on the conversations in this thread and others, here is my embedding strategy that I’ve come up with:

  • Embed every 3 paragraphs, sliding one paragraph at a time (so ~66% overlap), or maybe make all chunks disjoint to make things easier later (see the sketch after this list).
  • Each idea is contained in at most 3 paragraphs (~500 tokens)
  • Each embedding has metadata on starting paragraph number, ending paragraph number (used later to de-overlap and coherentize)
  • Could also contain metadata on Chapter / Author / Page, etc., but really need TITLE so as not to mix books if I need to coherently stick adjacent chunks together. If I go with disjoint non-overlapping chunks, this doesn’t matter so much.
  • I would not mix the metadata in the embedding, have it as separate data and retrieve it for the prompt to examine if necessary, because of the thought:
  • Don’t contaminate the embedding with the metadata; only embed ideas and content, and keep the metadata separate in the DB. I don’t plan on querying on the author/title; that’s the main reason for me. It’s fun to see what pops up, and the metadata will be available in the prompt, since I can return the adjacent metadata, but it won’t be directly embedded.
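
A minimal sketch of that sliding-window chunking, assuming the book has already been split into a list of paragraphs (the function and field names here are mine, not anything standard):

from typing import Dict, List

def chunk_paragraphs(paragraphs: List[str], title: str,
                     chunk_size: int = 3, stride: int = 1) -> List[Dict]:
    # Slide a window of chunk_size paragraphs forward stride paragraphs at a time;
    # stride=1 with chunk_size=3 gives ~66% overlap, stride=chunk_size gives disjoint chunks
    chunks = []
    for start in range(0, len(paragraphs), stride):
        end = min(start + chunk_size, len(paragraphs))
        chunks.append({
            "text": "\n\n".join(paragraphs[start:end]),
            "title": title,                # kept as metadata, not embedded
            "start_paragraph": start,
            "end_paragraph": end - 1,
        })
        if end == len(paragraphs):
            break
    return chunks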

So here’s my next thought: since GPT-4 has at minimum an 8k context, I was wondering if I should embed more at once, maybe 6 paragraphs at 33% overlap?

It’s going to be trial and error.

Then I am going to hook this up to my personal assistant SMS network that I’ve built, so I can use it anywhere in the world from my cell phone.

This all looks good to me. I am genuinely interested as well. My biggest concern is the large overlap.

I have been trying to find a way to dynamically decide when to cut off content for embeddings - which is tough because the content needs to be embedded before the computer can perform analytics on it.

I was recently experimenting with embedding individual paragraphs and then applying clustering algorithms to compare how far each item is from the others. I haven’t spent too much time and effort on it, and what I did was all manual, so the accuracy remains mainly unconfirmed.
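
For anyone who wants to try the same experiment, here’s a rough sketch of that kind of clustering with scikit-learn; the number of clusters and the use of cosine distances are arbitrary choices on my part, not what was actually done above:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_distances

def cluster_paragraphs(embeddings: np.ndarray, n_clusters: int = 5):
    # embeddings is an (n_paragraphs, 1536) array of ada-002 vectors
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)
    # Pairwise cosine distances show how far each paragraph is from the others
    distances = cosine_distances(embeddings)
    return labels, distances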

@RonaldGRuckus

The large overlap is mainly driven by not wanting to chop an idea in half, but admittedly, it’s probably excessive. If I go to bigger chunks, this is less likely, and I can get away with less overlap, but now I have larger prompts and less localization for the embedding to characterize. So there is certainly a trade in overlap and chunk size.

But laziness is also a factor, and what if I just did huge disjoint chunks, called it good, and didn’t overthink it? I will probably start there, especially since I plan on using GPT-4 for this, and it has the larger context window.

I thought about using overlaps but never did it

Our initial embedding size was 10 to 15 percent of the total context for the model. We did this because we wanted to combine multiple contexts in a single query.

It has worked well. We did use a bigger percentage for a specific case. It gave better results but required us to iterate through contexts.

We didn’t embed metadata, and kept it linked to the embedding record via a pointer.

If you go for 10 to 15 percent and then pick out the top 5 or 6 hits, you can sort them into the order they appeared in the original text. Sometimes this would result in adjacent blocks being put back together, which solved the overlap issue, so we didn’t revisit it.
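
A sketch of that re-ordering step, assuming each hit carries its position in the original text (the field names are hypothetical):

from typing import Dict, List

def reorder_hits(hits: List[Dict], top_k: int = 6) -> List[Dict]:
    # Take the top_k hits by similarity score, then put them back into document order
    best = sorted(hits, key=lambda h: h["score"], reverse=True)[:top_k]
    return sorted(best, key=lambda h: h["position"])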

@raymonddavey

I like your idea of embedding as a percent of the expected input prompt. So if I want 5 different contexts, I just embed at 20% of my max input tokens.

BTW, how are you estimating tokens? Use tiktoken or back-of-the-envelope (BOE) estimations? I saw you did your own C# tokenizer.

Right now, I tend to use BOE estimates at the character level, since the Byte-Pair encoder seems to chunk characters together.
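
For comparison, a quick sketch of both approaches; the roughly 4 characters per token figure is only a loose average for English text, not an exact rule:

import tiktoken

def count_tokens(text: str) -> int:
    # Exact count with the cl100k_base encoding (used by text-embedding-ada-002 and the chat models)
    return len(tiktoken.get_encoding("cl100k_base").encode(text))

def estimate_tokens(text: str) -> int:
    # Back-of-the-envelope estimate: roughly 4 characters per token for English
    return len(text) // 4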

I ended up using my own tokenizer. But to be honest I used the wrong one initially and it didn’t make much difference

So in the end, you could say I did a back-of-the-envelope measure.

However, when I did the actual embedding, I stored the token count it returned and used that when building prompts in the future.

So sometimes I got 5 or 6 contexts in the final prompt, depending on their confirmed length after embedding.

@raymonddavey

Genius! Since I plan on using ada-002 for embedding, I can just pull the value from that and put it in the database next to the embedding vector. GPT-4 probably even uses the same tokenizer!
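
Something like this, as a sketch; the vectordb.insert call just mirrors the earlier example in this thread and isn’t a real client API:

response = openai.Embedding.create(model="text-embedding-ada-002", input=chunk_text)

record = {
    "value": response["data"][0]["embedding"],
    # Token count reported back by the API, stored next to the vector for prompt budgeting later
    "token_count": response["usage"]["total_tokens"],
    "metadata": chunk_text,
}
vectordb.insert(record)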

I can confirm they are the same tokenizer

And that is what I do too

But now my C# tokenizer gets me within a few tokens. They seem to have an overhead for each message in the new chat protocol.
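
If you want to account for that per-message overhead when estimating, something along these lines works; the constants here are assumptions and have varied between model versions:

import tiktoken

def estimate_chat_tokens(messages, tokens_per_message: int = 4, reply_priming: int = 3) -> int:
    # Each message carries a few extra tokens for its role/content wrapping in the chat format,
    # and the reply is primed with a few more; both constants are approximations
    enc = tiktoken.get_encoding("cl100k_base")
    total = reply_priming
    for message in messages:
        total += tokens_per_message
        total += len(enc.encode(message["content"]))
    return total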

This looks really amazing!
You said this:

“I have a classifier that determines if the question is a “general” or a “specific” question.”

Is there any document or sample you can suggest for building something like this classifier? I use the title and subtitle when indexing embeddings in the vector database; do you think this makes sense? Are the index values important for semantic search? If not, giving each record a unique random ID may be the solution.

It may be necessary to use the metadata in the prompt, but when creating embeddings, if there is important data in the metadata that helps match the correct embedding to the user’s question, I think you should include that metadata in the embeddings. In order for the question to match against the embedding, that information must have been part of the embedded text.

GPT wrote code like this:

from typing import List

import pandas as pd

# Function to split each row's text column into overlapping character-based chunks
def split_dataframe(df: pd.DataFrame, window_size: int = 2000, overlap: int = 1000) -> List[str]:
    chunks = []

    for _, row in df.iterrows():
        text = row['text']
        start = 0

        while start < len(text):
            end = start + window_size

            # Move end to the right until whitespace is encountered, so words aren't cut in half
            while end < len(text) and not text[end].isspace():
                end += 1

            chunk = text[start:end]
            if len(chunk) > 0:
                chunks.append(chunk)

            # Move start to the right by window_size - overlap
            start += window_size - overlap
            # Then move start to the right until it sits on a word boundary
            while start < len(text) and not text[start - 1].isspace():
                start += 1

    return chunks