The length of the embedding contents

Thanks, I understand better now. Actually, we have to provide context when we create the embeddings in order for them to be meaningful. I guess that allows us to get more accurate results when associating embeddings with the user’s question. I think I can use embeddings better now. Do you think I should try to keep the embeddings short, or should I try to keep them long?

Happy to help. It’s a very good question.
Quick rule of thumb: 4000 characters.
Real solution: it fully depends on your use case, your documents, and the questions your app faces. I’m usually happy with the results I get from a two-step semantic search, which I explained in this link. Hope it helps!

1 Like

I’ve been thinking of embedding a bunch of books, mostly non-technical advice books, or philosophy books. This would be a side-project, and not work related for me.

Based on the conversations in this thread and others, here is my embedding strategy that I’ve come up with:

  • Embed every 3 paragraphs, sliding one paragraph at a time (so ~66% overlap), or maybe make all chunks disjoint to make things easier later.
  • Each idea is contained in at most 3 paragraphs (~500 tokens)
  • Each embedding has metadata on starting paragraph number, ending paragraph number (used later to de-overlap and coherentize)
  • Could also contain metadata on Chapter / Author / Page, etc., but I really need TITLE so as not to mix books if I need to coherently stick adjacent chunks together. If I go with disjoint, non-overlapping chunks, this doesn’t matter so much.
  • I would not mix the metadata into the embedding; I’d keep it as separate data and retrieve it for the prompt to examine if necessary, based on this thought:
  • Don’t contaminate the embedding with the metadata; only embed ideas and content, and keep the metadata separate in the DB. I don’t plan on querying on the author/title, and that’s the main reason. It’s fun to see what pops up, and the metadata will be available in the prompt, since I can return the adjacent metadata, but it won’t be directly embedded.

So here’s my next thought: since GPT-4 has at minimum an 8k context, I was wondering if I should embed more at once, maybe 6 paragraphs at 33% overlap?
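Roughly, the chunker I have in mind looks like this (a sketch only: it assumes paragraphs are separated by blank lines, and the field names are just placeholders):

# Sliding window: 3 paragraphs per chunk, advancing 1 paragraph at a time (~66% overlap),
# with paragraph numbers kept as separate metadata rather than embedded text
def chunk_paragraphs(text, window=3, stride=1):
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    for start in range(0, len(paragraphs), stride):
        end = min(start + window, len(paragraphs))
        chunks.append({
            "text": "\n\n".join(paragraphs[start:end]),  # only this gets embedded
            "start_paragraph": start,                    # metadata for de-overlapping later
            "end_paragraph": end - 1,
            # title / chapter / page would live here too, in the DB, not in the embedding
        })
        if end == len(paragraphs):
            break
    return chunks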

It’s going to be trial and error.

Then I am going to hook this up to my personal assistant SMS network that I’ve built, so I can use it anywhere in the world from my cell phone.

8 Likes

This all looks good to me. I am genuinely interested as well. My biggest concern is the large overlap.

I have been trying to find a way to dynamically decide when to cut off content for embeddings - which is tough because the content needs to be embedded before the computer can perform analytics on it.

I was recently experimenting with embedding items as paragraphs and then applying clustering algorithms to compare how far apart the items are from one another. I haven’t spent too much time and effort on it, and what I did was all manual, so it remains largely unconfirmed for accuracy.
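Something along these lines, for anyone curious (assuming you already have one embedding vector per paragraph; the sklearn parameters are illustrative, and sklearn versions before 1.2 call the metric argument affinity):

import numpy as np
from sklearn.metrics.pairwise import cosine_distances
from sklearn.cluster import AgglomerativeClustering

def cluster_paragraphs(embeddings, distance_threshold=0.25):
    # Pairwise cosine distances show how far each paragraph sits from the others
    dist = cosine_distances(np.asarray(embeddings))
    # Group paragraphs that are close in embedding space; cluster boundaries
    # are candidate cut points for deciding where a chunk should end
    clustering = AgglomerativeClustering(
        n_clusters=None,
        metric="precomputed",      # "affinity" on sklearn < 1.2
        linkage="average",
        distance_threshold=distance_threshold,
    )
    labels = clustering.fit_predict(dist)
    return dist, labels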

@RonaldGRuckus

The large overlap is mainly driven by not wanting to chop an idea in half, but admittedly, it’s probably excessive. If I go to bigger chunks, this is less likely and I can get away with less overlap, but then I have larger prompts and less localization for the embedding to characterize. So there is certainly a trade-off between overlap and chunk size.

But laziness is also a factor: what if I just did huge disjoint chunks, called it good, and didn’t overthink it? I will probably start there, especially since I plan on using GPT-4 for this, and it has the larger context window.

1 Like

I thought about using overlaps but never did it

Our initial embedding size was 10 to 15 percent of the total context for the model. We did this because we wanted to combine multiple contexts in a single query.

It has worked well. We did use a bigger percentage for one specific case; it gave better results but required us to iterate through contexts.

We didn’t embed metadata; we kept it linked to the embedding record via a pointer.

If you go for 10 to 15 percent, then pick out the top 5 or 6 hits. You can then sort them into the order they appeared in the original text. Sometimes this would put adjacent blocks back together, which solved the overlap issue, so we didn’t revisit it.
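Roughly, the re-sorting step looks like this (the hit fields are illustrative, not our exact schema):

# hits: the top 5 or 6 retrieved chunks, each carrying its position in the source text,
# e.g. {"doc_id": "report-1", "block_index": 17, "text": "..."}
def reassemble(hits):
    # Put the hits back into the order they appeared in the original document
    ordered = sorted(hits, key=lambda h: (h["doc_id"], h["block_index"]))
    merged = []
    for hit in ordered:
        prev = merged[-1] if merged else None
        # If this block directly follows the previous one, stitch them back together
        if prev and prev["doc_id"] == hit["doc_id"] and hit["block_index"] == prev["block_index"] + 1:
            prev["text"] += "\n" + hit["text"]
            prev["block_index"] = hit["block_index"]
        else:
            merged.append(dict(hit))
    return merged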

5 Likes

@raymonddavey

I like your idea of embedding as a percent of the expected input prompt. So if I want 5 different contexts, I just embed at 20% of my max input tokens.

BTW, how are you estimating tokens? Use tiktoken or back-of-the-envelope (BOE) estimations? I saw you did your own C# tokenizer.

Right now, I tend to use BOE estimates at the character level, since the Byte-Pair encoder seems to chunk characters together.
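For reference, here’s the character-level BOE estimate next to an exact count (cl100k_base is the encoding used by text-embedding-ada-002 and GPT-4):

import tiktoken

text = "Each idea is contained in at most 3 paragraphs."

# Back-of-the-envelope: English prose averages roughly 4 characters per token
boe_estimate = len(text) / 4

# Exact count with tiktoken
enc = tiktoken.get_encoding("cl100k_base")
exact = len(enc.encode(text))

print(boe_estimate, exact)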

1 Like

I ended up using my own tokenizer. But to be honest I used the wrong one initially and it didn’t make much difference

So in the end you could say I did a back-of-the-envelope measure.

However, when I did the actual embedding, I stored the token count it returned and used that later when building prompts.

So sometimes I got 5 or 6 contexts in the final prompt depending on their confirmed length after embedding

3 Likes

@raymonddavey

Genius, since I plan on using ada-002 for embedding, just pull the value from that and put it in the database next to the embedding vector! GPT-4 probably even uses the same tokenizer!
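Something like this is what I’m picturing (OpenAI Python client v1 syntax; store_row is just a stand-in for whatever database layer ends up holding the vectors):

from openai import OpenAI

client = OpenAI()

def embed_and_record(text):
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    vector = resp.data[0].embedding
    token_count = resp.usage.total_tokens   # confirmed count, no re-tokenizing later
    store_row(text=text, embedding=vector, tokens=token_count)  # placeholder persistence call
    return vector, token_count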

3 Likes

I can confirm they are the same tokenizer

And that is what I do too

But now my C# tokenizer gets me within a few tokens. They seem to have an overhead for each message in the new chat protocol.
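For anyone estimating chat-format prompts, an approximation along these lines gets within a few tokens (the per-message and reply-priming constants are approximate and vary by model, so treat them as tunable):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def estimate_chat_tokens(messages, tokens_per_message=4, reply_priming=3):
    # Each message carries a small fixed overhead in the chat format,
    # which is why a plain tokenizer count comes up a few tokens short
    total = reply_priming
    for m in messages:
        total += tokens_per_message
        total += len(enc.encode(m["role"])) + len(enc.encode(m["content"]))
    return total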

2 Likes

This looks really amazing!
You said this:

“I have a classifier that determines if the question is a “general” or a “specific” question.”

Is there any document or sample you can suggest for building a classifier like this? I use the title and subtitle when indexing embeddings into the vector database; do you think this makes sense? Are indexes important for semantic search? If not, giving each record a unique random ID may be the solution.

1 Like

It may be necessary to use metadata in the prompt, but when creating embeddings, if there is important information in the metadata that helps match the correct embedding to the user’s question, I think you should include that metadata in the embedded text. In order for the question to match the embedding, that metadata has to have been part of what was embedded.

1 Like

GPT wrote code like this:

import pandas as pd
from typing import List

# Function to split the text column of a DataFrame into overlapping character chunks
def split_dataframe(df: pd.DataFrame, window_size: int = 2000, overlap: int = 1000) -> List[str]:
    chunks = []

    for _, row in df.iterrows():
        text = row['text']
        start = 0

        while start < len(text):
            end = start + window_size

            # Move end to the right until whitespace is encountered,
            # so a chunk never ends in the middle of a word
            while end < len(text) and not text[end].isspace():
                end += 1

            chunk = text[start:end]
            if len(chunk) > 0:
                chunks.append(chunk)

            # Move start to the right by window_size - overlap
            start += window_size - overlap
            # Then nudge start forward until it sits just after whitespace,
            # so the next chunk begins on a word boundary
            while start < len(text) and not text[start - 1].isspace():
                start += 1

    return chunks
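For a quick test, something like this should work (any DataFrame with a 'text' column):

import pandas as pd

df = pd.DataFrame({"text": ["Lorem ipsum dolor sit amet. " * 300]})
chunks = split_dataframe(df, window_size=2000, overlap=1000)
print(len(chunks), [len(c) for c in chunks])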

Thank you for sharing. Would love to hear some feedback on whether this strategy works well once you’ve built it.

You should prompt something like this:

This code was evaluated with a score of 3.25. Can you enhance that to 7.5, please?

[the code]

66% overlap? Isn’t that too much? Wouldn’t 33% be enough?
How much did it cost you to “embed” your books with your technique? I can’t find any data on this, can you please give us more details?

Assuming the average book has 100,000 words, that is about 133,333 tokens (at roughly 0.75 words per token). Embedding costs $0.0001 per 1,000 tokens (at current ada-002 pricing). There are roughly 133 different 1,000-token chunks in a book, so the cost is 133 * $0.0001, or roughly $0.01.

A 50% overlap would double this to $0.02.

The cost is nothing!

You will spend more on database and server costs. But you can still get this down to a few cents (or a few bucks, depending on usage) per month per book if you do it smartly and avoid expensive vector databases.
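As a quick sanity check on that arithmetic (same assumptions as above, roughly 0.75 words per token):

words_per_book = 100_000
tokens = words_per_book / 0.75              # ~133,333 tokens
price_per_1k_tokens = 0.0001                # ada-002 pricing quoted above

base_cost = tokens / 1000 * price_per_1k_tokens   # ~$0.013
with_50_percent_overlap = base_cost * 2           # ~$0.027

print(round(base_cost, 3), round(with_50_percent_overlap, 3))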

1 Like

Hey guys,

Just sharing my strategy on this (maybe saving someone’s day):

  1. Get the raw text as a string (full text).
    1.1 Normalize the string (fix spaces and line ends, remove empty lines, etc.).
  2. Split into chunks on lines ending with sentence-stop punctuation (. ! ? ." !" ?") using a regex; this way you’re most likely to chop the text at an actual paragraph end, which is very much needed when dealing with PDF copy-paste (see the sketch at the end of this post).
    2.1 Check the chunk length; if it’s over the model cap, try to split it on sentence-end punctuation.
  3. Using a fine-tuned model, run each chunk through a “formatter” - the goal is to make sure each title and list item is on its own line (to separate them from simple paragraphs).
  4. Join the chunks back together and split on line ends to get each line separately.
  5. Run each line through a fine-tuned classifier to determine whether the line is a:
  • title
  • list item
  • paragraph
  • document meta (page number, doc date, version, etc.)
  6. Starting from the first classified line and moving towards the end, apply a simple algorithm:
  • start a new section if the line is a title or document meta,
  • add the current line to the current section if it is a paragraph or list item; start a new section if not.
    At this point you will end up with logical sections that either start with their title, or have no title and describe the document (doc meta).
  7. Check whether each section fits your target size for embedding/retrieval or needs a further split. You can split it the same way as in step 2.
  8. Embed the section (or its parts) along with the title and an ID/NUMBER.

When retrieving sections, use the ID/NUMBER to find adjacent sections if you need wider context and it fits into your answering model’s prompt.
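A rough sketch of the splitting in step 2 (the regex and the size cap are illustrative, not the exact ones I use):

import re

# Split on line ends preceded by sentence-stop punctuation (. ! ? and ." !" ?")
SENTENCE_STOP_LINE_END = re.compile(r'(?:(?<=[.!?])|(?<=[.!?]["”]))\n')

def split_into_chunks(text, max_chars=4000):
    chunks = [c.strip() for c in SENTENCE_STOP_LINE_END.split(text) if c.strip()]
    result = []
    for chunk in chunks:
        if len(chunk) <= max_chars:
            result.append(chunk)
            continue
        # Over the cap: fall back to splitting on sentence-end punctuation anywhere
        sentences = re.split(r'(?<=[.!?])\s+', chunk)
        buf = ""
        for s in sentences:
            if buf and len(buf) + len(s) + 1 > max_chars:
                result.append(buf.strip())
                buf = ""
            buf += s + " "
        if buf.strip():
            result.append(buf.strip())
    return result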

3 Likes

This is my approach: https://youtu.be/w_veb816Asg

And, not only am I including metadata in my embeddings, but I also generate questions that each document answers, to sort of “spike” the contextual intent of the content.
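For illustration, that composition could look something like this (the field names and the generated-question list are placeholders, not the exact format from the video):

def build_embedding_input(doc):
    # Prepend light metadata plus a few questions the passage answers,
    # so question-shaped queries land closer to the right chunk
    header = f"Title: {doc['title']}\nSection: {doc['section']}"
    questions = "\n".join(doc["generated_questions"])  # produced earlier, e.g. by a chat model
    return f"{header}\n\nQuestions this passage answers:\n{questions}\n\n{doc['text']}"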

For me, 2,500-character chunks have been working, but as @AgusPG says, it really, really depends on your use case, type of documents, anticipated questions, desired responses, etc…

Good luck!

This is VERY good advice, because now you are chunking based upon the semantic structure of your source document, as opposed to arbitrary cuts in your document based upon chunk size. This is guaranteed to give you much better results.