Why is it necessary for each chunk's embedding to be a weighted average when dealing with the embedding of long texts?

I have a question regarding the example provided in the following openai-cookbook.

Embedding texts that are longer than the model’s maximum context length

I am curious about the rationale behind utilizing a weighted average for each chunk’s embedding. I noticed there is a flag available to calculate this weighted average, with a default value of True. Consequently, it appears that the expectation is for each chunk’s embedding to be a weighted average.

I tried to find relevant information, but I have been unable to find an explanation for the necessity of this weighted average approach. While it might be evident to others, I am struggling to comprehend the underlying reasons for employing this process.

For instance, in scenarios where the token count is divided into 8191 and 8191, it is clear that the two vectors are simply split in half. Similarly, when dealing with token counts like 8191 and 4095, a weighted average can be applied to generate an averaged vector that captures the overall meaning. However, my confusion arises when considering cases like 8191 tokens versus a significantly smaller count, such as 100 tokens. In such instances, I am uncertain if the weighted average retains its significance.

Furthermore, I am not sure whether it is valid to solely rely on the token count as the basis for weighting the vector of embeddings in the first place.

Thank you.

I could not post this with the link included; I got the following error:

An error occurred: Sorry, you can't include links in your posts.

So, this is the link to the example with spaces between ‘.’ and ‘/’:
github . com / openai / openai-cookbook / blob / main / examples / Embedding_long_inputs.ipynb

What is described doesn’t make much sense. “the two vectors are simply split in half”… etc

The embedding of a very large segment of text (the kind you describe, larger than you’d want to feed to an AI model) may capture the overall topic well, but it may score low similarity against a question about something that appears in only one line of that documentation.

Split that large piece into smaller chunks instead. Spreading and blurring the large piece’s embedding across them? You’d likely get effects like better similarity matching on the overall topic, with reduced similarity matching on the one chunk that actually contains the relevant text.

There are different subtle weights you could add to encourage or discourage some parts of the text, but there is no set answer, and often no way to tune it without human understanding or pre-preparation.

For example, you might have 100 chunks of baseball articles, but only 10 chunks of cricket articles. If you ask “how long a bat is”, one type of knowledge documentation could dominate the returns you get just by quantity, skewing AI answering.

However, for “what’s the latest news in sports today?”, those same weights have now discouraged the correct fraction of results that should be baseball.


Here’s a working link to the GitHub repository for reference.


Thank you for your reply!

So, a long text needs to be split into chunks, its embeddings calculated, and a weighted average computed. This process aims to diminish the dominance of the lengthy text when compared to others. The weighted average is not about the long text by itself, but about how it establishes a relationship with other texts. Is that correct?

I am still in the process of grasping the concept, but it has started making sense.

This is the first I’ve heard of such a technique. Some aspects do have merit, but employing it would take the creative insight and algorithmic skill of the implementer, knowing what they want to accomplish, informed by their own experience of where a simple similarity match fails to return ideal results on the types of inputs being matched against the data pieces.

I would start instead with data augmentation of the chunks, which can be done by AI.

Father: Listen, lad. I built this kingdom up from nothing. When I started here, all there was was swamp. Other kings said I was daft to build a castle on a swamp, but I built it all the same, just to show 'em. It sank into the swamp. So, I built a second one. That sank into the swamp. So, I built a third one. That burned down, fell over, then sank into the swamp, but the fourth one… stayed up! And that’s what you’re gonna get, lad: the strongest castle in these islands.

That would be a lot better matched for similarity, and also much better understood by an AI that might use the knowledge to answer, with some additional metadata text inserted into the chunk:

Monty Python and the Holy Grail - Movie Script
Chunk 5 of Scene 13, of 25 scenes total
keywords: Swamps, Castle, Sir Lancelot, Prince Herbert, Marriage, Singing
In Scene 13, Prince Herbert expresses his disinterest in inheriting his father’s kingdom and instead wants to pursue his passion for singing.

Thus no reweighting attempts are needed to make the chunk vector colored by the surrounding text.
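As a rough sketch (my own illustrative helper, not code from the cookbook or the tutorial), the augmentation can be as simple as gluing a metadata header onto the chunk text before it is sent off to be embedded:

    # Sketch only: prepend descriptive metadata to a chunk before it is embedded,
    # so the stored vector carries context that the raw excerpt lacks.
    # (augment_chunk is a made-up helper; the script text is truncated here.)

    def augment_chunk(chunk_text, title, location, keywords, summary):
        """Build the text that actually gets sent to the embeddings endpoint."""
        header = "\n".join([
            title,
            location,
            "keywords: " + ", ".join(keywords),
            summary,
        ])
        return header + "\n" + chunk_text

    augmented = augment_chunk(
        chunk_text="Father: Listen, lad. I built this kingdom up from nothing...",
        title="Monty Python and the Holy Grail - Movie Script",
        location="Chunk 5 of Scene 13, of 25 scenes total",
        keywords=["Swamps", "Castle", "Prince Herbert", "Marriage", "Singing"],
        summary="Prince Herbert wants to pursue singing, not inherit the kingdom.",
    )
    print(augmented)  # this combined text, not the bare excerpt, is what gets embedded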

Honestly, it really depends on what your goals are and what you want to accomplish with the embeddings.

When you’re averaging the parts of a larger embedding, the idea is to get a single embedding vector to more-or-less accurately reflect the semantic meaning of the whole.

A natural question to ask might be, “if the whole document is larger than 8k tokens, why would we care about retrieving the whole if we cannot fit it into context anyway?” And that’s a great question!

One potential case might be if the goal of the retrieval is simply to identify relevant documents as a whole in order to reference them. The theory might go something like, you embed the complete texts of a bunch of novels, then you might be able to retrieve the book based on a loose description of it.

E.g.:

User

It’s the book where Napoleon invades Russia and everything goes haywire, but it’s also about the high-society shenanigans of these Russian families who are somehow both super relatable and kind of hard to keep track of.

Embedding Retrieval

War and Peace

Another might be to do so as a first pass for retrieval—getting the most relevant whole documents before drilling down into the most relevant snippets. I am sure there are many others.

I haven’t seen many use-cases though where developers were interested in embedding entire documents[1], far more common is to chunk documents into logical pieces[2] and just do the retrieval from that.

With regard to the weighted average between embeddings with a different number of source tokens… I have mixed feelings about that personally because it implies there is a correlation between length and importance. There are plenty of scientific papers (for instance) where the most important bit is relatively small compared to the whole and it would likely be better to weight that bit much stronger than the rest.

At the end of the day embeddings and retrievals are fast-moving and dynamic areas of active research. All OpenAI is doing with this particular notebook is demonstrating an example of how you can generate what is effectively an embedding vector for a larger-than-context bit of text.


  1. Though I am sure there are plenty of them out there. ↩︎

  2. Often but not always with some overlap ↩︎

The code the OP is talking about is here:

It’s basically saying that, to overcome massive inputs (more than 8k tokens), you can slice the input into chunks, embed the chunks, and correlate each input “sub-chunk” to your targets in a weighted-average fashion, weighted by token count.

Most of the time, folks will not have MASSIVE inputs beyond 8k tokens coming in to correlate to their targets. But if you do, this is a great way to correlate them anyway and stay within the 8k window requirement.

This is only for inputs, and all targets are assumed to be sub-8k compliant.

So how do you correlate an incoming book to a series of smaller 8k chunks? You use this weighted average gizmo on the input book.

In the absence of massive inputs, just correlate as usual. The problem is usually in the other direction: the target document has to be chunked because it is too big. In this case, slice it into some chunk size determined by your target LLM buffer and the expected input and output allocations.

So if you are using 8k GPT-4, and want a history of 4k tokens and new RAG content of 2k tokens, that leaves 2k tokens for the response (8k total). So with 2k of input RAG, how many chunks do you want to allow in the worst case? If you want 1 chunk only, then your chunk size is 2k tokens … for a worst case of 2 chunks returned, you chunk at 1k.

So RAG/N = chunk size, where RAG + HISTORY + RETURN = Model Window Size, and N is the worst case highest number of chunks you must accommodate.
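As a quick sanity check of that formula, here is the same arithmetic written out (illustrative numbers only, matching the 8k example above):

    # Worked example of RAG/N = chunk size, using the 8k GPT-4 numbers above.
    MODEL_WINDOW = 8000   # total context window
    HISTORY = 4000        # tokens reserved for conversation history
    RETURN = 2000         # tokens reserved for the model's response

    RAG = MODEL_WINDOW - HISTORY - RETURN   # 2000 tokens left for retrieved chunks

    for N in (1, 2, 4):   # worst-case number of chunks you must accommodate
        print(f"N={N}: chunk size = {RAG // N} tokens")
    # N=1: chunk size = 2000 tokens
    # N=2: chunk size = 1000 tokens
    # N=4: chunk size = 500 tokens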

PS. The flags are in the provided wrapper code, and not in the API.

It’s also describing a technique for creating a single vector and returning a massive chunk sized beyond the embedding (or language) AI’s capabilities, not, as I had assumed, a way of considering the surrounding data when chunking smaller.


    # (ending portion of the cookbook's long-input embedding helper; `np` is numpy)
    if average:
        # token-count-weighted mean of the per-chunk vectors
        chunk_embeddings = np.average(chunk_embeddings, axis=0, weights=chunk_lens)
        chunk_embeddings = chunk_embeddings / np.linalg.norm(chunk_embeddings)  # normalizes length to 1
        chunk_embeddings = chunk_embeddings.tolist()
    return chunk_embeddings

Good for searching a database of PDFs and getting a 500k PDF returned to you. Not good for augmentation.
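To put numbers on the weighting itself, and on the 8191-vs-100-token case from the original question, here is a toy example (3-dimensional vectors standing in for the real 1536-dimensional ones). The weights are just the chunk token counts, so the tiny chunk barely moves the combined vector:

    import numpy as np

    chunk_embeddings = np.array([[0.9, 0.1, 0.0],    # embedding of the 8191-token chunk
                                 [0.0, 0.0, 1.0]])   # embedding of the 100-token chunk
    chunk_lens = [8191, 100]

    print(np.array(chunk_lens) / sum(chunk_lens))    # weights: about [0.988, 0.012]

    combined = np.average(chunk_embeddings, axis=0, weights=chunk_lens)
    combined = combined / np.linalg.norm(combined)   # renormalize to unit length
    print(combined)  # lands almost exactly on the long chunk's direction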

For some reason my mind gets it but my brain is refusing to put this into context.

If I have a trillion word document and I get a vector for each 8k chunk of that trillion and then normalise them, then weight them by token count, add them all up and divide by (1 trillion/8000)… that’s a valid vector?

Good point. It’s doing that too. Not sure if that’s advised simply because I don’t assume that the latent embedding space is linear like this.

What happens when I average a purse and a shirt?

I get a SHOE!

Not what I would expect, but oh well. I would avoid the vector output, unless there is a good known track record of success in doing this in this embedding space! :rofl:

Stick to correlating on meaning and averaging the scalars post-correlation. That’s the only thing this is quasi-expected to perform well at.
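A toy sketch of what that alternative looks like (my own example, not cookbook code): embed each piece of the oversized input, correlate each piece to the target, then take the token-weighted average of the similarity scalars instead of averaging the vectors:

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    target = np.array([1.0, 0.0, 0.0])                # embedding of one target chunk
    sub_chunks = [np.array([0.9, 0.1, 0.0]),          # embeddings of the oversized input,
                  np.array([0.1, 0.9, 0.0])]          # split into two pieces
    chunk_lens = [8191, 100]                          # token counts of those pieces

    scores = [cosine(c, target) for c in sub_chunks]  # correlate each piece first...
    overall = np.average(scores, weights=chunk_lens)  # ...then average the scalars
    print(scores, overall)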


Nope. Not valid. Think of this thought experiment: embed the entire corpus of books, internet, blogs, etc. that the model was trained on, using the method in the Cookbook. That is the average of billions and billions of vectors, pointing in all sorts of random directions.

Now average this into one vector and normalize it back out to unit length. What is this vector? It’s GARBAGE!!! (Note: especially if everything is evenly distributed, the average vector is all zeros, and you are trying to normalize the zero vector!)

However, on the flip side, say you sliced the entire internet and correlated each slice to your chunks. Which of your chunks has the max weighted average of correlation scalars? It’s the chunk of your data most correlated to all the internet, books, etc. Is this valid? YES, it is 100% valid!

PS It would be cool to have the vector of the average of the entire internet … I’m seeing some NFT potential here … what is this vector worth to you, a million bucks? :rofl:

At 1536*64 bits for this vector, you are getting a steal at $10.17 per bit.

Thank you all for your replies!

Apologies, I believe I might be a bit perplexed. However, this is my interpretation of _j’s post.

Now, let’s consider the following sentences related to baseball and cricket.

Here are six sentences regarding a baseball bat:

The baseball bat, an iconic piece of sports equipment, embodies the essence of America’s pastime. Crafted with precision and care, it serves as a conduit between player and ball. These bats come in various sizes, typically ranging from 28 to 34 inches in length, tailored to suit individual players’ preferences and hitting styles. The size of the baseball bat plays a crucial role in determining a player’s swing mechanics and power potential. A longer bat offers extended reach, allowing hitters to cover more of the plate, while a shorter bat provides greater control. The selection of the right bat size is a meticulous process, reflecting the fusion of technology and tradition that defines the sport and empowers players to excel on the diamond.

Here are two sentences regarding a cricket bat:

The cricket bat, symbolizing tradition and finesse in the gentleman’s game, follows strict size parameters set by cricket laws, with a blade length not surpassing 38 inches and width of 4.25 inches. This ensures a delicate equilibrium between power and control, accommodating diverse playing styles and pitch conditions.

Imagine we calculate embeddings, sentence by sentence, as shown in the following image. Upon calculating embeddings for each sentence, it becomes apparent that the sentences related to baseball have a dominant presence in the outputs generated by the LLM.

Furthermore, it’s worth noting that not every sentence contains the term “baseball”; the same situation is observed for “cricket.”

If we were to utilize the embeddings as they are, the outcome would likely be heavily skewed towards baseball-related content.

However, the introduction of a weighted average could serve to normalize the entire paragraph while also disseminating additional contextual information about baseball (or cricket) to each individual sentence.


While I may be entertaining an unconventional thought, I’m struggling to grasp the reasoning behind the expected use of a weighted average in situations involving lengthy textual content.

Another aspect that perplexes me is the presence of a tutorial in the OpenAI documentation. Surprisingly, the tutorial does not employ weighted averaging for embeddings. Instead, it opts to segment lengthy text into shorter chunks, each containing fewer than 500 tokens. While the choice of the number 500 causes me some confusion, this strategy merely involves breaking down extensive text into condensed segments, which prompts me to wonder what disparities might arise between these two approaches.

How to build an AI that can answer questions about your website
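For reference, that style of fixed-size chunking can be sketched roughly like this (assuming the tiktoken tokenizer and no overlap; the tutorial’s actual splitting logic may differ in detail):

    import tiktoken  # pip install tiktoken

    def split_into_chunks(text, max_tokens=500, encoding_name="cl100k_base"):
        """Split text into pieces of at most max_tokens tokens each."""
        enc = tiktoken.get_encoding(encoding_name)
        tokens = enc.encode(text)
        return [enc.decode(tokens[i:i + max_tokens])
                for i in range(0, len(tokens), max_tokens)]

    chunks = split_into_chunks("some very long page text " * 2000)
    print(len(chunks), "chunks, each embedded separately with no averaging")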

What is very obscure and takes intuition to understand is that the embedding contains a representation of the internal state of the AI: how its mind-frame has been colored up to that point by the text it has received, and by relations drawn from terabytes of training knowledge.

By using a completion engine, we can sort of tell what the AI is thinking about by the end of the baseball text, and make it continue writing with “The”:

The following guide will help you choose the right bat for your game.

How to Choose the Right Bat Size

The first step in choosing the right bat size is to determine your height and weight. The chart below provides a general guideline for selecting the right bat size based on height and weight. However, it is important to note that this chart is not a substitute for proper bat fitting. It is merely a starting point for selecting the right bat size.

So it reads about baseball bats, and wants to offer more about baseball bats. There’s a lot revealed that we aren’t ready to perceive: that it’s English, a composition, instructive, using inches…and countless unimaginable others activated, like New York vs Prague, mammals vs reptiles, Twitter vs Reddit. This ain’t no keyword search.

What about the third sentence alone?

The ball is stitched by hand and made of the highest quality materials. It is then carefully inspected to ensure it meets the highest standards of quality. The result is a perfect blend of tradition and innovation.
The Nike Ordem 4 is the official match ball of the 2016/17 Premier League season.

We’ve got a disconnect between the topic and the understanding. It’s colored by where the thought was left. Well, guess what? We’re less likely to get that filler sentence when connections remain about sporting equipment, making products, etc.

So let’s say we take a concept (easier to understand than internal weighting) that similarly seeks to make the whole passage better understood: take the paragraph’s embedding vector and “add” it to each of the sentence vectors, to give what we store about the chunk. Question: how much would relevant chunks be improved or degraded, and imagine the performance change when we then want only the best two matches for our question (which already scores differently on “inquisitiveness”)?

(AI knowledge amendment might instead be many paragraphs of chunk)
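Numerically, that concept is just a vector sum followed by renormalization (toy 3-dimensional vectors here, purely to illustrate the idea, not a recommendation):

    import numpy as np

    def unit(v):
        return v / np.linalg.norm(v)

    paragraph_vec = unit(np.array([0.7, 0.7, 0.1]))     # whole-paragraph embedding
    sentence_vecs = [unit(np.array([1.0, 0.0, 0.0])),   # per-sentence embeddings
                     unit(np.array([0.0, 1.0, 0.2]))]

    # What gets stored per chunk: each sentence vector "colored" by the paragraph.
    stored = [unit(s + paragraph_vec) for s in sentence_vecs]
    for v in stored:
        print(v)  # each sentence now leans toward the paragraph's overall direction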

Thank you for sharing this information with me! I would appreciate some time to review and comprehend it before I get back to you. Thank you very much!

I already made the point that one could emphasize rarer material, or material that is more likely relevant but poor in question context.

I suggest that the topic with 100 chunks is not more “batty” than the one with 10 chunks, but rather, if we roll the dice (which is closer to the effect of comparing the similarity of a short question to a PDF chunk), the content with more chunks will be ten times more likely to be pulled out of the bag.

Putting a layer of sameness over the top of a group of chunks could ensure only one document is ever picked, with the higher comprehension that a complete overview gives possibly favoring the larger one, while obscuring the individual contents.

There’s no correct answer here, and it takes a lot of manual intervention and iteration to see what works. It also takes a lot of human-only classification, unless you want to train a reinforcement-learning neural net on the best vector-weighting mix to satisfy a corpus of answers.
