Are embeddings more than just tokenized text?

I’m getting into building RAG using LlamaIndex but want to clarify my basic understanding of embeddings. When chunking and embedding content with text-embedding-ada-002, am I essentially just tokenizing the content, or do the embeddings also encode directionality in vector space that connects concepts semantically?

I’d like to enrich and index my content as much as possible prior to retrieval, so I’m trying to get my head around whether better data = higher-quality embeddings.

Embeddings are vectors that capture the semantics of the chunk of text.

To get this vector, the chunk is fed to the embedding model, where it is first tokenized. The tokens then traverse the model’s layers, almost as if it were going to make a next-token prediction. But instead, the model takes a snapshot of its internals, usually the last layer (an array of numbers), and converts that array into a vector capturing the meaning (it just gets normalized to a unit vector).
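A minimal sketch of that pipeline in Python, assuming the v1-style `openai` client and numpy (the explicit normalization is shown for clarity; ada-002 vectors already come back roughly unit-length):

```python
import numpy as np
from openai import OpenAI  # assumes the v1-style openai client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(chunk: str) -> np.ndarray:
    """Tokenize the chunk, run it through the model, return a unit-length vector."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=chunk)
    vec = np.array(resp.data[0].embedding)
    # normalizing makes dot products equal to cosine similarity
    return vec / np.linalg.norm(vec)
```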

That’s how I think about it mechanically.

But it captures the semantics because the chunk is squeezed through the model, and the model is intent on understanding the content so it can make its next-token prediction. So it has an understanding of the text chunk, because that understanding is required to make a next-token prediction, and that understanding is what ends up encoded in the vector.

This is what you use to derive relations between text chunks … do their vectors point in nearly the same direction in the same embedding model’s space? If they do, the chunks are likely similar in meaning.
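Concretely, “close” is usually measured as cosine similarity, which for unit vectors is just a dot product. A tiny sketch, reusing the hypothetical embed() helper from above:

```python
def similarity(a: str, b: str) -> float:
    # unit vectors, so cosine similarity reduces to a dot product
    return float(np.dot(embed(a), embed(b)))

# chunks expressing the same idea should score noticeably higher:
similarity("The cat sat on the mat.", "A feline rested on the rug.")
similarity("The cat sat on the mat.", "Quarterly revenue grew by 12%.")
```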

As for “better data = better embeddings”: I’d say this is true in general, but I don’t worry too much about it. You could try normalizing the inputs by removing extraneous spaces and whatnot; this will likely align all the inputs to a similar form, which should increase your correlation scores.

Just remember: if you embed the same thing twice, you will get the same vector, to within roundoff. If you add a space in front of the entire chunk, the tokens will be different, and so you get a different vector. This adds noise, unless that leading space is meaningful, which is sometimes the case for code but not for ordinary prose. So normalization will increase the SNR of the correlations without altering the meaning, as long as your normalizations are meaning-agnostic.
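A meaning-agnostic normalization pass might look something like this (just a sketch; skip it for code, where whitespace can be significant):

```python
import re

def normalize(chunk: str) -> str:
    # collapse runs of whitespace and strip leading/trailing space, so
    # cosmetically different copies of the same text tokenize identically
    return re.sub(r"\s+", " ", chunk).strip()

normalize("  The   cat sat\n\non the mat. ")  # -> "The cat sat on the mat."
```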


I’d echo the “But I don’t worry too much about it.”

[image: correlation comparison plot]
This is a test comparing 5 spacing variations of a story against 4 other stories, then the spacing variations against each other, and then against spacing variation 5 applied to story no. 5. Spacing variation 5 is pretty extreme, with 10,000 spaces inserted before the story, somewhere in the middle, and after it.

What you can see from this is that the same spacing pattern slightly drives irrelevant information together (compare the last item of col 4 to the last item of col 10), and the same information apart (col 0).

On the other hand

I would advise you that ADA only has a limited attention span - even though the context window is theoretically around 8k tokens long, it doesn’t have nearly enough attention to handle everything. If it gets overloaded, it has a clear preference for pertinent concepts at the start of the text, and to a lesser degree concepts at the end. There can be a complete blackout in the middle.

Also, it has very little (if any) effective self-attention over long distances, so cross-referencing concepts across longer stretches of text doesn’t seem to work at all.
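If you want to see how much of that window a chunk actually consumes, you can count tokens locally with tiktoken (a sketch; ada-002 uses the cl100k_base encoding):

```python
import tiktoken

enc = tiktoken.encoding_for_model("text-embedding-ada-002")  # cl100k_base

def token_count(chunk: str) -> int:
    return len(enc.encode(chunk))

# keeping chunks well under the limit also keeps the important content
# near the start, where the model pays the most attention
token_count("Some long transcript chunk ...")
```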


Thank you both! @curt.kennedy your mental model is super helpful there. I’m wondering, then, whether the bulk of the work in effective RAG is on the retrieval side, rather than in making the embeddings more semantically rich upfront?

For example, I see that with llamaindex (or maybe spacy-llm) I could run entity extraction, sentiment analysis, etc. My use case is embedding well-structured news podcast transcripts (in JSON), so my gut feeling is that better upfront indexing of places, events, people, etc. should make for better retrieval. Then again, I’m not sure whether the embedding model would do anything to ‘connect’ these entities to their references in the transcripts.

I think you have to first think about your chunking strategy. You want your chunks to contain entire thoughts, not thought fragments. Remember, these chunks are retrieved and then presented to the LLM in the prompt, so they need to make sense on their own.

If your chunks are too small, they likely won’t contain entire thoughts, but your embedding of this small chunk is very precise. If your chunks are too big, they contain lots of thought, and your embedding is less precise because of the varying amount of information in the big chunk.

And so this is where all the fun begins. How do you solve this problem? Or do you just pick a chunk size and percent overlap and call it good? :rofl: The latter is what most people do, because that is what a lot of the automated chunking tools do.
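For reference, the pick-a-size-and-overlap approach is only a few lines (a character-based sketch; real chunkers usually measure in tokens and try to break on sentence boundaries):

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Naive fixed-size chunking with overlap, measured in characters."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```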

But the rabbit hole just gets deeper. Why not chunk at various levels? Then, with some awareness of how these chunks are related to each other, you can formulate a more optimal chunk: find that a large number of the smaller matching chunks sit within a single large chunk, and then retrieve that corresponding larger chunk. So you get the best of both worlds: precise embeddings and more cohesive thoughts per chunk. You can see where this is headed … it’s an optimization problem.
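One common way to get that best of both worlds is small-to-big retrieval: embed and score the small chunks, but return their parent chunks. A rough sketch, assuming an embed() helper like the one above and precomputed unit vectors for the small chunks:

```python
import numpy as np

def small_to_big_retrieve(query: str,
                          small_vecs: np.ndarray,     # one unit vector per small chunk
                          parent_of: list[int],       # small-chunk index -> parent index
                          parent_chunks: list[str],
                          top_k: int = 5) -> list[str]:
    """Score against the precise small-chunk embeddings, return the larger parent chunks."""
    q = embed(query)                         # hypothetical helper from earlier
    scores = small_vecs @ q                  # cosine similarity (unit vectors)
    best = np.argsort(scores)[::-1][:top_k]
    # deduplicate parents while preserving rank order
    seen, parents = set(), []
    for i in best:
        p = parent_of[i]
        if p not in seen:
            seen.add(p)
            parents.append(parent_chunks[p])
    return parents
```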

You could run sentiment, entity extraction too. You just need to figure out how to use this from a search perspective.

One thing related to this is keyword search. A simple one that I have implemented is a log-normalized TF-IDF. This works best for your larger chunks. What you do is break each chunk down into a bag of lower-cased words and essentially count each word in the chunk. You also count how many total chunks each word occurs in. You can use this information to find relevant keywords specific to your data, since the most distinctive keywords are the ones with the fewest repeats and the lowest presence across all the chunks.
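A bare-bones version of that log-normalized TF-IDF scoring might look like this (a sketch; no stemming or stop-word handling):

```python
import math
import re
from collections import Counter

def tokenize(chunk: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", chunk.lower())

def build_index(chunks: list[str]) -> list[dict]:
    bags = [Counter(tokenize(c)) for c in chunks]     # term counts per chunk
    df = Counter(w for bag in bags for w in bag)      # how many chunks each word appears in
    n = len(chunks)
    # log-normalized term frequency times inverse document frequency
    return [{w: (1 + math.log(tf)) * math.log(n / df[w]) for w, tf in bag.items()}
            for bag in bags]

def keyword_scores(query: str, index: list[dict]) -> list[float]:
    terms = set(tokenize(query))
    return [sum(weights.get(t, 0.0) for t in terms) for weights in index]
```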

This keyword search leg can be combined with your embeddings, by ranking each stream separately and combining them with hybrid search fusion algorithms such as RRF or RSF (ref).

RRF is easier to implement, but RSF may be more precise, especially when there are multiple ambiguous correlations.
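For example, Reciprocal Rank Fusion just sums 1/(k + rank) over the ranked lists from each search leg (a sketch; k = 60 is the commonly used constant):

```python
def rrf_fuse(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Combine several ranked lists of chunk ids via Reciprocal Rank Fusion."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# usage: fuse the embedding-ranked chunk ids with the keyword-ranked chunk ids
# fused = rrf_fuse([embedding_ranked_ids, keyword_ranked_ids])
```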

So, that’s pretty much it. There are a lot of variations on this theme, and it’s certainly a fun and challenging problem to optimize. But, as you alluded to, it’s definitely more on the retrieval side IMO.