Embedding tokens vs embedding strings?

From reading various docs, the GPT LLMs were trained on tokens (sub-words), with an initial embedding layer as the first layer of the internal neural network. That is, there’s an embedding vector for each token.

But in a RAG architecture, the OpenAI embeddings API returns an embedding for a whole string, i.e., many tokens at once.

I must be missing something. How can you use embedding vector similarity for strings when the engine is based on tokens?

The embedding model takes in the text and produces a single vector for the entire chunk. You can embed over 8,000 tokens at a time with ada-002 and get one vector. With open-source models, you can typically embed 512 tokens at a time to get the vector.

Once you have these vectors, you correlate them to find other similar vectors, whose underlying text then has similar meaning. So you use the embeddings to find other similar chunks of text. Essentially, it’s a search engine.
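As a rough sketch of that search-engine idea (this assumes the current openai Python client with an API key in the environment; the chunk texts and query below are just invented placeholders):

```python
from openai import OpenAI
import numpy as np

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> np.ndarray:
    """Return the embedding vector for one chunk of text."""
    response = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(response.data[0].embedding)

# Invented knowledge chunks standing in for your real documents.
chunks = [
    "Tokens are the internal representation of language in the model.",
    "RAG retrieves relevant text chunks and adds them to the prompt.",
]
chunk_vectors = [embed(c) for c in chunks]

query_vector = embed("How does retrieval-augmented generation work?")

# ada-002 vectors come back unit-length, so a dot product acts as cosine similarity.
scores = [float(np.dot(query_vector, v)) for v in chunk_vectors]
best = int(np.argmax(scores))
print(f"Best match ({scores[best]:.3f}): {chunks[best]}")
```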

The model internals use embeddings too, partly to limit the size of the matrices inside the model. Otherwise the matrix sizes would be too unwieldy and hard to train.


Thanks for the response, Curt. Yes, I understand that about the embeddings service, so why tokenize strings in the first place?


Tokens are the internal representation of language in the AI model.

Tokenization is used in language model AI for a few reasons:

  • it is a form of compression
  • it is a form of semantic representation
  • it is a type of efficiency

A word like " documentation" is fourteen 8-bit memory locations, but with tokenization, it can be reduced to one 32-bit memory location, along with enough space in the 32-bit domain to encode another billion unique tokens.

An entire word as a token can then also have meaning developed behind it, without the model needing to assemble 14 characters into a string of meaning.

Other word-embedding schemes can have over a million unique tokens, one trained for every word possibility known. They can map a similarity directly between any two words.

Transformer language models instead use byte-pair encoding, a relatively efficient compression method (except for the possibly unbounded buffer length needed) that produces lower total token counts. Statistical analysis finds the best token representations, arriving at a vocabulary of near 100k tokens. This also gives single tokens to long non-word sequences that are common, while arbitrary sequences can still be encoded by byte or multi-byte tokens. Roughly 50% are whole words, and 50% are words made of parts that still carry semantic meaning.
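If you want to see this in practice, the tiktoken package (OpenAI’s public tokenizer library, assuming you have it installed) lets you inspect the byte-pair encoding directly; the exact token counts depend on the vocabulary you load:

```python
import tiktoken

# cl100k_base is the ~100k-token vocabulary used by recent GPT models.
enc = tiktoken.get_encoding("cl100k_base")

for text in [" documentation", "Hello world", "🙂"]:
    ids = enc.encode(text)
    print(f"{text!r}: {len(text)} characters -> {len(ids)} token(s) {ids}")
```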

You don’t have to think about it except when you measure before you send. Since the limit is about 8,000 tokens rather than characters, you might be able to embed 30,000 characters of English text - or drastically fewer if using multi-byte Unicode such as emoji or Chinese.

The tokens are integers that are then transformed inside the model to floating point vectors.
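As a toy illustration of that lookup (the sizes, values, and token IDs below are invented; real models learn the table during training):

```python
import numpy as np

vocab_size, embedding_dim = 100_000, 8      # real models use far more dimensions
rng = np.random.default_rng(42)
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

token_ids = [464, 3290, 1234]               # hypothetical integer token IDs
token_vectors = embedding_table[token_ids]  # one row looked up per token ID
print(token_vectors.shape)                  # (3, 8): one floating point vector per token
```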

Tokens are the atomic units of the LLM. So you would think, why not just create a token out of each letter or character? You can, but while this gives a small set of tokens - a small vocabulary with really no OOV (out-of-vocabulary) words - each character contains little to no information.

So training on characters leads to a system that is information-starved on a per-token basis.

On the other hand, you could have each word be its own token. Here there is maximum information per token, but given all the misspellings, you end up with a massive set of tokens, too much information per word, and lots of OOV words.

So the middle ground is sub-word or small-word tokens. Here you have a relatively small set of tokens, fewer OOV words, and the right balance of information per token.

GPT has gone from a 50k-token library to now a 100k-token library. More tokens is generally better, if you can handle it, so as time goes on we may get massive million-token tokenizers, but we aren’t there yet.

So why not just go straight to a vector? Well, you could map each word to its own vector (I have done this), and what happens is you get into the “large token” scenario, where there is too much information coming in for the network to handle. So you back off and go to sub-word, similar to the tokenizers today.

You are trying to make a decision in the network with limited computing resources. So the tokens transform into vectors that are now in a continuous space, and in this space, close vectors have close meaning. So you are globbing meaning onto localized chunks of the space, instead of each different thing having a dramatically different internal representation. You need many neurons to make sense of this, so you have to localize and linearize things to get it to work, and get down to a computable number of neurons.

Anyway, hope my rambling helps :rofl:


Thank you for that response. Highly appreciated.

So when I send a large prompt (e.g., assistant/user history + private RAG docs + template + CoT + et al.), is the LLM determining the embedding vector for the entire large prompt and finding the most probabilistic sequence to complete the prompt, or is it sequentially completing the prompt based on the starting text sequence?

The AI processes the input tokens as a whole, updating hidden states from the embedding vectors. The AI then uses masked attention layers and such to determine a probability across its token dictionary. This is then sampled to determine the token to generate. And then it is all done again, including the generated token, to produce the next token of the completion. So there is no single embedding vector computed for the whole prompt that then picks a finished answer; the prompt is read as a whole, and the completion is produced sequentially, one sampled token at a time.
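Here is a toy sketch of that sample-and-append loop, with a random stand-in for the model’s forward pass (nothing here is OpenAI code; it only illustrates the mechanics described above):

```python
import numpy as np

VOCAB_SIZE = 100_000                       # roughly a cl100k-style token dictionary
rng = np.random.default_rng(0)

def fake_logits(token_ids: list[int]) -> np.ndarray:
    """Stand-in for the transformer: one score per token in the dictionary."""
    return rng.normal(size=VOCAB_SIZE)

def sample_next(token_ids: list[int], temperature: float = 1.0) -> int:
    logits = fake_logits(token_ids) / temperature
    probs = np.exp(logits - logits.max())  # softmax across the whole dictionary
    probs /= probs.sum()
    return int(rng.choice(VOCAB_SIZE, p=probs))

tokens = [9906, 1917]                      # hypothetical token IDs for the prompt
for _ in range(5):                         # generate five completion tokens
    tokens.append(sample_next(tokens))     # each new token is fed back in
print(tokens)
```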

Embeddings, in contrast, take an internal state that has been assembled by “reading” the input against the model’s vast learning, and produce a vector in a way that has been tuned to highlight semantic matching (OpenAI’s methods are proprietary, and there’s no paper discussing the work they do to make the embeddings model either specialized or perform well across many possible uses).

Hi, I am still not quite sure I understood correctly - will I get an embedding vector per token or per sequence? I have tried using text-embedding-ada-002… but could not figure it out.

You will get an array of floating point numbers (right now 1,536 of them) for every string. The string could be one letter or a very long set of paragraphs.
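For example, something like this (assuming the current openai Python client) returns one array per request, whatever the input length:

```python
from openai import OpenAI

client = OpenAI()

for text in ["a", "A much longer passage about embeddings. " * 50]:
    response = client.embeddings.create(model="text-embedding-ada-002", input=text)
    vector = response.data[0].embedding
    print(len(text), "characters ->", len(vector), "dimensions")  # 1536 for ada-002
```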

Thanks, I tried getting an embedding for one string and for many; I saw that both resulted in an array of that length. Yet, is this array actually a list of vectors? I had the understanding that LLMs can only process vectors.

Current OpenAI embeddings models allow very large amounts of text to be evaluated at once for one result, roughly 4,000 words of input. You might have smaller documents you are embedding, like a help knowledge base or even a user’s question, and you may also want to chunk the information into small pieces.

An embeddings model returns a vector, i.e., a rank-1 tensor. It is an array of numbers, the count of which is called its dimensions. Imagine this, with a few thousand numbers:

[0.04134851, -0.02646899, -0.01640780, -0.03345239, ...]

That’s from an input “What is an OpenAI GPT useful for?”

That is certainly of little use by itself, but it can be used for comparing against other vectors, to obtain a semantic similarity score. Compare it against all your knowledge vectors, and you can find the best match.

The comparison is done locally, with simple dot-product math.
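A minimal version of that local math, using numpy and made-up low-dimensional stand-ins for real 1,536-dimensional vectors:

```python
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional stand-ins for two real embedding vectors.
question  = [0.041, -0.026, -0.016, -0.033]
knowledge = [0.040, -0.020, -0.010, -0.030]
print(cosine_similarity(question, knowledge))  # closer to 1.0 means more similar
```

Since OpenAI embedding vectors come back at unit length, a plain dot product gives the same ranking; the explicit normalization above just makes that assumption harmless.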


To clarify: that array is a vector (in a high-dimensional space).


Well… internally, LLMs process embedding vectors. Let’s take a step back: LLMs are usually implemented as a neural network. This neural network has an input layer, an output layer, and many, many layers in the middle. The layers in the middle are referred to as “hidden layers”.

It’s computationally expensive to handle the hidden layers, so to offset some of that expense, the notion of an embedding vector that represents what those middle layers compute was created. The OpenAI embedding service currently returns a 1-D array (i.e., a “vector”) of floating point numbers of size 1536.

You, as a developer, send text to the LLM, not the embedding vector. Typically when you want to use embedding vectors, you are building some sort of RAG application where you store embedding vectors and the strings they represent (and other stuff). There are plenty of posts on RAG architecture here that you should follow. Hope this helps.
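A rough end-to-end sketch of that flow, under the same assumptions as the earlier snippets (openai Python client; the stored texts, question, and chat model name are placeholders, not recommendations):

```python
from openai import OpenAI
import numpy as np

client = OpenAI()

def embed(text: str) -> np.ndarray:
    r = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(r.data[0].embedding)

# Pretend knowledge store: each entry keeps the original string and its vector.
store = [{"text": t, "vector": embed(t)} for t in [
    "Our API rate limit is 60 requests per minute.",
    "Embeddings are returned as one 1536-dimensional vector per input string.",
]]

question = "How many requests per minute can I make?"
q_vec = embed(question)
best = max(store, key=lambda item: float(np.dot(q_vec, item["vector"])))

# The retrieved *text* (not the vector) is what goes into the prompt.
completion = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; any chat model works here
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context: {best['text']}\n\nQuestion: {question}"},
    ],
)
print(completion.choices[0].message.content)
```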