Mixing embedding services?

My current RAG system uses the OpenAI embedding service along with the chat completion service. I’ve read that for a RAG system you can use a different embedding service than the LLM’s, but you must not mix different embeddings in the same vector store. This makes sense.

But isn’t the size of the embedding vector a critical component of the internal neural network? It’s the “height” of the matrix (and the context window is the “width”), correct? So if an embedding service returns a vector of a different size (especially a larger one), how could that possibly work?


All correct so far

Different in size to what?

You cannot mix sizes or models in an embedding retrieval process.

If you change either one you must create fresh embeddings for your entire corpus.

Different length. The OpenAI embedding service returns a vector of size 1,536. I am aware you should not mix embedding models in an embedding store. My question is: if the OpenAI LLMs work internally with token embeddings of size 1,536, how is it possible that I could use a different embedding service (one whose embedding vector is a different length) with an OpenAI LLM? I would expect OpenAI’s internal neural network matrix manipulation to need to know the embedding vector size a priori.

These things are independent.

You could use an OpenAI embeddings service but use Gemini for prompt completion.

If you are using a vector-database retrieval search system, its output is not vectors but the text of the top matches. The automatic injection you place into the prompt is text, so embeddings-based search results can be language that any AI can understand about, say, the products in your database, or they can be used in any other type of search product that doesn’t involve further AI, like searching this forum for the most similar questions.

If the question is “can I embed all my document chunks with 1536-dimension text-embedding-3-small and then compare that with the embeddings result I get from 1536-dimension text-embedding-ada-002 (or any other embeddings AI)?”, the answer is a huge NO.

The reason for “No” is that every model is trained differently and learns different semantic aspects in its layers, which are expressed in individual dimension activations. As a simplified depiction: dimension 5 of one model might activate strongly for text involving birds, while another model encodes “bird-ness” elsewhere and its dimension 5 is strong for political topics.
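Here is a minimal sketch of why cross-model comparison breaks down. The vectors below are random stand-ins for the output of two hypothetical embedding models with different dimensions, not from any real API:

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two vectors of the same length."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Random stand-ins for "the same text" embedded by two different models.
vec_model_a = np.random.rand(1536)  # e.g. a 1,536-dimension model
vec_model_b = np.random.rand(768)   # e.g. a 768-dimension model

try:
    cosine(vec_model_a, vec_model_b)
except ValueError as err:
    print("Different lengths cannot even be compared:", err)

# Even if both models returned 1,536 dimensions, dimension i means something
# different in each model, so this score is computable but semantically
# meaningless for retrieval.
vec_model_b_same_len = np.random.rand(1536)
print("Numerically valid but meaningless:", cosine(vec_model_a, vec_model_b_same_len))
```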

Yes, I know that. My question is: “How does that actually work internally if the vectors are different lengths?”


Yes, I understand you should not mix embeddings in a vector store; that is clearly a no-go. And I understand you pass strings in your context to the LLM. But if an LLM’s neural net(s) are internally trained using a specific embedding algorithm, how is it possible that I can use a different embedding in my vector store and still have the LLM give me reasonable responses? Or do all embeddings behave similarly enough?

If you actually did store vectors of different lengths in the same table, this would be a surprise; at the very least I would expect an error message at some point in the process of retrieving the closest matches.

Now, for your other question: it’s generally best practice not to mix different embedding models, unless you embed the same data twice for performance comparisons or as a backup in case one of the two models is not available.

If you wonder why the results between the two models do somewhat correlate (and I am assuming that you are querying two separate sets of embeddings from different models), then I would take it as a hint that both models are somewhat equally capable and similar.

From a practical point of view, if there were two different embedding models and the resulting vectors for a large sample of texts correlated very strongly, then I would consider using them interchangeably and save the cost of embedding twice. But I am not aware of any model combination that would allow for such an approach in production.
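One rough way to check such a correlation, sketched below: since the raw vectors from two models cannot be compared directly, you compare the rankings of pairwise similarities each model produces over the same sample corpus. The vectors here are random stand-ins for two hypothetical models:

```python
import numpy as np
from scipy.stats import spearmanr

def pairwise_cosines(vectors: np.ndarray) -> np.ndarray:
    """Upper-triangle cosine similarities between all pairs of rows."""
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = normed @ normed.T
    upper = np.triu_indices(len(vectors), k=1)
    return sims[upper]

# Random stand-ins for embedding the same sample corpus with two different
# models (which may have different dimensions, e.g. 1,536 vs 768).
corpus_size = 200  # use a large, representative sample in practice
vectors_model_a = np.random.rand(corpus_size, 1536)
vectors_model_b = np.random.rand(corpus_size, 768)

# The raw vectors cannot be compared across models, but the rankings of
# pairwise similarities can: a high rank correlation would suggest both
# models order "what is similar to what" in roughly the same way.
rho, _ = spearmanr(pairwise_cosines(vectors_model_a), pairwise_cosines(vectors_model_b))
print(f"Spearman correlation of pairwise similarities: {rho:.3f}")
```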

I clearly am not explaining my question too well.
I’ve had my RAG system now for almost a year. I understand the RAG architecture and VDB/DB concepts.

Another attempt at my question… If the LLM uses a specific embedding algorithm/service that uses an embedding vector of a specific size, how is it possible that my vector store can use a different embedding algorithm/service (which might use embedding vectors of a different length) and get reasonable answers from the LLM?

If my VDB (and my user prompt) use embedding-service-1, how is it possible to get reasonable answers if the LLM uses embedding-service-2 internally? Everything I’ve read said this would work, but how is it possible when the weights of the internal neural nets of the LLM have been trained on a different set of vectors?


I think you are confused about the process.

The two stages are 100% independent, except for the fact that you are presumably handing off results from the local semantic search and adding them to the prompt you pass to the Completions LLM.

The Completions LLM doesn’t care what you did locally, or whether you used another API by the same provider to get your embeddings, let alone whether those embeddings were created using the same model as the Completions LLM.

The Completions LLM call has nothing to do with your embedding scheme; it is completely separate.

So you could, for example, manage the local vector representation of your corpus using the OpenAI embeddings API but, if you want, call an Anthropic model to do your Chat Completions - the latter LLM won’t give a darn, so long as you include the search results (in text) from your local search.
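A minimal sketch of that split, using the official openai and anthropic Python clients. The model names are placeholders, and retrieve_top_chunks is a hypothetical helper that does the similarity search over your own vector store and returns plain text:

```python
from openai import OpenAI
from anthropic import Anthropic

openai_client = OpenAI()        # used only for embeddings
anthropic_client = Anthropic()  # used only for chat completion

def embed(text: str) -> list[float]:
    # Embed the query with the SAME model that was used to build the vector store.
    resp = openai_client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

def answer(question: str) -> str:
    query_vec = embed(question)
    # retrieve_top_chunks is a hypothetical helper: a similarity search over
    # your own vector store that returns plain-text chunks. No vectors are
    # passed beyond this point.
    chunks = retrieve_top_chunks(query_vec, k=3)
    prompt = (
        "Answer using this context:\n"
        + "\n---\n".join(chunks)
        + f"\n\nQuestion: {question}"
    )
    msg = anthropic_client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model name
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}],
    )
    # The completion model only ever sees text; it never sees your embeddings.
    return msg.content[0].text
```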

Yes, I understand this…

How and why does this work in a reasonable fashion?

The LLM is based on a neural network (or collection of NNs). That NN architecture uses token embeddings internally to train the model. Those token embeddings are created using a specific embedding algorithm. The embeddings I have in my VDB may have been created using a different embedding algorithm. I use that embedding service to retrieve text that I put into my prompt context (which I have been doing for a year btw).

I would like to understand how and why this works in a reasonable fashion. Are we all assuming all text embedding models behave in a similar way or at least similar-enough way for a RAG system?

All that internal operation of an AI language model that chats with you is abstracted away and cannot be accessed. You supply text as input, and text comes out.


Because you are extending the prompt with information that will help the bot to answer the query.

That doesn’t need to come from an embeddings process; you could have got that augmentation from a keyword lookup.
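For example, a toy keyword lookup could feed the very same prompt template. Both functions below are illustrative sketches, not from any particular library:

```python
def keyword_lookup(query: str, documents: list[str], k: int = 3) -> list[str]:
    """Toy retrieval: rank documents by how many query words they contain."""
    words = set(query.lower().split())
    ranked = sorted(documents, key=lambda d: -len(words & set(d.lower().split())))
    return ranked[:k]

def augment_prompt(query: str, context_chunks: list[str]) -> str:
    # The completion model only ever receives this text, regardless of whether
    # the chunks came from an embeddings search or a keyword lookup.
    return "Context:\n" + "\n---\n".join(context_chunks) + f"\n\nQuestion: {query}"
```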


Different types of hammers are used by different types of craftsmen for different purposes but all hammers are essentially the same.

Of course… for a proprietary system like OpenAI, that’s clearly true. But I want to know why and how it works so I can teach it.

The AI models used for semantic embedding and those for language inference that use embeddings against a hidden state are fundamentally trained and employed differently, and the two applications do not meet.

To get to a point of understanding where you can pose questions that make sense, you might want to enroll in the Stanford spring semester “Transformers AI” course or similar to understand how natural language processing AI is built.

With all due respect, the question makes perfect sense to me.
I understand the basics of transformers and I’m well-versed in classical ML.

My question relates to how an embedding service that uses a completely different embedding algorithm can be reliably used with an LLM whose internal matrices were trained using different token parsing and parameter manipulation algorithms. The transformer architecture applies weights within the neural net(s) in a powerful manner, but the fundamental infrastructure is matrix manipulation. That internal matrix manipulation needs to be aware of the dimensions of the matrix, which seem to be highly correlated with the embedding.

So if you use a different embedding service, built with a different semantic encoding that physically uses different dimensions from the LLM, how can that possibly work optimally? Yes, an embedding service that you use for your VDB can retrieve similar text context for the LLM, but wouldn’t far better results occur if you used the same embedding service that the LLM is familiar with? I have read in several places that you can use a different embedding service for your VDB and initial prompt (clearly, you should not mix embeddings in the same VDB, or the same table/collection in the VDB).

If the LLM had some sort of embeddings plugin architecture where you could match the embedding’s semantic algorithm, then it would make sense, but I don’t think that’s currently available.

My original post was prob not phrased well.

It “works” because the only interface between an embeddings-based semantic-search vector database of natural language text and a natural-language completion AI model (which has embeddings as part of its internal technology) is natural-language text.

Embedding vectors are not used in augmented generation to affect the model by changing internal weightings. Embeddings are used only for comparison against other embedding vectors, to discover useful text in a knowledge base that is relevant to the present task and user input; that text is then placed into the language context window of the prompt.
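As a sketch of that comparison step, here is a toy in-memory store; the names and structure are illustrative only:

```python
import numpy as np

# Toy in-memory knowledge base of (vector, text) pairs, all produced by the
# SAME embedding model that will also embed the user query.
store: list[tuple[np.ndarray, str]] = []

def add_chunk(vector: list[float], text: str) -> None:
    v = np.asarray(vector)
    store.append((v / np.linalg.norm(v), text))

def top_k_texts(query_vector: list[float], k: int = 3) -> list[str]:
    """Compare the query vector against stored vectors; return only text."""
    q = np.asarray(query_vector)
    q = q / np.linalg.norm(q)
    ranked = sorted(store, key=lambda item: -float(np.dot(q, item[0])))
    # Whatever model produced the vectors, they stop here: only these text
    # strings are handed onward to the prompt of the completion model.
    return [text for _, text in ranked[:k]]
```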


Emphasis mine on “optimally”.

Idea: can you show us an example where it does work optimally? I would be highly interested.

Good point. “Optimally” was not a good word for me to choose, given we are talking about a non-deterministic technology. I was equating ‘optimally’ with the case where the dimensions of the LLM’s internal neural network match the dimensions and algorithm of the embedding service.