Embeddings from multiple providers?

Is it possible to generate compatible embeddings from different providers? For example, using the OpenAI model as the primary but keeping another provider (e.g., AWS Bedrock) on hand in case of an OpenAI outage. Are the embeddings interchangeable or convertible?

You can use other embedding models and combine the correlation results either coherently or non-coherently. But you can’t correlate vectors from two different models directly: the vector spaces aren’t aligned, so the resulting correlations are garbage.

But using multiple models and combining their results is a very good idea: if one or more models have an outage, you still have the other models as backup.

I can offer more details if you want, but that’s the high level answer.

5 Likes

Please, can you give us the details about your idea?

That’s a nice approach to make sure that semantic search is always available.

If I understand correctly, this would involve obtaining and storing embeddings for every chunk of data from multiple independent sources/models.

Exactly! In my use case I will persist each piece of source information and the respective embedding.

In that case you can simply run an alternative semantic search with the corresponding alternative embeddings for a query if the main embedding provider goes down.

Just make sure to set a proper threshold, as it may vary with the embedding provider.
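A minimal sketch of that failover pattern in Python (the provider names, the embed_with() helper and the threshold numbers are placeholders for illustration, not any specific SDK):

import numpy as np

# Per-provider similarity thresholds -- placeholder values; tune them per model,
# since score distributions differ between embedding spaces.
THRESHOLDS = {"provider_a": 0.80, "provider_b": 0.65}

def semantic_search(query, stores, embed_with, providers=("provider_a", "provider_b")):
    """Try each provider in order and fall back to the next one if it fails."""
    for name in providers:
        try:
            q = np.asarray(embed_with(name, query))   # hypothetical embedding call
        except Exception:
            continue                                   # provider down -> try the next one
        vectors, texts = stores[name]                  # store built with this same provider
        scores = vectors @ q                           # dot product == cosine for unit vectors
        best = int(np.argmax(scores))
        if scores[best] >= THRESHOLDS[name]:           # provider-specific threshold
            return texts[best], float(scores[best])
        return None, float(scores[best])               # searched, but nothing above threshold
    raise RuntimeError("all embedding providers are down")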

As previously mentioned by @curt.kennedy, you can’t really get any meaningful results from calculating the distance between embeddings from different models. But it is possible to use different models for different parts of your system; some stuff may be handled by an open source model running locally.

In any case, I’d recommend having a look at the Massive Text Embedding Benchmark (MTEB) Leaderboard.

1 Like

TL;DR: Here is a lot of math describing the coherent combination approach in a fault-tolerant way. Hopefully the notation isn’t too confusing, but it is used to be concise.

Yes, in my implementation, each chunk has N embedding vectors associated with it, and one Bag of Words (BoW) mapping for the keyword leg.

All the semantic legs can be combined coherently, which isn’t often talked about, as I see most people combining non-coherently with RRF or RSF. And there is some nuance if you want to combine “coherently” with the keywords, which I will explain below.

So on just the semantic side with embedding vectors, and N models, here is what you do:

Input text comes in; embed it with the N engines, in parallel, and get out N vectors of varying dimensions. The idea with coherent combination is that you are synthesizing a massive embedding vector and using this virtual high-dimensional vector as your embedding.

For example, suppose you are using 10 embedding models: five (5) of them are of dimension 1024, three (3) are of dimension 512, one (1) is of dimension 3072, and one (1) is of dimension 1536. Here is how you coherently combine these, in a fault-tolerant way, that also synthesizes a vector of dimension 11,264, which contains more information than any individual model. And if some subset of the embedding models goes down, you get a synthesized vector of fewer dimensions, but it is still combined coherently.
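As a rough sketch of that parallel embedding step (the engine callables here are hypothetical wrappers around whatever provider SDKs you actually use):

from concurrent.futures import ThreadPoolExecutor

def embed_all(text, engines):
    """
    engines: dict mapping a model name to a callable that returns that model's
    unit-length embedding vector for the text. Engines that error out simply
    drop out of the result and get handled later by the weight re-normalization.
    """
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = {name: pool.submit(fn, text) for name, fn in engines.items()}
    vectors = {}
    for name, fut in futures.items():
        try:
            vectors[name] = fut.result()
        except Exception:
            pass  # this engine is down for now
    return vectors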

So for some notation, the input chunk X is mapped to 10 different vectors, E_1^X, E_2^X, ..., E_{10}^X.

You then take each of these E_i^X vectors, and correlate them against your target embeddings in your knowledge store. Let \Omega_i represent the collection of knowledge store vectors for embedding model i. Also, to give the final correlation some breathing room, form the set of the top K correlations; say 10 \le K \le 50 as a starting point.

So even if you ultimately only want the top match, you form this set of correlations for each model i, and get:

C_{i,K} = Top_K( \{ E_i^X \cdot y \mid y \in \Omega_i \} )

Each of these should be a number between -1 and 1, assuming each embedding model produces unit vectors, which is standard for embedding models these days, but know your model and pre-scale to unit length if a particular model doesn’t. Also save off the text behind each y here, since this could be used in your RAG prompt. So you will get K chunks of text from your knowledge store for each model.
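In code, that per-model top-K step could look something like this (a sketch, using torch as in the code later in this thread; \Omega_i is assumed to be a matrix with one unit-length knowledge-store vector per row, with the chunk texts stored alongside):

import torch

def top_k_correlations(query_vec, omega_i, chunk_texts, K=20):
    """
    query_vec:   E_i^X, a 1-D unit-length tensor from embedding model i.
    omega_i:     2-D tensor of that model's knowledge-store vectors, one unit-length row per chunk.
    chunk_texts: the chunk text behind each row of omega_i.
    Returns the top-K cosine correlations and the texts behind them.
    """
    correlations = omega_i @ query_vec            # dot product == cosine for unit vectors
    values, indices = torch.topk(correlations, k=min(K, correlations.numel()))
    return values, [chunk_texts[j] for j in indices.tolist()]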

OK, great. Now that you have the top correlations against your knowledge set for each model, you combine these using a weighted average. Your weighted-average coefficients are:

\alpha_i, where \sum_{i=1}^N \alpha_i = 1

You would adjust the values of \alpha to weight certain models over the others, or you could weight them all equally by setting \alpha_i = 1/N.

OK, cool, so now your coherent combination, for each candidate chunk d that shows up in any of the top-K lists, is:

\rho(d) = \sum_{i=1}^N \alpha_i \, C_i(d), where C_i(d) = E_i^X \cdot E_i^d is model i’s correlation against that same candidate chunk d.

You will have up to N*K candidates. So with 10 models and a breathing room factor of K=20, you score up to 200 of these.

You then take the top 5, or top 1, whichever you want to allow into your prompt, as the final downselection.
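A sketch of this coherent scoring and downselection (it assumes you keep every model’s stored embedding per chunk, as described above; the dict layout is just for illustration):

import torch

def coherent_downselect(query_vecs, chunk_vecs, candidates, alphas, top_n=5):
    """
    query_vecs: dict model_name -> E_i^X (1-D unit-length tensor).
    chunk_vecs: dict model_name -> dict chunk_id -> that model's stored embedding E_i^d.
    candidates: chunk_ids pulled from the per-model top-K lists (up to N*K of them).
    alphas:     dict model_name -> weight, summing to 1 over the models that are up.
    Returns the top_n chunk_ids ranked by the coherent score rho(d).
    """
    scores = {}
    for d in candidates:
        scores[d] = sum(alpha * float(query_vecs[name] @ chunk_vecs[name][d])
                        for name, alpha in alphas.items())
    return sorted(scores, key=scores.get, reverse=True)[:top_n]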

Alright … but what if one of the models goes down?

So, this is easy to handle. For the case of one model down, say model 3 is down, you set \alpha_3 = 0, but you have to redistribute this weight across the other weights, so you multiply each correlation by 1/(1-\alpha_3) to restore the weighted sum back to one. If you have two models out, say models 3 and 9, you multiply by 1/(1-\alpha_3-\alpha_9).

So you would multiply the \rho(d) values by the adjustment factor 1/(1 - \sum_{j \in D} \alpha_j), where D \subset \{1, 2, ..., N\} is the set of models that are down. If none are down, this evaluates to 1, and so nothing is adjusted.
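In practice you can equivalently just rescale the surviving weights themselves; a small sketch:

def renormalize(alphas, down):
    """
    alphas: dict model_name -> nominal weight (summing to 1 when everything is up).
    down:   set of model names that are currently unavailable.
    The surviving weights are scaled by 1 / (1 - sum of the lost weights),
    so they sum to 1 again.
    """
    lost = sum(alphas[name] for name in down)
    return {name: w / (1.0 - lost) for name, w in alphas.items() if name not in down}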

So now you can combine coherently across semantics, using multiple models in a fault tolerant way.

But what about combining with Keywords? These do not readily map to -1 to 1.

The traditional approach would be to combine all of your semantic models first, coherently like I am advocating above for the best performance, then non-coherently combine this with the keywords using RRF.

So for RRF, or Reciprocal Rank Fusion, you would get a semantic rankings list, say the ranks are the integers 1, 2, 3, 4, 5, and the same kind of ranking from the keyword list, also 1, 2, 3, 4, 5.

You would then fuse these into one ranking by combining them reciprocally, same as a harmonic sum. So… equation:

RRF(d) = \sum_{r \in R} \frac{1}{c + r(d)}

Here d is the document, or chunk, that you are trying to rank, R is the set of rankings being fused (here the semantic ranking and the keyword ranking), r(d) is d’s rank in a given ranking, so 1, 2, 3, …, and c is a constant; I usually set it to 1, but I see some folks like setting it somewhere near 60.

This is how you would combine your semantics and keywords non-coherently. If you want to reduce the keywords’ importance, set the numerator to somewhere between 0 and 1 for the ranking coming from the keyword sort. Similar to the \alpha_i weighting above, you could have a different weight per ranking, so a generalized weighted RRF is:

RRF_{weighted}(d) = \sum_{i \in R} \frac{\alpha_i}{c + r_i(d)}
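For example, here is a sketch of that weighted RRF; with all weights set to 1 it reduces to plain RRF:

def weighted_rrf(rankings, weights=None, c=1):
    """
    rankings: dict ranker_name -> dict chunk_id -> rank (1 is best),
              e.g. {"semantic": {...}, "keyword": {...}}.
    weights:  dict ranker_name -> numerator weight; a value between 0 and 1 on
              the keyword ranker de-emphasizes keywords.
    c:        the RRF constant (1 here, ~60 in a lot of implementations).
    Returns chunk_ids sorted by fused score, best first.
    """
    weights = weights or {name: 1.0 for name in rankings}
    fused = {}
    for name, ranks in rankings.items():
        for d, r in ranks.items():
            fused[d] = fused.get(d, 0.0) + weights[name] / (c + r)
    return sorted(fused, key=fused.get, reverse=True)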

Also, RSF, mentioned above, is another non-coherent technique, but I think it really only applies to non-coherent combinations across semantic searches. More here on RSF and RRF.

Finally, what about a pseudo-coherent combination of semantics and keywords? Well, I have been thinking about this one recently too. It would involve mapping the keyword correlations through the \tanh function (hyperbolic tangent), which maps anything to an output range of -1 to 1, similar to the embeddings, and then coherently combining this as another pseudo-embedding leg, without resorting to RRF.

Here I would fit your coherent combinations to solve for a bias and scale factor on the keyword correlation strength. I haven’t done this one yet, but the idea is to do some sort of least squares fit to solve for the bias and scale factor so that your keyword correlation strength is in the linear part of \tanh and correlates well to your coherent combinations from your ensemble of embedding models.
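Purely to illustrate the idea, the fit could look something like this (untested sketch; the raw keyword score is assumed to be something like a BM25 strength, and the calibration targets are the coherent semantic scores on a sample of chunks):

import numpy as np

def fit_keyword_mapping(keyword_scores, semantic_scores):
    """
    Solve for scale s and bias b so that tanh(s * keyword_score + b) tracks the
    coherent semantic score on a calibration sample. A plain least squares fit
    in the pre-tanh domain (arctanh of the clipped semantic targets).
    """
    keyword_scores = np.asarray(keyword_scores, dtype=float)
    targets = np.arctanh(np.clip(semantic_scores, -0.999, 0.999))
    A = np.stack([keyword_scores, np.ones_like(keyword_scores)], axis=1)
    (s, b), *_ = np.linalg.lstsq(A, targets, rcond=None)
    return s, b

def keyword_leg(raw_score, s, b):
    """Map a raw keyword correlation strength into (-1, 1), like an embedding leg."""
    return np.tanh(s * raw_score + b)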

3 Likes

I asked GPT to implement your math, I take zero responsibility for its correctness, but I thought you might find it entertaining:

import torch

def combine_embeddings(text, embedding_models, knowledge_stores, alphas, keywords, keyword_rankings, K=20, c=3):
    """
    Combines multiple embedding models and keyword strategies in a fault-tolerant and coherent manner for semantic searches.
    
    Parameters:
        text (str): The input text to be embedded.
        embedding_models (list): A list of embedding models (callables returning a 1-D tensor).
        knowledge_stores (list): A list of tensors representing the knowledge store for each model.
        alphas (list): Weights for each embedding model, summing to 1 (a weight of 0 marks that model as down).
        keywords (list): A list of keywords for non-coherent combination.
        keyword_rankings (dict): A dictionary with keywords as keys and their rankings as values.
        K (int, optional): Number of top correlations kept per model. Defaults to 20.
        c (int, optional): A constant for the Reciprocal Rank Fusion formula. Defaults to 3.
    
    Returns:
        torch.Tensor: The final combined scores from embedding and keyword strategies.
    """
    # Embed the query with every model and keep each model's top-K correlations.
    embeddings = [model(text) for model in embedding_models]
    correlated_scores = []
    for i, embedding in enumerate(embeddings):
        correlations = torch.matmul(knowledge_stores[i], embedding)
        top_k_values, _ = torch.topk(correlations, k=K)
        correlated_scores.append(top_k_values)

    # Re-normalize the weights so that down models (alpha == 0) redistribute their weight.
    alphas_t = torch.tensor(alphas, dtype=torch.float32)
    adjusted_weights = alphas_t / alphas_t.sum()

    # Coherent (weighted) combination of the per-model top-K scores.
    weighted_scores = torch.zeros_like(correlated_scores[0])
    for i, score in enumerate(correlated_scores):
        weighted_scores += adjusted_weights[i] * score

    # Reciprocal-rank scores for the keyword leg, squashed into (-1, 1) with tanh.
    rrf_scores = torch.zeros(len(keywords))
    for i, keyword in enumerate(keywords):
        if keyword in keyword_rankings:
            rrf_scores[i] = 1 / (c + keyword_rankings[keyword])

    keyword_correlations = torch.tanh(rrf_scores)
    final_scores = torch.cat([weighted_scores, keyword_correlations])
    return final_scores
2 Likes

Yeah, nice try. That code definitely has some errors. :rofl:

1 Like

Lmao, I will admit I was a bit hard on the model, and didn’t give it any extra tokens to work with :rofl:

This was the prompt used:

Implement the following in a pytorch function, provide only the full code, no explanation, just go!
### <your entire post> ###

1 Like

Wonderful, Curt. Thank you for your extensive explanation.
And thank you, N2U, for your try.
That problem is intricate and challenging. I think a universal embedding is an approach that may solve it better in the future.

2 Likes

You are welcome. If I were you, I would just create a full set of embeddings from one or two other providers as well. It’s fairly cheap, and you’ll have a complete system to fall back on if one of the endpoints goes offline. :wink:

1 Like

This is a good second choice. The only thing I don’t like about it though is that you are not using the backups most of the time, and so you aren’t getting what you paid for. Granted, the embedding costs are relatively cheap, but the storage costs could be high, so use it or lose it.

I realize implementing what I have described takes some extra work, but for the developer-minded, I think it’s worth it.

One of the biggest challenges, besides intermittent outages in a model, is the model being deprecated. If you are combining multiple models, your system can gracefully absorb this deprecation (a permanent outage) without any additional work. You would then find a replacement model, or just leave an open slot for another model in the future.

So it makes DevOps much easier once you implement this up front.

Also, if you don’t combine models and instead switch to a single fallback model during an outage, your system could behave differently, as the rankings may have a different ordering with the other model. But if you have several models, the likelihood of this re-ranking difference goes down. So your system is more stable if you use multiple models. Think of this as protection via the Central Limit Theorem, as you are essentially convolving multiple models together to create one mega-model.

Another consideration that I didn’t mention is that some models have very short contexts, while others have very long contexts, so you may get a context mismatch. The optimal solution is to chunk at the smallest context among your models as your max context. However, a suitable sub-optimal solution is to truncate anything bigger than a model’s native context, and then, with the weighting scheme, weight that model less on the fly whenever you detect that its context limit was exceeded. You can do this per chunk too … have different weights depending on the exact situation with that chunk. So weights per chunk and per model, if the chunk had to be truncated.
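A sketch of that per-chunk, per-model weighting (the 0.5 penalty factor is arbitrary and would need tuning):

def chunk_weights(alphas, chunk_tokens, context_limits, penalty=0.5):
    """
    alphas:         dict model_name -> nominal weight.
    chunk_tokens:   token count of this particular chunk.
    context_limits: dict model_name -> that model's max embedding context.
    Down-weight any model whose context this chunk overran (and so was truncated for),
    then re-normalize so the per-chunk weights sum to 1 again.
    """
    raw = {name: (w * penalty if chunk_tokens > context_limits[name] else w)
           for name, w in alphas.items()}
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}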

So the system I have outlined above has a lot of flexibility and robustness, even tuning down to each specific chunk.

If you need (or desire) smooth DevOps, a system that changes smoothly if an outage occurs, and can handle all the typical nuances of using multiple models (varying context lengths), you should consider building a fault-tolerant multi-model embedding system, similar to what I have outlined above, for your RAG application.

1 Like

Completely agree, but that’s kinda the thing with backups. Redundancy is not cost effective, because we’re just multiplying the cost with zero actual returns until something breaks.

But oh boy, I was happy that I had backups today when I woke up to a broken hard drive :rofl:

1 Like

Right, redundancy does have a cost.

But my solution is like a RAID array (RAID 10 (1+0)?), where when a hard drive fails, you don’t even notice it because the redundant system automatically kicks in, and there is no downtime, just a red flashing light saying something is wrong and needs to be replaced or looked at.

The alternative is to let the hard drive fail, put in another hard drive, and start the whole thing over. Much more painful. The goal here is to avoid that pain. :rofl:

1 Like

I like the RAID 10 comparison :rofl:

Had one of those for a while, it was amazing, but it does require 4X the number of hard drives to get 2X the performance.

What you’re proposing does sound like RAID 5 or 10, i.e. data striped across multiple drives with redundancy. It’s the solution with all the bells and whistles, and some extra overhead.

Is it actually needed though? I’m not convinced that it’s worth it compared to a RAID 1 setup with failover? :thinking:

1 Like

Just like all the RAID array options and tradeoffs, it all depends on your situation.

So yes, it looks like I am advocating a fancy RAID array, and yours is more basic. :rofl:

The user has to decide which one they want, and the tradeoffs.

For this embedding model redundancy, I am sure we could come up with multiple permutations that map to different levels of “RAID equivalents”.

1 Like