Transitioning to the new embeddings models from ada

Hi,

We currently use Pinecone as our vector db where we have been storing vectors generated by ada-002 for the past year for use in our product.

If we want to transition to use the new v3 embeddings models - what’s the smoothest way to handle this from a product perspective?

I don’t want to have to re-embed everything, as there are lots of unique IDs interlinked throughout our product with the existing embeddings. Given the number of companies out there in similar positions, I’m curious what the best solution is for making the transition. Is it to use both in parallel somehow? Can the 1536-dimension vectors from v3 small be used in the same index as the 1536-dimension vectors generated by ada?

Thank you,

A


There may be some mathematical solution to this in time, but there is currently no method for “upgrading” an ada-002 embedding vector to a new model that I am aware of. You may be able to recreate the embeddings from the associated text stored with each embedding in your Pinecone DB; presumably the entirety of the chunked text exists in there and could be re-embedded.

There are only two solutions for this:

  1. Re-embed your entire content
  2. Keep a conditional in your code that knows which model each piece of knowledgebase content was embedded with, and route each request to the matching model and index (sketched below)
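
A minimal sketch of option 2, assuming hypothetical index names and the OpenAI Python SDK; the invariant is that a query must be embedded with the same model that produced the vectors in the index being searched:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical model/index pairs -- adjust to your own setup.
OLD_MODEL, OLD_INDEX = "text-embedding-ada-002", "product-ada-002"
NEW_MODEL, NEW_INDEX = "text-embedding-3-small", "product-3-small"

def embed(text: str, model: str) -> list[float]:
    """Embed a query with whichever model matches the target index."""
    return client.embeddings.create(model=model, input=text).data[0].embedding

def query_vector(text: str, content_is_migrated: bool) -> tuple[list[float], str]:
    """Route the query to the matching model + index pair; never mix them."""
    if content_is_migrated:
        return embed(text, NEW_MODEL), NEW_INDEX
    return embed(text, OLD_MODEL), OLD_INDEX
```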

No … you can’t combine ada-002 with 3-large or 3-small. I verified this by comparing the 1536-dimension output of 3-small against ada-002, and also by truncating and rescaling 3-large down to 1536 dimensions; neither is comparable with ada-002.

So these models just aren’t compatible.
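
If you want to check this yourself, a quick sanity test might look like the following (a sketch only, using the OpenAI Python SDK; the example sentence is arbitrary):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str, model: str) -> np.ndarray:
    return np.array(client.embeddings.create(model=model, input=text).data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

text = "The quick brown fox jumps over the lazy dog."
same_model = cosine(embed(text, "text-embedding-3-small"),
                    embed(text, "text-embedding-3-small"))   # ~1.0
# Both vectors are 1536-dim, but they live in different spaces, so the
# cross-model similarity is essentially meaningless.
cross_model = cosine(embed(text, "text-embedding-ada-002"),
                     embed(text, "text-embedding-3-small"))
print(same_model, cross_model)
```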

You would have to re-vector, re-embed, etc. And figure out how to transition between the new database and the old one.

One way to do this is to run everything on the old model, while you build the new one in the background. Once the new one is built, switch your operations over to this other DB and new embedding model for new incoming retrievals and queries.
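
A rough sketch of that background rebuild, assuming the OpenAI and Pinecone Python clients and hypothetical index names; because the same record IDs are reused, the IDs interlinked through the product stay valid after the cutover:

```python
from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()
pc = Pinecone(api_key="YOUR_PINECONE_KEY")
new_index = pc.Index("product-3-small")  # old index keeps serving traffic meanwhile

def backfill(records: list[dict], batch_size: int = 100) -> None:
    """records: [{"id": ..., "text": ..., "metadata": {...}}, ...]"""
    for i in range(0, len(records), batch_size):
        batch = records[i : i + batch_size]
        # Re-embed the original chunk text with the new model.
        resp = client.embeddings.create(
            model="text-embedding-3-small",
            input=[r["text"] for r in batch],
        )
        # Upsert under the same IDs into the new index.
        new_index.upsert(vectors=[
            (r["id"], d.embedding, r.get("metadata", {}))
            for r, d in zip(batch, resp.data)
        ])
```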

I don’t know if there is a way in Pinecone to store different models’ vectors in the same database. In my case, not using Pinecone, I would keep all the raw data and vectors in a NoSQL-type database, extract them at “build time” while being mindful of the model they came from, and create the memory structures from only the latest model, for compatibility reasons.
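
A sketch of that raw-store layout, with made-up field names: the source text is kept once, each model’s vector sits alongside it, and the build step only pulls vectors from the model it is targeting:

```python
# Illustrative record shape; field names are hypothetical.
record = {
    "id": "doc-00042#chunk-3",
    "text": "original chunk text goes here",
    "vectors": {
        "text-embedding-ada-002": [0.01, ...],   # 1536-dim, legacy
        "text-embedding-3-large": [0.02, ...],   # 3072-dim, current
    },
}

def vectors_for_build(records, model="text-embedding-3-large"):
    """At build time, yield only the vectors produced by the target model."""
    for r in records:
        if model in r["vectors"]:
            yield r["id"], r["vectors"][model]
```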


I am curious if anyone has any thoughts on how/if these new embed models will improve cosine similarity results against their embeddings.

Will 3072 vector dimensions be a huge benefit over 1536 in terms of accuracy of results?

Results are on MTEB now. I’m a bit disappointed. Other models with much smaller dimensionality behave quite a bit better.

Anyway, any type of improvement is nice of course but the benefits here seem very small.


Based on MTEB scores, the 3072 dimension is competitive with the SOTA models. The exact ranking depends on what you are doing.

For example, RAG is useful to me. So here are the MTEB rankings for 3-large @ 3072.

3-large @ 3072 MTEB rankings for RAG:

4th for Retrieval @ 55.4 score (top is e5-mistral-7b-instruct @ 56.6 score)

Will you notice a difference in a score of 56.6 vs. 55.4? Not sure.

MTEB Retrieval uses BEIR :beers: :beer:

But what I do know is to get into the current “SOTA club” you need to go with 3-large at 3072 dimensions.

When there, you can relax and drink a beer (as pictured in the BEIR paper).

:rofl:

Sadly, 3-small ranks 17th at 1536 dimensions, so it’s not getting off on the right foot.

For reference ada-002 ranks 27th, so it’s already antiquated, and left for dead. :zombie: :man_zombie: :woman_zombie:

My plan is to use 3-large at 3072, and rapidly synthesize the lower dimensions, if needed for speed, on my own as discussed over here.
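
For the “synthesize the lower dimensions on my own” part, a minimal sketch is to truncate the stored 3072-dim vector and re-normalize. This only makes sense for the v3 models, whose leading dimensions appear to carry most of the information (the API’s `dimensions` parameter does the same kind of shortening server-side):

```python
import numpy as np

def shorten(vec_3072: list[float], dims: int) -> np.ndarray:
    """Truncate a text-embedding-3-large vector and re-normalize to unit length."""
    v = np.asarray(vec_3072[:dims], dtype=float)
    return v / np.linalg.norm(v)

# e.g. derive a fast 256-dim vector from a stored 3072-dim one:
# fast_vec = shorten(stored_vector, 256)
```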

The sad thing about embedding models is that they are fixed in time, and as MTEB rankings come out every day, your favorite model inevitably starts dropping in rankings.

But you can always use multiple embedding models at once and fuse them with RSF or RRF. Maybe shift the hybrid weightings over time, de-emphasizing sunsetting models and emphasizing recent performers.

The tricky part of running different models at once is context length, which varies all over the board. And each model provider has different latencies. So there are many other considerations to factor in here.

But in theory, you could do parallel API calls to, say, 5 models and fuse all the rankings into one.
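
Plain reciprocal rank fusion is simple enough to sketch; a per-model weight is one way to de-emphasize a sunsetting model over time (model names below are placeholders):

```python
from collections import defaultdict

def rrf_fuse(rankings: dict[str, list[str]],
             weights: dict[str, float] | None = None,
             k: int = 60) -> list[str]:
    """Fuse ranked doc-ID lists from several embedding models with RRF.

    `rankings` maps model name -> ranked list of IDs; k=60 is the usual constant.
    """
    scores: dict[str, float] = defaultdict(float)
    for model, ranked_ids in rankings.items():
        w = (weights or {}).get(model, 1.0)
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. results retrieved in parallel from several models:
# fused = rrf_fuse({"3-large": ids_a, "e5-mistral": ids_b, "ada-002": ids_c},
#                  weights={"ada-002": 0.5})
```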

The one good thing about this approach of model diversity is you get massive uptime, because you don’t rely on just one model.

Also, as models sunset, you can focus on getting the new ones online, while your model clusters are all working in parallel giving their weighted inputs into your RAG.

So you have a lot more slack as you transition to new models over time, as they are continuously blended and re-weighted.

You could even shut down ada-002, if you coded it right, with your other models taking up the slack while you get a replacement established and fused into the cluster.

This is what I consider good EmbedOps, but honestly, I’m not always good at it myself, because it takes work to get all these things spun up and to keep up with the latest models.

Lots of plates spinning in the air. Cognitive load :sweat_smile: So not my first choice, but it’s a higher bar to aspire to reach.


It does raise an important point that I’ll admit I had not considered before: keeping a detailed build process ready for when new models come online, so that you can trivially rebuild your vectors.

I think many, including myself, have just treated the actual embedding phase as a one-off, rather than a reusable and tested part of data management.


A benchmark, however, needs to be made specifically for embeddings as one would use them in typical query-to-knowledge retrieval:

  • small inputs (user questions about document domains) vs. large inputs (chunks with human-ranked relevance, or with no relevance at all), as used in retrieval-augmented generation backed by semantic search.

Then discrimination, clarity, and consistency of threshold can be the ultimate result, along with the ability to find the needle in the chunk. It could also compare approaches that rely on an 8k context against techniques for scoring 8k of text with smaller-context models. Scoring based on scoring.

As one can see, that needs a lot of human preparation and judgement, along with input from Swahili speakers, so it is more elusive.
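
A rough sketch of the core measurement, assuming a hypothetical `model_embed` callable and a hand-labelled set of (question, relevant chunk, irrelevant chunk) triples:

```python
import numpy as np

def threshold_separation(model_embed, triples):
    """How cleanly does one cosine threshold separate relevant from
    irrelevant chunks across the whole labelled set?"""
    def cos(a, b):
        a, b = np.asarray(a), np.asarray(b)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    pos = [cos(model_embed(q), model_embed(rel)) for q, rel, _ in triples]
    neg = [cos(model_embed(q), model_embed(irr)) for q, _, irr in triples]
    return {
        "mean_relevant": float(np.mean(pos)),
        "mean_irrelevant": float(np.mean(neg)),
        "margin": min(pos) - max(neg),  # > 0: a single threshold cleanly separates
    }
```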