Based on MTEB scores, 3-large at 3072 dimensions is competitive with the SOTA models. The exact ranking depends on what you are doing.
For example, RAG is useful to me, so here are the MTEB rankings for 3-large @ 3072 for RAG:
4th for Retrieval @ 55.4 score (top is e5-mistral-7b-instruct @ 56.6 score)
Will you notice a difference in a score of 56.6 vs. 55.4? Not sure.
MTEB Retrieval uses BEIR.
But what I do know is that, to get into the current “SOTA club,” you need to go with 3-large at 3072 dimensions.
Once you’re there, you can relax and have a drink.
Sadly, 3-small ranks 17th at 1536 dimensions, so it’s not off to a good start.
For reference, ada-002 ranks 27th, so it’s already antiquated and left for dead.
My plan is to use 3-large at 3072 and, if I need the speed, rapidly synthesize the lower dimensions on my own, as discussed over here.
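For reference, a minimal sketch of that truncate-and-renormalize idea in Python, assuming the current openai client and numpy (the helper name is my own): embed at the full 3072 dimensions, slice down to whatever smaller dimension you need, and re-normalize to unit length so cosine similarity still behaves.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed_3_large(texts, dim=3072):
    """Embed with text-embedding-3-large at the full 3072 dims, then
    truncate and re-normalize locally to synthesize a lower dimension."""
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    vecs = np.array([d.embedding for d in resp.data])        # shape: (n, 3072)
    if dim < vecs.shape[1]:
        vecs = vecs[:, :dim]                                  # keep the first `dim` components
        vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)   # back to unit length for cosine
    return vecs
```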
The sad thing about embedding models is that they are fixed in time, and as new MTEB rankings come out every day, your favorite model inevitably starts dropping in the rankings.
But you can always use multiple embedding models at once and fuse their rankings with RSF or RRF (reciprocal rank fusion). Maybe shift the hybrid weighting over time, so you de-emphasize sunsetting models and emphasize recent performers.
The tricky part of running different models at once is context length, which varies all over the board, and each model provider has different latencies, so there are several other considerations to factor in here.
But in theory, you could do parallel API calls to, say, 5 models and fuse all the rankings into one.
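Roughly, that could look like the sketch below. The retrieve_with_* functions are placeholders for whatever per-model vector search you actually run (they just need to return a ranked list of doc ids, best first), and the fusion is plain reciprocal rank fusion: each document earns 1 / (k + rank) per model, and the scores are summed.

```python
from concurrent.futures import ThreadPoolExecutor

def retrieve_with_3_large(query):
    """Placeholder: wire this to your 3-large @ 3072 vector search.
    Must return doc ids, best match first."""
    return ["doc_a", "doc_b", "doc_c"]

def retrieve_with_e5(query):
    """Placeholder: another model's retriever (e.g. e5-mistral-7b-instruct)."""
    return ["doc_b", "doc_d", "doc_a"]

RETRIEVERS = {
    "3-large-3072": retrieve_with_3_large,
    "e5-mistral": retrieve_with_e5,
    # ...add up to ~5 models here
}

def fused_search(query, k=60, top_n=10):
    """Call every retriever in parallel, then fuse the rankings with RRF:
    score(doc) = sum over models of 1 / (k + rank_in_that_model)."""
    with ThreadPoolExecutor(max_workers=len(RETRIEVERS)) as pool:
        ranked_lists = list(pool.map(lambda fn: fn(query), RETRIEVERS.values()))

    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)

    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(fused_search("what did we ship last quarter?"))
```

The parallel calls also help with the latency spread: the slowest provider sets the pace instead of the sum of all of them.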
One good thing about this model-diversity approach is that you get much higher uptime, because you don’t rely on just one model.
Also, as models sunset, you can focus on bringing the new ones online while the rest of your model cluster keeps working in parallel, feeding its weighted inputs into your RAG.
So you have a lot more slack as you transition to new models, since they are continuously blended and re-weighted over time.
You could even shut down ada-002, if you coded it right, and let your other models take up the slack while you get a replacement established and fused into the cluster.
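One way to express that sunsetting idea is a weight per model, as in this sketch; the registry keys and weights here are made up for illustration, not measured. A weighted RRF lets a dying model’s contribution go to zero while the rest of the cluster keeps answering.

```python
# Illustrative registry only: these model keys and weights are made up.
# Lower a weight to de-emphasize a sunsetting model; set it to 0 to shut it
# off entirely while the other models take up the slack.
MODEL_WEIGHTS = {
    "3-large-3072": 1.0,   # current primary
    "e5-mistral": 0.8,     # solid secondary
    "ada-002": 0.0,        # sunsetting: still wired up, no longer contributes
}

def weighted_rrf(ranked_lists_by_model, k=60, top_n=10):
    """Weighted reciprocal rank fusion over whichever models are online."""
    scores = {}
    for model, ranking in ranked_lists_by_model.items():
        weight = MODEL_WEIGHTS.get(model, 0.0)
        if weight == 0.0:
            continue  # model is shut down or unknown; the others cover for it
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# ada-002's list is ignored because its weight is 0.
print(weighted_rrf({
    "3-large-3072": ["doc_a", "doc_b"],
    "e5-mistral": ["doc_b", "doc_c"],
    "ada-002": ["doc_z"],
}))
```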
This is what I consider good EmbedOps, but honestly, I’m not always good at it myself, because it takes work to get all these things spun up and to keep up with the latest models.
Lots of plates spinning in the air, and plenty of cognitive load. So it’s not my first choice, but it’s a higher bar to aspire to.