Based on MTEB scores, 3-large at 3072 dimensions is competitive with the SOTA models. The exact ranking depends on what you are doing.
For example, RAG is useful to me, so here are the MTEB rankings for 3-large @ 3072 for RAG:
4th for Retrieval @ 55.4 score (top is e5-mistral-7b-instruct @ 56.6 score)
Will you notice a difference in a score of 56.6 vs. 55.4? Not sure.
MTEB Retrieval uses BEIR.
But what I do know is that, to get into the current “SOTA club,” you need to go with 3-large at 3072 dimensions.
Once you’re there, you can relax and have a drink.
Sadly, 3-small ranks 17th at 1536 dimensions, so it’s not off to a good start.
For reference, ada-002 ranks 27th, so it’s already antiquated and left for dead.
My plan is to use 3-large at 3072 and, if I need the speed, rapidly synthesize the lower dimensions on my own, as discussed over here.
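For reference, a minimal sketch of that truncate-and-renormalize idea in Python, assuming the current openai client and numpy (the helper name is my own): embed at the full 3072 dimensions, slice down to whatever smaller dimension you need, and re-normalize to unit length so cosine similarity still behaves.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed_3_large(texts, dim=3072):
    """Embed with text-embedding-3-large at the full 3072 dims, then
    truncate and re-normalize locally to synthesize a lower dimension."""
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    vecs = np.array([d.embedding for d in resp.data])        # shape: (n, 3072)
    if dim < vecs.shape[1]:
        vecs = vecs[:, :dim]                                  # keep the first `dim` components
        vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)   # back to unit length for cosine
    return vecs
```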
The sad thing about embedding models is that they are fixed in time, and as new MTEB rankings come out every day, your favorite model inevitably starts dropping in the rankings.
But you can always use multiple embedding models at once and fuse their rankings with RSF or RRF (reciprocal rank fusion). Maybe shift the hybrid weighting over time, so you de-emphasize sunsetting models and emphasize recent performers.
The tricky part of running different models at once is context length, which varies all over the board, and each model provider has different latencies, so there are several other considerations to factor in here.
But in theory, you could do parallel API calls to, say, 5 models and fuse all the rankings into one.
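Roughly, that could look like the sketch below. The retrieve_with_* functions are placeholders for whatever per-model vector search you actually run (they just need to return a ranked list of doc ids, best first), and the fusion is plain reciprocal rank fusion: each document earns 1 / (k + rank) per model, and the scores are summed.

```python
from concurrent.futures import ThreadPoolExecutor

def retrieve_with_3_large(query):
    """Placeholder: wire this to your 3-large @ 3072 vector search.
    Must return doc ids, best match first."""
    return ["doc_a", "doc_b", "doc_c"]

def retrieve_with_e5(query):
    """Placeholder: another model's retriever (e.g. e5-mistral-7b-instruct)."""
    return ["doc_b", "doc_d", "doc_a"]

RETRIEVERS = {
    "3-large-3072": retrieve_with_3_large,
    "e5-mistral": retrieve_with_e5,
    # ...add up to ~5 models here
}

def fused_search(query, k=60, top_n=10):
    """Call every retriever in parallel, then fuse the rankings with RRF:
    score(doc) = sum over models of 1 / (k + rank_in_that_model)."""
    with ThreadPoolExecutor(max_workers=len(RETRIEVERS)) as pool:
        ranked_lists = list(pool.map(lambda fn: fn(query), RETRIEVERS.values()))

    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)

    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(fused_search("what did we ship last quarter?"))
```

The parallel calls also help with the latency spread: the slowest provider sets the pace instead of the sum of all of them.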
One good thing about this model-diversity approach is that you get much higher uptime, because you don’t rely on just one model.
Also, as models sunset, you can focus on bringing the new ones online while the rest of your model cluster keeps working in parallel, feeding its weighted inputs into your RAG.
So you have a lot more slack as you transition to new models, since they are continuously blended and re-weighted over time.
You could even shut down ada-002, if you coded it right, and let your other models take up the slack while you get a replacement established and fused into the cluster.
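One way to express that sunsetting idea is a weight per model, as in this sketch; the registry keys and weights here are made up for illustration, not measured. A weighted RRF lets a dying model’s contribution go to zero while the rest of the cluster keeps answering.

```python
# Illustrative registry only: these model keys and weights are made up.
# Lower a weight to de-emphasize a sunsetting model; set it to 0 to shut it
# off entirely while the other models take up the slack.
MODEL_WEIGHTS = {
    "3-large-3072": 1.0,   # current primary
    "e5-mistral": 0.8,     # solid secondary
    "ada-002": 0.0,        # sunsetting: still wired up, no longer contributes
}

def weighted_rrf(ranked_lists_by_model, k=60, top_n=10):
    """Weighted reciprocal rank fusion over whichever models are online."""
    scores = {}
    for model, ranking in ranked_lists_by_model.items():
        weight = MODEL_WEIGHTS.get(model, 0.0)
        if weight == 0.0:
            continue  # model is shut down or unknown; the others cover for it
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# ada-002's list is ignored because its weight is 0.
print(weighted_rrf({
    "3-large-3072": ["doc_a", "doc_b"],
    "e5-mistral": ["doc_b", "doc_c"],
    "ada-002": ["doc_z"],
}))
```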
This is what I consider good EmbedOps, but honestly, I’m not always good at it myself, because it takes work to get all these things spun up and to keep up with the latest models.
Lots of plates spinning in the air, and plenty of cognitive load. So it’s not my first choice, but it’s a higher bar to aspire to.