Quality of embeddings using davinci-001 embeddings model vs. ada-002 model

I have noticed a very significant degradation of quality in terms of relevance scoring (cosine similarity) using the ada-002 embeddings model compared to the davinci-001 embeddings model. Has anyone noticed the same? Does anyone else consider this an urgent problem? My use case is high-stakes involving complex legal language. I can’t believe the quality reduction since I re-embedded all of my text using ada-002. I have sent an email to support but looking for advice here. Thanks.


It says right there that it is better. :zipper_mouth_face:

The new model, text-embedding-ada-002, replaces five separate models for text search, text similarity, and code search, and outperforms our previous most capable model, Davinci, at most tasks, while being priced 99.8% lower.

Impressive claim considering:

| Model | Price per 1k tokens | Dimensions | Parameters |
|---|---|---|---|
| text-similarity-davinci-001 | $0.20 | 12288 | 175B |
| text-embedding-ada-002 | $0.0001 | 1536 | (ada-class, ~350M) |

You may have a case needing full LLM language fluency. Not most tasks.

Azure will keep running these deprecated models for longer before shutoff, if you don’t want to embed everything again (at a price similar to davinci’s, were there an alternate provider at that level). Deploy now if you can…

There are other embeddings providers, if you can work with smaller chunks, do a bit more multi-part chunk combining, or adapt your use to a smaller context length. “Cohere” is topping leaderboards.

Yes, I said my use case was complex and high-stakes linguistically. And yes, it’s not most tasks. But these sophisticated models are needed most for sophisticated language tasks. Users should be able to test new models on their datasets and decide for themselves which model is better. And should retain access to earlier models if better. Otherwise their customers are put in a bind having to look for new solutions and redo their hard work chunking text and writing new code. Terrible.
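
For anyone wanting to run that kind of head-to-head test on their own data, here is a minimal sketch, not anyone's actual evaluation setup. It assumes you already have query and passage vectors from each model plus a hand-labelled "correct passage" per query. The toy 3-dimensional vectors, the `relevant` list, and all function names are made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_docs(query_vec, doc_vecs):
    """Return doc indices sorted from most to least similar to the query."""
    return sorted(range(len(doc_vecs)),
                  key=lambda i: cosine(query_vec, doc_vecs[i]),
                  reverse=True)

def mean_reciprocal_rank(query_vecs, doc_vecs, relevant):
    """relevant[q] is the index of the one passage judged correct for query q."""
    total = 0.0
    for q, query_vec in enumerate(query_vecs):
        ranking = rank_docs(query_vec, doc_vecs)
        total += 1.0 / (ranking.index(relevant[q]) + 1)
    return total / len(query_vecs)

# Toy vectors standing in for real embeddings of the same texts from two models.
docs_model_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
queries_model_a = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1]]
docs_model_b = [[1.0, 0.2, 0.0], [0.9, 0.3, 0.0], [0.0, 0.0, 1.0]]
queries_model_b = [[0.9, 0.28, 0.0], [0.95, 0.2, 0.0]]
relevant = [0, 1]  # hypothetical labels: query 0 -> passage 0, query 1 -> passage 1

mrr_a = mean_reciprocal_rank(queries_model_a, docs_model_a, relevant)
mrr_b = mean_reciprocal_rank(queries_model_b, docs_model_b, relevant)
print(f"model A MRR: {mrr_a:.3f}, model B MRR: {mrr_b:.3f}")
```

Mean reciprocal rank is only one possible metric; with several acceptable passages per query, recall at k would probably be a better fit.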

OpenAI has quietly put forth the possibility of dedicated instances for enterprise customers. That may or may not apply to GPT-3 models beyond their bye-bye date. Figures to say hello to OpenAI’s sales team are large fractions of a million dollars.

You could be the sole worldwide provider of continuing davinci embeddings besides continuing your own use…

Good idea! I’ll give that some serious thought. It was truly world class. Now, I just don’t know what I’m going to do.

If Ada-002 doesn’t float your boat, then check out the embedding leaderboard:

I know that currently the top ranked Cohere models are available through API on AWS.

But you have to be cool with only embedding up to 512 tokens max at a time. :man_shrugging:

Other leaderboard variants may perform well enough and are small enough to run on your own infrastructure. Or you may find API versions for the bigger, hard-to-wrangle ones.

One thing though, is Davinci had some insane 12k-dimensional vectors (12288 dimensions). So going to any newer model will likely be ~1k dimensional or less. This will massively speed up your search, and reduce DB storage costs as well. So you should probably start shopping around simply due to the massive vectors that Davinci put out.
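
To put rough numbers on the storage point, here is a back-of-the-envelope sketch. It uses the dimension counts from the pricing table earlier in the thread (12288 vs 1536) and assumes uncompressed float32 vectors and a hypothetical corpus of one million chunks; real vector databases add index overhead on top.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors stored as uncompressed float32

def index_size_gb(num_vectors, dims):
    """Raw vector storage in gigabytes (10**9 bytes), ignoring index overhead."""
    return num_vectors * dims * BYTES_PER_FLOAT32 / 1e9

num_vectors = 1_000_000  # hypothetical corpus size
davinci_gb = index_size_gb(num_vectors, 12288)
ada_gb = index_size_gb(num_vectors, 1536)
print(f"davinci-001: {davinci_gb:.2f} GB, ada-002: {ada_gb:.2f} GB, "
      f"ratio: {davinci_gb / ada_gb:.0f}x")
```

Brute-force similarity search cost scales the same way, since each comparison touches every dimension once.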

Are they really that good?

[unproductive commentary removed]

One thing you could investigate is llama 2 (or similar) embeddings. I’m not sure if they’re mature enough yet, and this is absolutely selfish of me, but I would absolutely be delighted if you tried them and reported your findings :grin:

This has little to do with your use case, but if someone wanted to try out llava embeddings and share their experience, that would be absolutely delightful :star_struck:

I’d be happy to try out llama 2 embeddings.

Thanks for the information Curt. Yes, they were really that good. Probably because there were so many more dimensions, the nuances and complexity of legal language were captured more precisely. That’s my hypothesis anyway.

It makes sense that more dimensions capture more detail.

I suspect the embedding dimensions will increase over time.

But as it stands now, the trend is to go lower in dimensions.

According to MTEB, davinci-001 had the largest number of dimensions (ever thus far).

But the MTEB rank for Davinci-001 was only 83. So it may be the case davinci-001 is more of a specialist model (your complex legal inputs), but doesn’t generalize well to general use cases. Just a theory.

Whereas ada-002 is still the top ranked 8k context model available according to MTEB. I even use it for small context. But because of the massive 8k context, I can use it for anything … from pages and pages of material at once, to small phrases, or even single words or names. (BTW, I use it for name similarity investigations (1-3 tokens), works great in this context, even though what small things are embedded aren’t even words.)

The bigger the chunks I embed, the fewer vectors I need. But the vectors might get diluted.

So …

To counteract this, I have a TF-IDF like keyword system. Keywords are tuned to each domain based on rarity.

You can also fuse the keyword rankings and embedding rankings to form an overall ranking with RRF, or reciprocal rank fusion.

So besides using just embeddings, the RRF with a keyword leg may get you there. There is more information on my approach on this forum if you are interested. It’s my “MIX” algorithm, and is essentially a log-normalized TF-IDF algorithm inspired by information theory.

I can see this hybrid approach possibly helping, especially since it sounds like you need something to boost your results after the deprecation of davinci-001.
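
The fusion step mentioned above can be sketched in a few lines. This is the generic reciprocal rank fusion formula (each document scores the sum of 1/(k + rank) over the lists it appears in), not the “MIX” algorithm itself. The constant k=60 is a commonly used default, and the clause ids are invented for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A doc missing from a list simply contributes nothing for that list.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results from the two "legs" for one query.
embedding_ranking = ["clause_7", "clause_2", "clause_9", "clause_4"]
keyword_ranking = ["clause_2", "clause_4", "clause_7", "clause_1"]

fused = reciprocal_rank_fusion([embedding_ranking, keyword_ranking])
print(fused)
```

Because only ranks are used, the cosine scores and keyword scores never need to be put on a common scale, which is most of the appeal.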


I’m afraid that this industry may have a predilection for chasing outdated metrics. MTEB is huge, but I’m wondering if we’re not optimizing for something we don’t really need.

But your approach, in theory, tries to solve the focal plane problem.

Abstractly speaking, your TF-IDF is a sort of zero depth ~170000 dim embedding. It’s almost at the shallowest possible embedding level. You can go a level deeper with word vectors and word nets and such, where you’re starting to operate on the synonym level.

I don’t know how deep ada is, but I could guess it goes at least around 10 layers deep, shrinking to the 1.5k dims. At this depth, we’re no longer really concerned with counting words, but rather the incidence of concepts and analogues.

Sometimes these concepts are too diffuse to properly or meaningfully resolve at this level, so you have a shallower backup layer to fall back to. If the concepts are simple and discriminable by terminology, you’ll favor layer 0, but if concepts are abstractly described, you’d rather go for layer 10.

but I suspect what OP wants (and what I want) is to go more abstract. sometimes words are wholly inadequate, and sometimes layer 10 concepts can’t properly describe it - so we want to shift the focal plane to layer 20.

since we’re all so used to operating at the google keyword level, and considering that most casually generated text is pretty simple, I can imagine that relatively shallow embeddings are more than enough for the majority of current consumer and enterprise level use-cases.

however, we want to push the envelope. we need deep plenoptic embeddings.

Yeah it would be cool to get the normalized layer values at any given depth.

Some depth: n parameter. This controls the level of meaning … surface level for early layers, deeper integrated meaning for later layers.

Or another parameter would be how many output dimensions dims: x. This controls the amount of reporting detail you need. So not much for small dims, or highly detailed for higher dims.

Or call both, pick the depth and the dimensions!

But you are right, the keyword based representation is equivalent to a large vector, and sparse since mostly zeros, so it has no averaging or “forced understanding” that the hidden layers have.

But until the embedding models become parametrizable (depth/dims) you have to form an answer over an ensemble of embedding models (sparse and dense) to get that good focal plane variation.

This embedding ensemble is sorta like HDR imaging techniques.
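
To make the "keywords are just a big sparse vector" point concrete, here is a small sketch using plain textbook TF-IDF with log-scaled term frequency. It is an assumption-laden stand-in, not the domain-tuned keyword system described earlier; the three sample sentences and the function names are made up.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each tokenized doc to a sparse {term: weight} dict."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # log-scaled term frequency times inverse document frequency
            term: (1 + math.log(c)) * math.log(n / doc_freq[term])
            for term, c in counts.items()
        })
    return vectors

def sparse_cosine(a, b):
    """Cosine similarity over only the terms the two vectors share."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = [
    "the lessee shall indemnify the lessor".split(),
    "the lessor shall indemnify the lessee against claims".split(),
    "the tenant shall pay rent monthly".split(),
]
vecs = tfidf_vectors(docs)
print(sparse_cosine(vecs[0], vecs[1]), sparse_cosine(vecs[0], vecs[2]))
```

Each vector conceptually has one dimension per vocabulary term, with zeros everywhere a term is absent. Terms that occur in every document get weight zero, so there is no averaging across dimensions as in a dense model. Rankings from this leg and a dense leg can then be combined as an ensemble.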


Yeah I was thinking of light field photography and microscopy/transmission tomography, but HDR might also be a good analogue.

But until the embedding models become parametrizable (depth/dims) you have to form an answer over an ensemble of embedding models (sparse and dense) to get that good focal plane variation.

I think llama could potentially be a good test candidate, with the 7, 13, 70 sizes.

To update you on my fix so far: I switched to ‘gpt-4-1106-preview’ to give myself more tokens for the context window. That way, more of the embedded text gets included in my prompt, which means the answers are better. With the davinci embeddings, I always got the best embeddings matches first so the token limit didn’t matter. With ada-002, it seems to have helped having 1106-preview’s larger token limit. I’d still rather have the better embedding model back, but at least I don’t feel like I have to literally shut my site down and stop doing customer demos! I was waiting to switch to gpt-4-1106-preview until OpenAI said it was ready for production traffic, but since the traffic on my site is not huge, I decided to use it and it’s helped. Interesting to think about the trade-off between the quality of embeddings for complex text and the token limits in RAG applications. Thanks for your input on my problem.
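
A rough sketch of that trade-off, under stated assumptions: retrieved chunks are packed into the prompt in ranked order until a token budget runs out. The helper names are hypothetical, the chunks are invented, and whitespace word counts stand in for a real tokenizer.

```python
def pack_context(ranked_chunks, token_budget, count_tokens):
    """Take chunks in ranked order until the next one would exceed the budget."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > token_budget:
            break
        selected.append(chunk)
        used += cost
    return selected, used

def rough_token_count(text):
    # Assumption: whitespace-separated words as a crude proxy for model tokens.
    return len(text.split())

# Hypothetical retrieval result where the best passage is only ranked third.
ranked_chunks = [
    "loosely related clause about notice periods",
    "another near miss about governing law",
    "the clause that actually answers the question",
]

small, small_used = pack_context(ranked_chunks, 12, rough_token_count)
large, large_used = pack_context(ranked_chunks, 40, rough_token_count)
print(len(small), small_used, len(large), large_used)
```

With the small budget the third-ranked chunk never reaches the prompt; with the larger budget it does. That is the sense in which a bigger context window can paper over weaker ranking, at the cost of more tokens per request.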


Further update: even with gpt-4-1106-preview, the weakness of the embeddings is clear. Plus the way gpt-4-1106 follows instructions isn’t as good as gpt-4. Lots of trade-offs going on in terms of prompt instructions, number of search results to retrieve etc. in order to wrangle the answers I want. With gpt-4 and davinci embeddings, it all worked so beautifully…frustrating.
