BERT better than Ada 002?

Hi, community.
I’ve been experimenting with Ada-002 for STS (semantic textual similarity). I used the STS 2016 dataset from MTEB to compare Ada-002 against three BERT models: MS MARCO, MiniLM, and MPNet. All three BERT models outperformed Ada-002 on STS (using cosine similarity between the dataset’s sentence pairs). Ada only outperforms BERT on cross-language similarity, because these BERT models are English-only.

I am surprised by these results because I would expect the vastly larger Ada to outperform the much smaller BERT models.

Does anyone have any experiments or research to show that Ada 002 is superior to BERT?

— From my paired t-tests —
I used MTEB 2016’s sentence pairs and ran them through Ada-002 and the three BERT models to generate cosine similarities. I then paired those similarities against MTEB’s ‘ground truth’ similarity scores.
According to the t-statistic, MPNet’s similarity scores most closely match MTEB’s human-determined scores, but there is still a statistically significant difference between them.
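For reference, here is a minimal sketch of how such a paired t-test can be run with SciPy. The arrays below are synthetic stand-ins for the per-pair cosine similarities and the MTEB scores rescaled to 0–1; the real analysis would substitute the actual values.

```python
import numpy as np
from scipy.stats import ttest_rel

# Synthetic stand-ins: MTEB human scores rescaled to 0-1, and one model's
# cosine similarities for the same sentence pairs (with a systematic offset).
rng = np.random.default_rng(0)
ground_truth = rng.uniform(0.0, 1.0, size=200)
model_sims = np.clip(ground_truth + rng.normal(0.05, 0.1, size=200), 0.0, 1.0)

# Paired t-test: do the model's similarities differ systematically
# from the human scores on the same sentence pairs?
t_stat, p_value = ttest_rel(model_sims, ground_truth)
print(f"t-statistic: {t_stat}")
print(f"p-value: {p_value}")
if p_value < 0.05:
    print("The differences are statistically significant.")
```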

Model: Ada02_Cos_Sim (not normalized, so cosine similarities range from 0.7 to 1)
t-statistic: 49.52204858828296
p-value: 7.202349242150549e-291
The differences are statistically significant.

Model: Ada002_Normalized (range is 0.0 to 1.0)
t-statistic: 28.99078943085248
p-value: 4.145519385937961e-140
The differences are statistically significant.

Model: BERT_MSMarco (range effectively 0 - 1)
t-statistic: 22.560656918692175
p-value: 4.728459634918976e-94
The differences are statistically significant.

Model: BERT_MiniLM (range effectively 0 - 1)
t-statistic: 22.856838512450626
p-value: 4.320214756628714e-96
The differences are statistically significant.

Model: BERT_MPNet (range effectively 0 - 1)
t-statistic: 20.526835851905595
p-value: 2.3562601811713193e-80
The differences are statistically significant.

But, looking at the graphs of the outputs, the clear advantage of Ada seems to be that the groupings of the similarities are much tighter (though not nearly as tight as we would hope).

BERT_MPNet vs. MTEB 2016: BERT MPNet compared to MTEB’s ‘ground truth’ (0 to 5)

Ada 002 scores normalized to a range of 0 to 1 to match MTEB and BERT. The normalization formula is:
= (Ada002_Output - 0.7) / (1 - 0.7), where 0.7 is assumed to be Ada’s lowest semantic-similarity value.
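A small sketch of that min-max rescaling, with 0.7 as the assumed floor of Ada’s cosine similarities:

```python
def normalize_ada(cos_sim, floor=0.7):
    """Min-max rescale a cosine similarity from [floor, 1] to [0, 1]."""
    return (cos_sim - floor) / (1.0 - floor)

print(normalize_ada(0.7))   # the assumed floor maps to 0
print(normalize_ada(0.85))  # the midpoint maps to ~0.5
print(normalize_ada(1.0))   # the ceiling maps to ~1
```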

Not-normalized Ada002 values per MTEB score, where 5 = completely related and 0 = chaos!

Would it be fair to say that BERT is more accurate in the aggregate and Ada 002 statistically more accurate individually? (Sorry, I’m new to stats.)

Finally, if you wanted to use one statistic or metric to show that one embedding model is better than another, what would you use?

Ada-002 is currently ranked 15th on the leaderboard. The best BERT model is 39th.

So …

Not sure what your stats are implying, but basically: if you embed 8k tokens at a time, you have three options, and ada-002 is the highest-ranked of these large-input-token models.

Going strictly on performance, you are looking at 512 tokens at best.

1 Like

Thank you, @curt.kennedy. You’ve restored my faith in OpenAI and embeddings!

But I guess I still expected Ada-002 to have a significantly higher correlation with the MTEB ground truth.

1 Like

And perhaps answering my own question: I just calculated the Pearson correlation coefficient between each model and the MTEB ground truth. Ada, both normalized and straight out of the gate, has a higher correlation with the MTEB ground truth than the BERT models. (The two Ada rows are identical because Pearson correlation is unchanged by a linear rescaling.) That’s what I would expect / was hoping for.

Model: Ada02_Cos_Sim
Pearson R: 0.8337220622537498
Pearson p-value: 1.179042504678312e-307

Model: Ada002_Normalized
Pearson R: 0.8337220622537498
Pearson p-value: 1.179042504678312e-307

Model: BERT_MSMarco
Pearson R: 0.7613182195390986
Pearson p-value: 4.879127480463731e-225

Model: BERT_MiniLM
Pearson R: 0.7793666969782451
Pearson p-value: 1.2123189953050961e-242

Model: BERT_MPNet
Pearson R: 0.7882478211979903
Pearson p-value: 6.245934806392621e-252
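These were presumably computed with something like `scipy.stats.pearsonr`; here is a sketch with synthetic stand-in data (not the real MTEB values), which also demonstrates that a min-max rescaled score gives the same R as the raw score, since Pearson correlation is invariant under linear transforms:

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic stand-ins for the MTEB scores (0-1) and Ada's raw cosine
# similarities, which live roughly in [0.7, 1].
rng = np.random.default_rng(1)
ground_truth = rng.uniform(0.0, 1.0, size=200)
raw = 0.7 + 0.3 * np.clip(ground_truth + rng.normal(0.0, 0.1, size=200), 0.0, 1.0)
normalized = (raw - 0.7) / (1.0 - 0.7)  # the min-max rescaling

r_raw, p_raw = pearsonr(raw, ground_truth)
r_norm, p_norm = pearsonr(normalized, ground_truth)
print(r_raw, r_norm)  # identical: Pearson R is unchanged by linear rescaling
```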

1 Like

Hi, @curt.kennedy.
I have (hopefully) one last question.
You mentioned that Ada002 is the best-performing model for 8k+ token inputs. Is that defined by the MTEB leaderboard (a Hugging Face Space by mteb)?

I ask because I would like to cite that - Ada is best for text greater than 512 tokens - in my paper. Is that published somewhere or can you tell that by the model names on the MTEB leaderboard?

1 Like

Just sort MTEB by “Sequence Length”, and look at the ranking:

There are three 8k models, and ada-002 is ranked 15th, the other two are 17th and 31st.

Anything better is limited to 512 or 514 input tokens.

Note that jina-embeddings-v2-base-en is really close to ada-002 in performance, is only 0.27 GB, and has a reduced dimension count of 768 (faster search). So for a lot of reasons, it could be better than ada-002, with only slight degradation.

My particular use case of ada-002 is kinda weird: one thing I do is check non-English names for similarities, so I’m not sure an English-only model will work for this. But something to consider!


There are also strong arguments to be made for using a mixture of embedding models, especially if they are small and cheap.

Also of note…

If you have a lot of data, you can fine-tune some of these embedding models, particularly jina-embeddings-v2-base-en, which is not something you can do with ada-002.

This will almost certainly result in much better RAG results.

Honestly though… I think combining text-embedding-ada-002 and a fine-tuned jina-embeddings-v2-base-en will likely yield strongly superior results.

  • Create a hypothetical document for an ideal response. (HyDE)
  • Pull in data using hybrid search with both embedding engines
  • Order them with reciprocal ranking (or with a more sophisticated selection process[1])
  • Pull in expanded context around each of your results
  • Use a cheap model to combine everything together in a single synthetic document in the context of the retrieval prompt
  • Re-embed the new synthetic document with both embedding models to store for later retrieval (one possible sanity-check would be to see where this new synthetic document comes up in your retrieval search—ideally it should be the top result by keywords and similarity for both models since it is supposed to represent the best possible information in your data store)
  • Generate a response from your primary model using the synthetic document
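The reciprocal-ranking step above can be sketched as a small reciprocal rank fusion (RRF) function; the document IDs and the conventional k = 60 constant are illustrative:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one, RRF-style.

    Each list contributes 1 / (k + rank) per document, so documents
    ranked highly by multiple retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. one ranking from each embedding engine plus a keyword search
fused = reciprocal_rank_fusion([["a", "b", "c"], ["a", "c", "b"], ["b", "a", "c"]])
print(fused)  # ['a', 'b', 'c']
```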

  1. I have some ideas I have been toying with here I can share if anyone is interested. ↩︎


pls share

1 Like

I would be too worried about cross-contamination. Having the synthetic retrieval results turn back into ‘facts’?

Maybe putting it in a separate database and de-prioritizing it in another RRF thread would make me less paranoid!

This is an interesting idea. I’d have to think about this one. Is it necessary though? Is this just to save downstream tokens in a more expensive model?

While you can’t directly fine-tune ada-002, have you thought about feeding the embedding vector into another model that you train from scratch, like a simple FFNN (like this Kaggle notebook?)?

It’s probably not the best example, but the idea is you feed in the embedding vector, from a particular model, and it takes that vector as input and responds with some classifier label that you have trained it on.

It might be more work than it’s worth. But it’s a way to take the embedding vector from one system and translate it to your domain.

It’s probably much easier to have your embeddings “labeled” and anything new within a certain cosine similarity of that same labeled item also is of that label.
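A minimal sketch of that label-by-cosine-similarity idea; the toy 3-dimensional vectors, labels, and 0.9 threshold are all placeholders for real embeddings:

```python
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def label_by_similarity(query, labeled, threshold=0.9):
    """Assign the label of the most similar labeled embedding,
    or None if nothing clears the threshold."""
    best_label, best_sim = None, threshold
    for vec, label in labeled:
        sim = cosine_sim(query, vec)
        if sim >= best_sim:
            best_label, best_sim = label, sim
    return best_label

labeled = [
    (np.array([1.0, 0.0, 0.0]), "billing"),
    (np.array([0.0, 1.0, 0.0]), "support"),
]
print(label_by_similarity(np.array([0.95, 0.05, 0.0]), labeled))  # billing
print(label_by_similarity(np.array([0.5, 0.5, 0.5]), labeled))    # None
```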

But I’m trying to wrap my head around fine-tuning a Jina embedding model, have you tried it?

1 Like

Essentially, you want to maximize the amount of relevant information you retrieve, right?

Imagine you’ve embedded a huge number of documents. Some are revisions of earlier documents, some are summary documents, etc.

Let’s say the ideal retrieval ultimately requires combining three different sets of facts or ideas, let’s call them \mathcal{A}, \mathcal{B}, and \mathcal{C}.

You might have many documents that relate to each of these ideas separately, and even several that relate to \mathcal{A} \cap \mathcal{B}, but no documents that relate to all three ideas.

It’s likely that when retrieving information, all of your top results are very similar—they all relate to the same combination, \mathcal{A} \cap \mathcal{B} because that’s the closest match to what you’re looking for since it includes two out of the three disparate ideas.

It’s very possible that your retrieval in this case could completely miss pulling in any specific information about idea \mathcal{C}.[1]

So, what I propose is something that comes from my field of experimental design[2]: orthogonal subsampling.

You want to pull in resources which are a very high match for the desired generation but they should be as different from each other as possible.

Here's a toy example from experimental design

Say you have a machine that makes widgets.

  • The machine has two dials (call them x_1 and x_2) which can be set continuously between 0–1.
  • The widgets can be objectively measured for quality on a scale of 0–1.
  • You have a set number (n) of tests you can perform before you have to do a production run of, say 1,000,000, widgets.

Your goal is to figure out the combination of dial settings that produces the best widgets.

I’m not going to go too deeply into the theory and finer details here but if the number of runs was n = 4, without any other information, the runs you’d want to do are,

| x_1 | x_2 |
| --- | --- |
| 0 | 0 |
| 1 | 0 |
| 0 | 1 |
| 1 | 1 |

The gist is this design encloses the most area possible in the design space, the (centered) column-vectors are orthogonal to each other, and the measured points are maximally distant from each other.
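That orthogonality claim is easy to check numerically: centering the two columns of the 2^2 design and taking their dot product gives exactly zero.

```python
import numpy as np

# The four runs of the 2^2 factorial design above
design = np.array([[0, 0],
                   [1, 0],
                   [0, 1],
                   [1, 1]], dtype=float)

centered = design - design.mean(axis=0)   # subtract the column means (0.5 each)
dot = float(centered[:, 0] @ centered[:, 1])
print(dot)  # 0.0 -> the centered columns are orthogonal
```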

We can employ principles of experimental design in sampling by trying to choose a sample that is as similar as possible to what we would have designed.

Turning back to the idea of embeddings before I stray too far afield…

If you were planning on injecting two embeddings into context, you might have three results with cosine-similarity scores of, say, 0.92, 0.90, and 0.86. At first glance the obvious choice is to include the first two, since they’re more similar to what you are looking for. But if the cosine similarity between the first two is, say, 0.99, they are essentially the same embedding, and you gain nothing from including them both. If, however, the cosine similarity between the first and third is, again let’s just say, 0.63, then we can be more confident that the information in the third is different from the information in the first (or second).

So, under the assumption that your embeddings are all (roughly) the same number of tokens, we get more total information added to the context by including the first and the third than we would by including the first and the second.

Something that wouldn’t be picked up simply by ranking them.

But you could have an even more complicated setup. Maybe your first ten results have cosine-similarity scores to your query ranging between 0.92 and 0.84, but they all have a cosine similarity with each other over 0.90. Then you have an eleventh retrieval with a cosine-similarity score of 0.82 and a maximum cosine similarity to any of the top ten of 0.65. You might want to go further down the ranking to pick up this eleventh retrieval, on the idea that it might have “new” information not contained in the first ten.

Incidentally, this is one of the problems with the embedding space not being truly filled, but rather existing in a relatively small cone: nothing is truly orthogonal, so we can’t be absolutely certain whether, in this case, the eleventh retrieval has different information or just less information. But, in general, I think we should assume that as the cosine similarity between two retrievals approaches the observed minimum, the information they contain is more different.
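A greedy sketch of this pick-relevant-but-diverse selection, using the similarity numbers from the example above; the lambda trade-off weight of 0.7 is an illustrative choice:

```python
import numpy as np

def diverse_select(query_sims, pairwise, n, lam=0.7):
    """Greedily pick n results, trading off similarity to the query (lam)
    against maximum similarity to results already selected (1 - lam)."""
    selected, remaining = [], list(range(len(query_sims)))
    while remaining and len(selected) < n:
        def score(i):
            redundancy = max((pairwise[i][j] for j in selected), default=0.0)
            return lam * query_sims[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Three results with query similarities 0.92, 0.90, 0.86; the top two are
# near-duplicates (0.99) while the third is genuinely different (0.63 / 0.65).
query_sims = [0.92, 0.90, 0.86]
pairwise = np.array([[1.00, 0.99, 0.63],
                     [0.99, 1.00, 0.65],
                     [0.63, 0.65, 1.00]])
print(diverse_select(query_sims, pairwise, 2))  # [0, 2]: first and third
```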

Apologies if that got a bit rambly. I promise I went back through and cut out huge swaths to try to make it more streamlined and readable.

  1. There are other ways to mitigate this too, I’m just laying out a general retrieval strategy that may also help in other ways beyond this toy example. ↩︎

  2. Something I co-authored a paper on ↩︎

1 Like

I 100% understand this concern. If the synthetic document is a good summary of the retrieval and all of the retrieved documents are relevant I think it’s safe-ish. But, that’s an admittedly pretty big “if.”

There is a danger if you’re looking for a description of a dog and you have one document saying the dog is big and another document (talking about a different dog) saying the dog is red. It could be problematic to have a new document saying the dog is big and red, especially if the big dog is blue and the red dog is small.

I would certainly include columns in the database to indicate it is a synthetic entry and to point to the primary source documents.

Necessary? No, I don’t think so. I think it would ultimately save on token cost (especially if the retrieval will remain in context for multiple messages), so that is one benefit.

The main reason I think it is valuable is because it allows the model to focus its attention on what is important more easily.

If you take in several retrieved documents, there will undoubtedly be,

  1. Overlap between them. It is plausible to suspect the model might place undue emphasis on ideas or facts which are repeated multiple times.
  2. Extraneous miscellany. Depending on the size of your embeddings, the information the model actually needs may be only a small part of the whole. In addition to increasing token usage by leaving them in, they could serve to distract the model (especially if they are repeated via the overlap described above).

I think the second part is far more important given gpt-4’s current problem solving skills. By way of example,

ChatGPT failing on a grade-school level word-problem with extraneous information

If we remove the extraneous information, ChatGPT has no difficulty answering the question correctly.

No, that’s new to me! interesting stuff I’ll have to play with… later.

I’ve not yet had the opportunity to fine-tune a Jina embedding model. I’m still working through my big data acquisition and cleaning project. Towards that end…

I am considering extending this,

GitHub - facebookresearch/nougat: Implementation of Nougat Neural Optical Understanding for Academic Documents

To be a multimodal model focusing only on e-born PDFs

Basically, instead of just doing OCR (which is great, by the way), the idea would be to send in the extracted text from a page as well. This should eliminate misreads (e.g. rn being read as m) and fix some display-mode recognition errors, like \widetilde{X} being read as \overline{X}, both due to the relatively low-resolution images they went with for their encoder model (672x896).

Since every PDF I need to ingest is e-born and I have no need for hand-written, scanned, or image-based PDF acquisition at this time, I think I might be able to get near-perfect results with a multimodal model. :crossed_fingers:

I just don’t have MetaMoney to be able to do the training with, so I am presently trying to kludge together some kind of makeshift solution…

Edit: Just saw this today,

Which is certainly interesting.

1 Like

Sorry to interrupt, but isn’t this what is usually addressed using Maximal Marginal Relevance (see this for further references)?