How to reproduce benchmark results for the new embedding v3 models? ada vs. v3 small & large models

OpenAI mentioned that the large model's performance stays roughly the same even after reducing dimensions, so I wanted to test this on benchmarks.

By default, the length of the embedding vector will be 1536 for text-embedding-3-small or 3072 for text-embedding-3-large. You can reduce the dimensions of the embedding by passing in the dimensions parameter without the embedding losing its concept-representing properties.
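For v3 models, OpenAI documents that a shortened embedding behaves like the full vector truncated to the first `dimensions` components and then re-normalized to unit length. A minimal sketch of that truncate-and-renormalize step (the function names `normalize` and `shorten_embedding` are mine, and the input vector is made-up illustration data):

```python
import math

def normalize(vec):
    # scale a vector to unit L2 norm
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def shorten_embedding(vec, dim):
    # keep the first `dim` components, then re-normalize
    return normalize(vec[:dim])

# toy "embedding" standing in for a real 1536/3072-dim vector
full = normalize([0.3, -0.1, 0.8, 0.4, -0.2, 0.5])
short = shorten_embedding(full, 3)
```

The shortened vector still has unit norm, which is why cosine similarities remain meaningful after reduction.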

I tried this:

import tiktoken
from mteb import MTEB
from sentence_transformers import SentenceTransformer

embedding_model = "text-embedding-3-small"
embedding_encoding = "cl100k_base"
max_tokens = 8000  # the maximum for text-embedding-3-small is 8191

encoding_model = tiktoken.get_encoding(embedding_encoding)

evaluation = MTEB(tasks=["CQADupstackPhysicsRetrieval"])
results = evaluation.run(encoding_model, output_folder=f"results_openai/{encoding_model}")

I'm getting this error:

    - CQADupstackPhysicsRetrieval, beir, s2p

38316/38316 [00:00<00:00, 55555.36it/s]
ERROR:mteb.evaluation.MTEB:Error while evaluating CQADupstackPhysicsRetrieval: Encoding.encode() got an unexpected keyword argument 'batch_size'
TypeError                                 Traceback (most recent call last)
<ipython-input-78-a01182201c3e> in <cell line: 13>()
     12 evaluation = MTEB(tasks=["CQADupstackPhysicsRetrieval"])
---> 13 results = evaluation.run(encoding_model, output_folder=f"results_openai/{encoding_model}")

5 frames
/usr/local/lib/python3.10/dist-packages/mteb/abstasks/ in encode_queries(self, queries, batch_size, **kwargs)
    116                     "Queries will not be truncated. This could lead to memory issues. In that case please lower the batch_size."
    117                 )
--> 118         return self.model.encode(queries, batch_size=batch_size, **kwargs)
    120     def encode_corpus(self, corpus: List[Dict[str, str]], batch_size: int, **kwargs):

TypeError: Encoding.encode() got an unexpected keyword argument 'batch_size'
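From the traceback, MTEB calls `self.model.encode(queries, batch_size=batch_size, **kwargs)`, so it expects the object passed to `evaluation.run(...)` to expose that interface. A tiktoken `Encoding` is only a tokenizer; its `encode()` takes no `batch_size` argument, hence the `TypeError`. A minimal sketch of a wrapper with the expected interface (the class name `OpenAIEmbedder` and the `embed_fn` callable are my own; in practice `embed_fn` would call the OpenAI embeddings endpoint for a batch of texts):

```python
class OpenAIEmbedder:
    """Wrapper exposing the encode(sentences, batch_size=..., **kwargs) interface MTEB expects."""

    def __init__(self, embed_fn):
        # embed_fn: callable mapping a list of strings to a list of embedding vectors
        self.embed_fn = embed_fn

    def encode(self, sentences, batch_size=32, **kwargs):
        # embed the sentences in batches and return one vector per sentence
        embeddings = []
        for i in range(0, len(sentences), batch_size):
            embeddings.extend(self.embed_fn(sentences[i:i + batch_size]))
        return embeddings

# demo with a stub embedding function (no API calls)
def fake_embed(batch):
    return [[float(len(text))] for text in batch]

model = OpenAIEmbedder(fake_embed)
vectors = model.encode(["a", "bb", "ccc"], batch_size=2)
```

With a real embedding function plugged in, `evaluation.run(model, ...)` should get past this error, since `model.encode` now accepts `batch_size`.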