How to reproduce benchmark results for the new embedding v3 models? ada vs. v3 small & large models

OpenAI mentioned that the large model's performance stays roughly the same even after reducing dimensions, so I wanted to test this on benchmarks.

By default, the length of the embedding vector will be 1536 for text-embedding-3-small or 3072 for text-embedding-3-large. You can reduce the dimensions of the embedding by passing in the dimensions parameter without the embedding losing its concept-representing properties.
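For v3 models, OpenAI documents that a shortened embedding behaves like the full vector truncated to the first `dimensions` components and then re-normalized to unit length. A minimal sketch of that truncate-and-renormalize step (the function names `normalize` and `shorten_embedding` are mine, and the input vector is made-up illustration data):

```python
import math

def normalize(vec):
    # scale a vector to unit L2 norm
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def shorten_embedding(vec, dim):
    # keep the first `dim` components, then re-normalize
    return normalize(vec[:dim])

# toy "embedding" standing in for a real 1536/3072-dim vector
full = normalize([0.3, -0.1, 0.8, 0.4, -0.2, 0.5])
short = shorten_embedding(full, 3)
```

The shortened vector still has unit norm, which is why cosine similarities remain meaningful after reduction.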

I tried this:

import tiktoken
from mteb import MTEB
from sentence_transformers import SentenceTransformer

embedding_model = "text-embedding-3-small"
embedding_encoding = "cl100k_base"
max_tokens = 8000  # the maximum for text-embedding-3-small is 8191

encoding_model = tiktoken.get_encoding(embedding_encoding)

evaluation = MTEB(tasks=["CQADupstackPhysicsRetrieval"])
results = evaluation.run(encoding_model, output_folder=f"results_openai/{encoding_model}")

I'm getting this error:

    - CQADupstackPhysicsRetrieval, beir, s2p

38316/38316 [00:00<00:00, 55555.36it/s]
ERROR:mteb.evaluation.MTEB:Error while evaluating CQADupstackPhysicsRetrieval: Encoding.encode() got an unexpected keyword argument 'batch_size'
TypeError                                 Traceback (most recent call last)
<ipython-input-78-a01182201c3e> in <cell line: 13>()
     12 evaluation = MTEB(tasks=["CQADupstackPhysicsRetrieval"])
---> 13 results = evaluation.run(encoding_model, output_folder=f"results_openai/{encoding_model}")

5 frames
/usr/local/lib/python3.10/dist-packages/mteb/abstasks/ in encode_queries(self, queries, batch_size, **kwargs)
    116                     "Queries will not be truncated. This could lead to memory issues. In that case please lower the batch_size."
    117                 )
--> 118         return self.model.encode(queries, batch_size=batch_size, **kwargs)
    120     def encode_corpus(self, corpus: List[Dict[str, str]], batch_size: int, **kwargs):

TypeError: Encoding.encode() got an unexpected keyword argument 'batch_size'
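From the traceback, MTEB calls `self.model.encode(queries, batch_size=batch_size, **kwargs)`, so it expects the object passed to `evaluation.run(...)` to expose that interface. A tiktoken `Encoding` is only a tokenizer; its `encode()` takes no `batch_size` argument, hence the `TypeError`. A minimal sketch of a wrapper with the expected interface (the class name `OpenAIEmbedder` and the `embed_fn` callable are my own; in practice `embed_fn` would call the OpenAI embeddings endpoint for a batch of texts):

```python
class OpenAIEmbedder:
    """Wrapper exposing the encode(sentences, batch_size=..., **kwargs) interface MTEB expects."""

    def __init__(self, embed_fn):
        # embed_fn: callable mapping a list of strings to a list of embedding vectors
        self.embed_fn = embed_fn

    def encode(self, sentences, batch_size=32, **kwargs):
        # embed the sentences in batches and return one vector per sentence
        embeddings = []
        for i in range(0, len(sentences), batch_size):
            embeddings.extend(self.embed_fn(sentences[i:i + batch_size]))
        return embeddings

# demo with a stub embedding function (no API calls)
def fake_embed(batch):
    return [[float(len(text))] for text in batch]

model = OpenAIEmbedder(fake_embed)
vectors = model.encode(["a", "bb", "ccc"], batch_size=2)
```

With a real embedding function plugged in, `evaluation.run(model, ...)` should get past this error, since `model.encode` now accepts `batch_size`.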