Embeddings performance difference between small vs large at 1536 dimensions?

Is there a performance difference between text-embedding-3-small @ 1536 length and text-embedding-3-large @ 1536 length?


Using kapa.ai on OpenAI Discord

Prompt

For the following text, summarize the idea, extract meaningful questions and give possible answers.

Is there a performance difference between text-embedding-3-small @ 1536 length and text-embedding-3-large @ 1536 length?

Reply

Idea Summary: The user is inquiring about the performance difference between two of OpenAI’s third-generation embedding models: text-embedding-3-small and text-embedding-3-large. Both models have an output dimension of 1536.

Extracted Question: Is there a performance difference between text-embedding-3-small and text-embedding-3-large, both having an output dimension of 1536?

Possible Answer: Yes, there is a performance difference between the two models. According to the documentation (https://platform.openai.com/docs/guides/embeddings), the text-embedding-3-large model is the most capable embedding model for both English and non-English tasks, while the text-embedding-3-small model offers increased performance over the second-generation ada embedding model. In terms of specific performance metrics, text-embedding-3-large scores 64.6% on the MTEB benchmark (https://github.com/embeddings-benchmark/mteb), while text-embedding-3-small scores 62.3%. Therefore, the text-embedding-3-large model performs slightly better than the text-embedding-3-small model.


So far, that’s not my experience. I’ve found it inferior to ADA 2 in real-world tests. YMMV.


I think the performance difference would be that if you requested 1536 dimensions from text-embedding-3-large, you would get an API error, as it is not a supported parameter.

However, 1024, which is supported, is essentially the first 1024 dimensions returned out of the full 3072. It is hard to imagine that if you truncated to 1536 manually yourself, you would get worse semantic performance than at 1024.

You can infer the quality, parameter count, and computation dedicated to the model from the price. Then compare cost plus dimensionality to figure out where it fits into the scheme of the original ada-babbage-curie GPT-3 embeddings.
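
For reference, here is a minimal sketch of that manual truncation (my own illustration, not from this thread, assuming the current `openai` Python client with `OPENAI_API_KEY` set): take the full 3072-dimension text-embedding-3-large vector, keep the first 1536 values, and re-normalize to unit length so cosine or dot-product comparisons still behave.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

resp = client.embeddings.create(
    model="text-embedding-3-large",
    input="the quick brown fox",
)
full = np.array(resp.data[0].embedding)            # 3072 dimensions by default

truncated = full[:1536]                            # keep the leading 1536 dimensions
truncated = truncated / np.linalg.norm(truncated)  # restore unit length

print(len(truncated), float(np.linalg.norm(truncated)))  # 1536, ~1.0
```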

I am able to generate 1536d vectors using large


Interesting. It appears that the dimensions API parameter is just a truncator and normalizer.

2048:

{'data': [{'embedding': [0.018731665, -0.03743733, -0.0029751628, -0.007448469, -0.0030954042], 'index': 0, 'object': 'embedding'}], 'model': 'text-embedding-3-large', 'object': 'list', 'usage': {'prompt_tokens': 6, 'total_tokens': 6}}
len: 2048

4:

{'data': [{'embedding': [0.4394683, -0.87832665, -0.06980105, -0.1747504], 'index': 0, 'object': 'embedding'}], 'model': 'text-embedding-3-large', 'object': 'list', 'usage': {'prompt_tokens': 6, 'total_tokens': 6}}
len: 4

4096:

API Error: Error code: 400 - {'error': {'message': "Invalid value for 'dimensions' = 4096. Must be less than or equal to 3072.", 'type': 'invalid_request_error', 'param': None, 'code': None}}

The blog makes a specific point of particular dimensions, but I hadn’t fuzzed the inputs to see what it could do.

By default, the length of the embedding vector will be 1536 for text-embedding-3-small or 3072 for text-embedding-3-large. You can reduce the dimensions of the embedding by passing in the dimensions parameter without the embedding losing its concept-representing properties.
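
A quick way to sanity-check the "truncate and normalize" observation (my own sketch, same client assumptions as above) is to request a reduced vector via the `dimensions` parameter and compare it against truncating and re-normalizing the full vector yourself:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()
text = "hello world"

full = np.array(
    client.embeddings.create(model="text-embedding-3-large", input=text)
    .data[0].embedding
)                                              # 3072-d default output
reduced = np.array(
    client.embeddings.create(model="text-embedding-3-large", input=text, dimensions=256)
    .data[0].embedding
)                                              # 256-d, as returned by the API

manual = full[:256]
manual = manual / np.linalg.norm(manual)       # truncate, then re-normalize

# If the parameter really is truncation + normalization, these should be near-identical.
print(np.allclose(reduced, manual, atol=1e-3))
```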


Yes, see:

Yes, and to expand on my contribution to that topic…

Here’s another thought:

If the reduced dimensions were remapped so that the dimensions most relevant for semantic discrimination were placed first, to tolerate truncation, trained by extensive trials and then sorting the output order, how would that be done?

By targeting a benchmark.

One might postulate, then, that in making a truncatable embeddings model, benchmarks such as MTEB and others might have been used to discover the dimensions with the highest applicability to known tasks.

Thus, the reduced embedding dimensions you get by parameter specification may perform better against benchmarks than in general or novel use.

The challenge, then, is coming up with “unseen” cases to qualify the different 1536-dimension outputs available from both API models, to find out whether half of 3-large takes a larger hit than a single metric shows. Find out how poor the second half is…
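
As a rough illustration of what such a probe could look like (toy texts, my own sketch, nothing authoritative), one could compare cosine similarities computed from the first 1536 dimensions of text-embedding-3-large against those from the last 1536:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()
texts = [
    "a cat sat on the mat",
    "a kitten rested on the rug",      # semantically close to the first
    "quarterly revenue grew by 8%",    # unrelated
]

resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
vecs = np.array([d.embedding for d in resp.data])   # shape (3, 3072)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for label, sl in [("first 1536", slice(0, 1536)), ("last 1536", slice(1536, 3072))]:
    part = vecs[:, sl]
    print(label,
          "| cat vs kitten:", round(cos(part[0], part[1]), 3),
          "| cat vs revenue:", round(cos(part[0], part[2]), 3))
```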


Given the MTEB scores in the blog post, which show that text-embedding-3-large @ 256 length reaches a higher score than text-embedding-3-small @ 512 length, I would say it is likely that text-embedding-3-large @ 1536 length also outperforms text-embedding-3-small @ 1536 length.


Nice observation. Thank you. I’ll use large for my 1536


But again, that’s a case where the truncation designer can specifically target the benchmark and the dimensions that perform best on it, and where, on a larger-parameter model, there are more dimensions to choose from. There’s no promise from OpenAI that MTEB score alone isn’t exactly what the lower dimensionality targets.


Thank you for this. I was thinking the same because of OpenAI’s documentation. I was bracing myself for some database changes, having created a bunch of objects with 1536 dimensions in mind. Hoping the transition will be seamless.
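
For what it’s worth, the column width itself should not need to change if it is already 1536-wide; what does have to change is the data, since vectors from text-embedding-3-small and text-embedding-3-large are not comparable across models. A hedged sketch of the re-embedding step (hypothetical helper name, same client assumptions as the earlier examples):

```python
from openai import OpenAI

client = OpenAI()

def embed_1536_large(texts):
    """Hypothetical helper: 1536-d vectors from text-embedding-3-large,
    sized to fit an existing 1536-dimension column."""
    resp = client.embeddings.create(
        model="text-embedding-3-large",
        input=texts,
        dimensions=1536,
    )
    return [d.embedding for d in resp.data]

# Every stored object has to be re-embedded with the new model; old
# text-embedding-3-small (or ada-002) vectors cannot be mixed with these.
# new_vectors = embed_1536_large(all_document_texts)
```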