I’m working on a project using embeddings (text-embedding-ada-002).
Has anyone seen a significant jump in performance using text-embedding-3-large, specifically with more dimensions (i.e. 3,072)?
I wonder if using more dimensions captures the meaning of chunks better and allows for more effective vector search. If anyone has seen a performance boost using the new large model, I’d love to know.
We’ve seen that the new models allow concepts to be orthogonal (cosine similarity near 0), which was a near impossibility with ada.
This allows you to outright reject documents without using an arbitrary cutoff.
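To make the rejection idea concrete, here is a minimal sketch in plain Python. It assumes you already have the query and document embeddings as lists of floats. The `floor` value of 0.05 and the function names are made up for illustration, not something from OpenAI; you would pick the floor from your own data.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def filter_relevant(query_vec, doc_vecs, floor=0.05):
    """Return (index, score) pairs whose similarity is above `floor`.

    `floor` is a hypothetical "basically orthogonal" margin around 0,
    not a tuned relevance cutoff.
    """
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return [(i, s) for i, s in scored if s > floor]

# Toy 3-d vectors standing in for real embeddings.
query = [1.0, 0.0, 0.0]
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
kept = filter_relevant(query, docs)
```

With the toy vectors, the second document is orthogonal to the query and gets dropped, while the other two are kept.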
Whether you need that many dimensions depends on your actual use-case.
Of course, search slows down with more dimensions, but what you can do is use the dimensions parameter to shorten your embeddings while still taking advantage of the “smarter” model.
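A rough sketch of what shortening looks like if you do it yourself on vectors you have already stored. As far as I understand the OpenAI docs, passing `dimensions` in the request gives you a shortened vector directly, and the manual route is to keep the leading components and rescale to unit length. The helper name `shorten_embedding` is hypothetical, and the API call is left as a comment so the snippet runs without a key.

```python
import math

# With the official client, the request-side version would look roughly like:
#   client.embeddings.create(model="text-embedding-3-large",
#                            input=text, dimensions=1024)
# The helper below is the do-it-yourself equivalent for stored vectors.

def shorten_embedding(vec, dims):
    """Keep the first `dims` components and rescale to unit length."""
    head = list(vec[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        return head  # nothing to rescale
    return [x / norm for x in head]

# Toy vector standing in for a 3,072-d embedding.
short = shorten_embedding([3.0, 4.0, 12.0], 2)
```

Rescaling matters because a truncated vector is no longer unit length, so dot-product search would give skewed scores without it.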
There is a substantial difference from ada in my opinion - but it doesn’t necessarily have anything to do with the dimensions per se.
Does that sorta answer your question?
Your entire question was exactly what I wanted to ask. Have you found out anything?
Thanks, yeah I think it’s a lot more nuanced than I originally thought.
It looks like there might be some gains in accuracy, and, like you said, I may still see those gains by using the new model with the same number of dimensions.
I guess it depends a lot on the specific use case, will run tests next week.
Thank you for your insights
@Diet’s response was quite comprehensive. I’ll run tests next week and update this entry with my findings :))
Does anyone know if there’s a performance difference between the two models while holding vector dimensionality constant?
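One way to check this yourself is to embed the same queries and documents with each model at the same dimensionality, then compare a retrieval metric on a small labeled set. Below is a minimal recall@k sketch in plain Python. The function names and the `relevant` labels format are assumptions for illustration, and the vectors are toy data rather than real embeddings.

```python
import math

def _cos(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def recall_at_k(query_vecs, doc_vecs, relevant, k=5):
    """Fraction of labeled relevant docs found in the top-k results.

    relevant[i] is the set of doc indices judged relevant for query i.
    """
    hits = 0
    total = 0
    for q, rel in zip(query_vecs, relevant):
        ranked = sorted(range(len(doc_vecs)),
                        key=lambda j: _cos(q, doc_vecs[j]),
                        reverse=True)
        hits += len(set(ranked[:k]) & rel)
        total += len(rel)
    return hits / total if total else 0.0

# Toy data: run this once per model with that model's vectors.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[1.0, 0.1]]
score = recall_at_k(queries, docs, relevant=[{0}], k=1)
```

Running it once with each model's embeddings on the same labels gives two numbers you can compare directly, since the dimensionality and the test set are held fixed.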