Is re-embedding our documents required if we change to the latest embedding model?

Currently, we have embeddings generated with ADA stored in our vector DB, and we are planning to update the embedding model to text-embedding-3-small (1536 dimensions).

Do we need to re-embed our entire corpus because of the change of embedding model?

We found that OpenAI uses the same tokeniser for both, but the two models, ADA and text-embedding-3-small, have different architectures.

If we create embeddings for the document chunks with one model and embeddings for the query with the other, and then compare them using cosine similarity, how will cosine similarity behave?

please assist…


ada-002 won’t be going away, so there is no need to migrate away from that model.

However, if you want to switch to a different model for other reasons, you will need to re-embed everything. The two models are not compatible, even though 3-small has the same number of dimensions (1536).
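As a rough illustration of what re-embedding everything could look like, here is a minimal sketch. It assumes your chunks are available as (id, text) pairs and that you are on the v1 `openai` Python package; `reembed` and `openai_embed` are hypothetical helper names, and the write back to your vector DB is left out because it depends on the store you use.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

# An embedding function takes a batch of texts and returns one vector per text.
EmbedFn = Callable[[List[str]], List[List[float]]]


def openai_embed(texts: List[str], model: str = "text-embedding-3-small") -> List[List[float]]:
    """Embed a batch of texts with the OpenAI embeddings endpoint."""
    # Imported lazily so the rest of the sketch runs without the package.
    # Requires the `openai` package and an OPENAI_API_KEY in the environment.
    from openai import OpenAI

    client = OpenAI()
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]


def reembed(
    chunks: Sequence[Tuple[str, str]],
    embed_fn: EmbedFn,
    batch_size: int = 100,  # assumed batch size, tune for your rate limits
) -> Iterator[Tuple[str, List[float]]]:
    """Yield (chunk_id, new_vector) for every stored chunk, batch by batch."""
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        vectors = embed_fn([text for _, text in batch])
        for (chunk_id, _), vector in zip(batch, vectors):
            yield chunk_id, vector
```

You would then call something like `for chunk_id, vector in reembed(chunks, openai_embed): ...` and upsert each vector into a new index or collection, keeping the old one until the switch is complete. Queries must be embedded with the same model as the index they search.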

The new embeddings will have completely different values and semantics, so any similarity thresholds you have been using will need to be re-tuned for the new model.
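To make the cosine similarity question concrete, here is a toy sketch. It does not use the real models: each "model" is a hypothetical stand-in, a random linear map into 1536 dimensions, which is only an assumption to show what happens when two vector spaces are unrelated. A document and a near-identical query are compared once with both embedded by the same stand-in model and once with one embedded by each.

```python
import math
import random

random.seed(0)
DIM_IN, DIM_OUT = 32, 1536


def random_model():
    """Hypothetical stand-in for an embedding model: a random linear map."""
    return [[random.gauss(0, 1) for _ in range(DIM_OUT)] for _ in range(DIM_IN)]


def embed(features, model):
    """Project input features into that model's own vector space."""
    return [sum(f * row[j] for f, row in zip(features, model)) for j in range(DIM_OUT)]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


model_a = random_model()
model_b = random_model()

# A "document" and a "query" that is almost the same content.
doc = [random.gauss(0, 1) for _ in range(DIM_IN)]
query = [x + 0.1 * random.gauss(0, 1) for x in doc]

same_space = cosine(embed(doc, model_a), embed(query, model_a))
cross_space = cosine(embed(doc, model_a), embed(query, model_b))

print(f"doc and query from the same model: {same_space:.3f}")
print(f"doc and query from mixed models:   {cross_space:.3f}")
```

In this toy setup the same-model score comes out close to 1, while the mixed score hovers around 0 even though the dimensions match. The cosine formula still returns a number, but it no longer says anything about how similar the texts are.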
