Is CLIP used when the input is an image and text? Or are Ada-3 variants used when it’s text only? Or are these internal embedding models that have yet to be publicly disclosed?