What embedding model is used by GPT-4o and o1-preview? Does this vary depending on the input modality?

Is CLIP used when the input is an image plus text? Are Ada / text-embedding-3 variants used when the input is text only? Or are these internal embedding models that have not been publicly disclosed?
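
To make the comparison concrete, here is a minimal sketch of the two publicly available options the question refers to: the text-embedding-3 family behind OpenAI's embeddings endpoint (successor to text-embedding-ada-002), and the open-source CLIP checkpoint that embeds images and text into a shared space. The model names, image path, and example strings are illustrative only; nothing here is known to be what GPT-4o or o1-preview actually use internally, which is exactly what I'm asking.

```python
# Sketch of the two publicly available "candidates" mentioned above.
# Not a claim about GPT-4o / o1-preview internals.

from openai import OpenAI
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

# 1) Text-only: the public embeddings endpoint (text-embedding-3 family).
client = OpenAI()
text_resp = client.embeddings.create(
    model="text-embedding-3-small",
    input="an example sentence",
)
text_vector = text_resp.data[0].embedding  # 1536-dim list of floats

# 2) Image + text: the open-source CLIP checkpoint, which maps both
#    modalities into one shared embedding space.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

inputs = processor(
    text=["a photo of a cat"],
    images=Image.open("photo.jpg"),  # hypothetical local image path
    return_tensors="pt",
    padding=True,
)
outputs = clip(**inputs)
text_embeds = outputs.text_embeds    # shape (1, 512)
image_embeds = outputs.image_embeds  # shape (1, 512)
```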