Image + Text Embedding (options present and future)

Is there a way to embed text and images together as a single chunk? I imagine the multimodal version of GPT-4 will eventually lead to something like this. Can you simply stack (concatenate) a text embedding with an image embedding? I want to capture the semantics of references between the text and the image. Will OpenAI offer this?
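
To make "stacking" concrete, here is a minimal sketch of what I have in mind. It assumes the text and image vectors already come from two separate embedding models; the function names (`l2_normalize`, `stack_embeddings`, `dot`) and the `text_weight` parameter are made up for illustration and are not part of any OpenAI API. Each vector is L2-normalized and then the two are concatenated with a weight on each half.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so neither modality dominates by magnitude."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def stack_embeddings(text_vec, image_vec, text_weight=0.5):
    """Concatenate a text embedding and an image embedding into one vector.

    Hypothetical helper: text_weight (between 0 and 1) sets how much the
    text half contributes to similarity. Square-root weights keep the
    stacked vector at unit length.
    """
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must be between 0 and 1")
    wt = math.sqrt(text_weight)
    wi = math.sqrt(1.0 - text_weight)
    text_part = [wt * x for x in l2_normalize(text_vec)]
    image_part = [wi * x for x in l2_normalize(image_vec)]
    return text_part + image_part


def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy vectors standing in for real model outputs (the dimensions may differ).
chunk_a = stack_embeddings([1.0, 0.0], [0.0, 2.0, 0.0])
chunk_b = stack_embeddings([1.0, 1.0], [0.0, 1.0, 0.0])
similarity = dot(chunk_a, chunk_b)
```

With this construction, the similarity between two stacked chunks works out to `text_weight * cos(text_a, text_b) + (1 - text_weight) * cos(image_a, image_b)`. There is no term that relates the text of a chunk to its own image, so my worry is that plain stacking would not capture the cross-references I care about.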

