Get embeddings for images

Hi there,

is there a way to get the embeddings for images via the API?
I would like to store them in my vector DB, but I don't want to mix embedding models. If it's not possible, I'll have to calculate my own embeddings for both text and images.

Kind regards

kindof


Hi!
Welcome to the community!
OpenAI’s text embeddings measure the relatedness of text strings.
If you want to create embeddings for images, you need to use another model. You can check Hugging Face as a reference. Here is a link to get you started:


Hi vb,

was hoping to avoid this :sweat_smile: but thanks for the HF link!

Kind regards

kindOf


You can look for image embedding models at Replicate. Here's one of them: https://replicate.com/daanelson/imagebind. We use it to detect deepfakes at kazimir.ai.

For image embeddings, I am using Titan Multimodal Embeddings Generation 1, available via API in AWS.

It's working well for me so far at classifying images: I correlate a new image against previously labeled images and determine the best-fit label for it.

You can also mix text and the image together (multimodal), but I am using it without text to get a raw image embedding.
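In case it helps anyone trying this route: a minimal sketch of calling Titan Multimodal Embeddings G1 through Bedrock with `boto3`. The request field names (`inputImage`, `inputText`, `embeddingConfig`) and the model ID follow the AWS docs as I understand them; treat them as assumptions and double-check against your region's documentation.

```python
import base64
import json


def titan_image_request(image_bytes, text=None, dim=1024):
    """Build the JSON request body for Titan Multimodal Embeddings G1.

    If `text` is given, the model mixes text and image (multimodal);
    without it you get a raw image embedding.
    """
    body = {
        "inputImage": base64.b64encode(image_bytes).decode("utf-8"),
        "embeddingConfig": {"outputEmbeddingLength": dim},
    }
    if text:
        body["inputText"] = text
    return json.dumps(body)


def embed_image(image_bytes, region="us-east-1"):
    """Invoke Bedrock (needs AWS credentials; not executed here)."""
    import boto3

    client = boto3.client("bedrock-runtime", region_name=region)
    resp = client.invoke_model(
        modelId="amazon.titan-embed-image-v1",
        body=titan_image_request(image_bytes),
        contentType="application/json",
        accept="application/json",
    )
    return json.loads(resp["body"].read())["embedding"]
```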

As it stands, there is no direct image embedding model from OpenAI. The closest you can get is to use GPT-4V to generate a text description of the image, and then embed that text. But this is too much compression for my use case, and not cheap either.
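For anyone who wants to try the caption-then-embed workaround anyway, here is a rough sketch using the `openai` Python client. The model names (`gpt-4-vision-preview`, `text-embedding-3-small`) and the prompt are assumptions; substitute whatever vision and embedding models your account has access to.

```python
import base64


def build_vision_messages(image_b64, prompt="Describe this image in detail."):
    """Pure helper: build the chat payload for a vision request."""
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }]


def caption_then_embed(image_bytes):
    """Caption an image with GPT-4V, then embed the caption.

    Requires the `openai` package and an API key; not executed here.
    """
    from openai import OpenAI

    client = OpenAI()
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    caption = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=build_vision_messages(b64),
    ).choices[0].message.content
    emb = client.embeddings.create(
        model="text-embedding-3-small", input=caption,
    )
    return caption, emb.data[0].embedding
```

Keep in mind the caveat above: everything the caption omits is invisible to retrieval afterwards.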


Do image embeddings work well alongside text embeddings? A common use case is RAG retrieval over documentation that contains screenshots.

Very often the screenshots contain critical information that is lost with text embeddings alone. How good are image embeddings when a user queries for information shown in a screenshot, say UI configurations?

GPT-4V sounds like a promising workaround, but I haven't tried its performance yet for RAG applications.


Maybe query GPT-4 Vision to describe the image in as much detail as it can, and then use that text to create an embedding of the caption for that image. Then, whenever you use RAG, if the embedding of the image pops up, substitute in the actual image and send it along with the response.


Did you get an answer for this? @kingsframe

I have the same idea. Use GPT-4 to get a description with visual, cultural, contextual, and semantic meaning (if possible) and feed that to the embeddings API.

Then do the same for query strings and match embeddings in a vector database to pull the best results.
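The matching step you describe reduces to a nearest-neighbor lookup. A minimal in-memory sketch, assuming plain cosine similarity over a small dict instead of a real vector database:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def best_match(query_emb, stored):
    """stored: dict of id -> embedding; returns (id, score) of closest item."""
    return max(
        ((key, cosine(query_emb, emb)) for key, emb in stored.items()),
        key=lambda kv: kv[1],
    )
```

In practice a vector DB does exactly this at scale (usually with approximate nearest-neighbor indexes), so the same embedding model must be used for both the stored captions and the incoming queries.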