Create Text Embeddings with Length > 77

Hi Team,

I want to create embeddings for text with length > 77 using OpenAI's CLIP. Here is a test code snippet.

import numpy as np
import pandas as pd
import torch
from transformers import (
    CLIPTextConfig,
    CLIPTextModelWithProjection,
    AutoTokenizer,
)

text_array = [
    "A quick brown fox jumps over a lazy dog.",
    "The word count is the number of words in a document or passage of text. Word counting may be needed when a text is required to stay within certain numbers of words. This may particularly be the case in academia, legal proceedings, journalism and advertising. Word count is commonly used by translators to determine the price of a translation job. Word counts may also be used to calculate measures of readability and to measure typing and reading speeds (usually in words per minute). When converting character counts to words, a measure of 5 or 6 characters to a word is generally used for English.",
]

text_df = pd.DataFrame(text_array, columns=['Text'])
# Length of the longest string, measured in characters
max_length_string = max(text_df['Text'].str.len())

PROJECTION_DIM = 512
MAX_POSITION_EMBEDDINGS = max_length_string + 1

# Enlarge the position-embedding table beyond the default 77 positions
textConfig = CLIPTextConfig.from_pretrained("openai/clip-vit-base-patch32")
textConfig.projection_dim = PROJECTION_DIM
textConfig.max_position_embeddings = MAX_POSITION_EMBEDDINGS

model = CLIPTextModelWithProjection.from_pretrained(
    pretrained_model_name_or_path="openai/clip-vit-base-patch32",
    config=textConfig,
    ignore_mismatched_sizes=True,
)

model.eval()
processor = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")

inputs = processor(text=text_array, return_tensors="pt", padding=True)

# generate embeddings
with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.text_embeds
    embeddings = embeddings.cpu().detach().numpy().astype(np.float32)

However, I am getting the warning below and want to know what it means and whether I can go ahead with this approach.

Ideally I would like this to work when sending the text snippets in batches, and I would prefer not to trim the text down to 77 if possible.

Some weights of the model checkpoint at openai/clip-vit-base-patch32 were not used when initializing CLIPTextModelWithProjection: ['vision_model.encoder.layers.6.mlp.fc1.bias', 'vision_model.encoder.layers.1.self_attn.out_proj.weight', 'vision_model.encoder.layers.2.mlp.fc1.bias', 'vision_model.encoder.layers.2.layer_norm1.weight', ..., 'vision_model.embeddings.patch_embedding.weight', 'vision_model.embeddings.position_embedding.weight', 'vision_model.embeddings.class_embedding', 'vision_model.pre_layrnorm.weight', 'vision_model.pre_layrnorm.bias', 'vision_model.post_layernorm.weight', 'vision_model.post_layernorm.bias', 'visual_projection.weight', 'logit_scale'] (the omitted entries are the remaining vision_model.encoder.* weights and biases)
- This IS expected if you are initializing CLIPTextModelWithProjection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CLIPTextModelWithProjection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of CLIPTextModelWithProjection were not initialized from the model checkpoint at openai/clip-vit-base-patch32 and are newly initialized because the shapes did not match:
- text_model.embeddings.position_ids: found shape torch.Size([1, 77]) in the checkpoint and torch.Size([1, 600]) in the model instantiated
- text_model.embeddings.position_embedding.weight: found shape torch.Size([77, 512]) in the checkpoint and torch.Size([600, 512]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Token indices sequence length is longer than the specified maximum sequence length for this model (118 > 77). Running this sequence through the model will result in indexing errors
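For reference, the 77 limit in the last warning line is counted in tokens, not characters. A quick sketch (reusing the tokenizer from the snippet above) to check the per-text token counts and what the default truncation would look like:

# The warning above reports 118 tokens for the second (long) text
token_lengths = [len(ids) for ids in processor(text=text_array)["input_ids"]]
print(token_lengths)

# The stock CLIP text encoder expects at most 77 tokens; asking the tokenizer
# to truncate avoids the indexing error, at the cost of dropping the tail
truncated_inputs = processor(text=text_array, return_tensors="pt",
                             padding=True, truncation=True, max_length=77)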

CLIP is an image-text model, not a general-purpose text embedding model.

For text embeddings you should look at ada-002 in the API.
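For example, a minimal sketch with the OpenAI Python client (the client setup and sample texts here are assumptions, not a prescribed setup):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

texts = ["A short image description.",
         "A much longer image description that would not fit into CLIP's 77-token window..."]

resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
vectors = [item.embedding for item in resp.data]  # each vector has 1536 dimensions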


Thank you! However, we wanted to convert image descriptions to embeddings and use them for our model training.

So if we use OpenAI's CLIP for image embeddings and ada-002 for text embeddings and feed both into our model, I am wondering whether it would be compatible to use two different embedding techniques for training on text and images?

Or does it make sense to stick with CLIP for both?

Different embedding engines are generally not compatible. So if there is a text embedding feature in CLIP that also relates to the image, then you need to use it to relate to the images as the CLIP model sees them.

If you just want to feed the embedding into another neural network, you can use whatever embedding model you want.

The nice thing about ada-002 is that it has an extremely high input limit of 8k tokens. But for small image descriptions you can use an open-source embedding model; the norm for the latest performant embedding models is now 512 tokens.
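For example, a minimal sketch with sentence-transformers (the model name is only an illustrative choice, not a specific recommendation):

from sentence_transformers import SentenceTransformer

# Any compact open-source text embedding model can go here
st_model = SentenceTransformer("all-MiniLM-L6-v2")

descriptions = ["A dog running across a grassy field.",
                "A longer, more detailed image description..."]

# normalize_embeddings=True returns unit-length vectors (384-dim for this model)
text_vectors = st_model.encode(descriptions, normalize_embeddings=True)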

Also consider using lower dimensions if training your own network.

I’d need more context on what exactly you are doing with the embedding to give a less general answer.

Thank you Curt!

To provide some quick context: we are converting image descriptions into text embeddings and feeding both text and image embeddings into the same neural network model. So based on your earlier response, I believe creating both embeddings with OpenAI CLIP will work better.

Having said that, our text descriptions are usually very long, which is why I resorted to adding a projection layer on my CLIP text model, as in the code above, which sets the max_position_embeddings parameter to a length of, say, around 1000.

Hence, when I ran the code, I saw the warning mentioned in my post. I wanted to understand what the warning message means and whether what I am doing is the correct approach to creating text embeddings for my model.

I have never used embeddings from the CLIP model, but is this embedding coming from a latent space inside the CLIP model?

The warning says that “You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.”

So if your pipeline is

Image → Text → Vector → Another Text

where you get Another Text by training your own neural network, then it doesn't matter much which embedding model you use; you just need Text (produced from Image using CLIP) to fit into the context window of the embedding model.

But if we are talking

Image → Text → Vector → Another Text → Another Vector → Image

You most likely need to relate Vector and Another Vector to CLIP using its embeddings, assuming there is a Vector → Image functionality in CLIP and presuming there is also a Text → Vector feature in CLIP.

Hi Curt,

My pipeline is more like:

Image1 → Embeddings1
Text1 → Embeddings2

Embeddings1 + Embeddings2 → Model1

Image1 has Text1 as its text description, but we are creating Embeddings1 and Embeddings2 separately. However, I plan to combine Embeddings1 and Embeddings2 in Model1 downstream, if that makes sense.

Please note: Image1 doesn't generate Text1, and vice versa, in our model. We just know that Text1 describes Image1, and they are independent of each other as far as the pipeline is concerned.
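For example, a minimal sketch of that pipeline with the Hugging Face CLIP classes (the file path and description are placeholders, and the concatenation at the end is just one way to combine the two embeddings):

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                     # Image1 (placeholder path)
description = "A dog running across a grassy field."  # Text1 (placeholder text)

with torch.no_grad():
    image_inputs = clip_processor(images=image, return_tensors="pt")
    embeddings1 = clip.get_image_features(**image_inputs)   # shape (1, 512)

    text_inputs = clip_processor(text=[description], return_tensors="pt",
                                 padding=True, truncation=True, max_length=77)
    embeddings2 = clip.get_text_features(**text_inputs)     # shape (1, 512)

# One way to feed both into Model1: concatenate along the feature dimension
model1_input = torch.cat([embeddings1, embeddings2], dim=-1)  # shape (1, 1024)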


OK, then if both are independent, you are basically feeding two embeddings into your model (one for the image, and one for the text representation of the image) and then training your new model with these two inputs.

So for this, I suppose use the best embedding engines you can for each.

So on the text side, while ada-002 has a high allowed input buffer of 8k tokens, it also has a narrow cone of vectors in its output, so the vectors coming out of it are going to be close together, which may make it harder for your model to distinguish things, though I'm not sure.

Also, the output dimensions of ada-002 are fairly large at 1536, so this could increase your training and inference times.

But otherwise, it looks feasible to just feed both vectors (are you just concatenating them?) into the new model for training.


Yes we are concatenating them. That is where the question of compatibility of the two embedding engines comes in.

A quick follow-up to the above comment: if we decide to use OpenAI CLIP, is the warning message expected? And would this be a correct approach to implementing the text embedding creation?

I think concatenating is fine. There is probably some smarter person that would say otherwise :rofl:

When you concatenate you don’t get a unit vector back, but I don’t think that is an issue with neural net inputs. I’m not even sure if it’s an issue if they have different scales, but I might be paranoid and make sure the numbers in the input layer have the same relative max/min in their distributions.

Otherwise the “left” neurons will have different gradients than the “right” neurons. Is this an issue? Maybe, but maybe not.

Try concatenation, especially if the scales are the same.

If the dimensions are vastly different, then this could be another issue. So if your text dimension is 1536 and your image dimension is 512, your image is underrepresented with concatenation. In this case, maybe have the 512-dim image side run through a few more hidden layers and/or grow to more neurons on that side to beef it up to be compatible with the text side.

So, I guess check the dimensions and ranges within the vectors, and balance them out by scaling or by increasing/decreasing neurons and hidden layers. Maybe it doesn't matter either! Complicated, but fun problem you have :rofl:
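For example, a minimal sketch of that balancing idea (unit-normalize each embedding, then project the 512-dim image side up to match the 1536-dim text side before concatenating; the dimensions are just the example above):

import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_DIM, IMAGE_DIM = 1536, 512

class FusionInput(nn.Module):
    """Scale both embeddings to unit length and project the image side up
    so neither modality dominates the concatenated input."""
    def __init__(self):
        super().__init__()
        self.image_proj = nn.Sequential(nn.Linear(IMAGE_DIM, TEXT_DIM), nn.ReLU())

    def forward(self, text_emb, image_emb):
        text_emb = F.normalize(text_emb, dim=-1)    # same scale on both sides
        image_emb = F.normalize(image_emb, dim=-1)
        image_emb = self.image_proj(image_emb)      # beef up the 512-dim side
        return torch.cat([text_emb, image_emb], dim=-1)  # shape (batch, 3072)

fusion = FusionInput()
combined = fusion(torch.randn(4, TEXT_DIM), torch.randn(4, IMAGE_DIM))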


Hi Curt!

Thank you for the information. I guess, in conclusion, I would like to use CLIP to generate text embeddings, so I want to check with you on my original question:

“Can I use the projection layer to bypass the max token length as I have done in the code above?”

Thank you!

Honestly, I’m not familiar with the projection layer in the CLIP model. Any idea what this is, and why do you want to bypass the max token length?

I remember you said the descriptions were too long in some cases. But why not just truncate the data before embedding; how bad does it get?

Otherwise, you can try embedding the text in multiple chunks and averaging the vectors to get an embedding representing the entire text.
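For example, a minimal sketch of that chunk-and-average approach using the stock 77-token CLIP text encoder (long_description is a placeholder):

import numpy as np
import torch
from transformers import AutoTokenizer, CLIPTextModelWithProjection

tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_model = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
text_model.eval()

long_description = "A very long image description ..."  # placeholder text

# Split the token ids into chunks that fit the 77-token window (room for BOS/EOS)
token_ids = tokenizer(long_description, add_special_tokens=False)["input_ids"]
chunk_size = 75
chunks = [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]
chunk_texts = [tokenizer.decode(chunk) for chunk in chunks]

with torch.no_grad():
    inputs = tokenizer(chunk_texts, return_tensors="pt", padding=True,
                       truncation=True, max_length=77)
    chunk_embeds = text_model(**inputs).text_embeds      # (num_chunks, 512)

# Average the per-chunk embeddings into one vector for the whole description
embedding = chunk_embeds.mean(dim=0).cpu().numpy().astype(np.float32)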

Okay, I will try that approach. Thank you!
