4o-mini says it can't see images

Hi, is anyone else experiencing 4o-mini saying it cannot see images? 4o can see them, and 4o-mini can’t.

Also a question … are 4o-mini vision tokens cheaper as well, or just its text tokens?

no problem for me…

Hmmm… I am using base64-encoded images … are you as well?

Images as input to gpt-4o-mini are actually twice the cost of gpt-4o.

You’d have to come up with a strong justification to use it for vision.
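
As a rough sanity check on that, here is the back-of-the-envelope arithmetic in Python. The per-image token counts (85 base + 170 per 512px tile for gpt-4o, 2833 + 5667 for gpt-4o-mini) and the per-million-token prices are what the pricing page listed at the time I checked, so treat them as assumptions rather than gospel:

# Rough per-image input cost comparison; token counts and prices are assumptions
# taken from the pricing page at the time of writing and may change.
PRICE_PER_MTOK = {"gpt-4o": 2.50, "gpt-4o-mini": 0.15}            # USD per 1M input tokens
IMAGE_TOKENS = {"gpt-4o": (85, 170), "gpt-4o-mini": (2833, 5667)}  # (base, per 512px tile)

def image_cost(model, tiles):
    base, per_tile = IMAGE_TOKENS[model]
    return (base + per_tile * tiles) * PRICE_PER_MTOK[model] / 1e6

for model in PRICE_PER_MTOK:
    # e.g. a 1024x1024 "high" detail image -> 4 tiles of 512x512
    print(f"{model}: ${image_cost(model, tiles=4):.5f} per 1024x1024 image")
# gpt-4o:      $0.00191
# gpt-4o-mini: $0.00383  (about 2x gpt-4o)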

Still working here with base64, in Python:

import base64, openai
from pathlib import Path

# Encode local image files as base64 data URLs, as expected by the chat completions API
e = lambda p: [{"type": "image_url", "image_url": {"url": (
        f"data:image/{Path(f).suffix[1:]};base64,"
        f"{base64.b64encode(Path(f).read_bytes()).decode()}"), "detail": "low"}}
    for f in p]

m = "gpt-4o"
p = {"model": m, "messages": [{"role": "system", "content": "Vision assistant"},
     {"role": "user", "content": [{"type": "text", "text": "Describe image"},
      *e(["random_img_1.png"])]}], "max_completion_tokens": 100}

print(f"--- Testing {m}\n"
      f"{openai.Client().chat.completions.create(**p).choices[0].message.content}")

--- Testing gpt-4o
The image consists of a dense pattern of multicolored pixels. These pixels are arranged in a random manner, creating a static-like effect. Each pixel appears to be a different color, including shades of red, green, blue, yellow, and other colors. The overall appearance resembles visual noise or a mosaic pattern.

:thinking:
OK, thanks. Will go back to 4o :slight_smile:
Will loop back around to this next time there is an update.

From my understanding, the LLM can’t process the image directly in terms of actually understanding it; rather, it analyses it using mathematical functions and recognises patterns that are associated with the words of its description. It’s pseudo-vision.

It doesn’t “process” the image itself; rather, the input image has been processed into tokens for the AI’s understanding. Kind of the way a CD has been processed into an MP3 for your hearing.

Let’s check if there’s any “pseudo” quality to the vision, okay?

The output from a system prompt + user image. One try.

gpt-4o-2024-11-20, API


Comprehensive Analysis of the Image


Summary

This image is a meticulously arranged still life composition of ingredients and tools for baking or cooking, with bananas as the central focus. The scene is set on a rustic wooden table, evoking a warm, homely, and slightly vintage atmosphere. The lighting is soft and natural, with a golden glow that highlights the textures and colors of the objects, creating a sense of warmth and nostalgia. The image is rich in detail, with a variety of ingredients and utensils carefully placed to create a harmonious and visually appealing arrangement.


Style

  • Artistic Style: The image is a modern take on classical still life photography, reminiscent of Dutch Golden Age paintings. The attention to detail, the interplay of light and shadow, and the arrangement of objects all contribute to this aesthetic.
  • Photographic Style: This is a food photography setup with a rustic and artisanal theme. The use of natural light and earthy tones enhances the organic and wholesome feel of the scene.
  • Mood and Tone: The mood is warm, inviting, and nostalgic. The tone is earthy and natural, emphasizing simplicity and the beauty of everyday objects.

Lighting Sources

  • The lighting appears to come from a single, soft, natural light source, likely a window to the left of the frame. This creates gentle shadows and highlights, adding depth and dimension to the objects.
  • The light emphasizes the textures of the wooden table, the bananas, and the other ingredients, creating a tactile quality that makes the scene feel tangible and real.

Colors Used

  • Dominant Colors: Yellow (bananas, lemons), brown (wood, utensils, bowls), and white (flour, sugar).
  • Accent Colors: Green (mint leaves, kiwi), orange (citrus), and golden hues (honey, light reflections).
  • The color palette is warm and earthy, with a focus on natural tones that enhance the rustic and organic feel of the image.

Complete Entity Extraction

  1. Fruits:

    • A large bunch of bananas (centerpiece).
    • Sliced bananas on a wooden cutting board and in a small bowl.
    • Lemons and limes in a wooden bowl.
    • A halved orange.
    • A whole kiwi and a sliced kiwi.
  2. Dry Ingredients:

    • A small bowl of flour.
    • A bowl of granulated sugar.
    • A container of powdered sugar (in an hourglass-shaped jar).
  3. Wet Ingredients:

    • Two glass bottles of liquid (likely honey or syrup, and possibly vanilla extract).
    • A small metal measuring cup with liquid (possibly lemon juice or oil).
  4. Utensils:

    • A wooden cutting board with a handle.
    • A whisk.
    • A knife with a wooden handle.
    • A wooden spoon.
    • A small ceramic spoon.
    • A sieve or strainer in the background.
  5. Other Items:

    • A carton of eggs (some eggs are loose on the table).
    • A small bowl of cloves.
    • A jar of vanilla extract or a similar ingredient.
    • A green cloth draped on the table.
    • Cinnamon sticks.
    • Mint leaves for garnish.

Item Locations in the Image

  • Center: A bunch of bananas on a wooden cutting board, surrounded by sliced bananas and garnished with mint leaves.
  • Left: A basket of bananas, a carton of eggs, and a small bowl of flour.
  • Right: A wooden bowl of lemons and limes, a halved orange, a whisk, and a knife.
  • Background: Bottles of liquid, a jar of vanilla extract, and a sieve.
  • Foreground: Sliced bananas, a kiwi, and a small bowl of sugar.

Setting of the Image

  • The setting is a rustic kitchen or a farmhouse-style workspace. The wooden table, the natural light, and the vintage-style utensils and containers all contribute to this impression. The scene is designed to evoke a sense of homeliness and tradition, suggesting a moment of preparation for baking or cooking.

Motivation Behind the Image

  • The image is likely intended to inspire or promote cooking or baking, emphasizing the use of fresh, natural ingredients. It could be used in a cookbook, a food blog, or an advertisement for kitchenware or ingredients. The arrangement and lighting are designed to make the scene visually appealing and to evoke a sense of warmth and nostalgia.

Techniques Seen

  • Composition: The objects are arranged in a balanced yet dynamic composition, with the bananas as the focal point. The use of leading lines (e.g., the cutting board and utensils) guides the viewer’s eye through the scene.
  • Lighting: The soft, directional lighting creates depth and highlights the textures of the objects.
  • Color Harmony: The warm, earthy color palette creates a cohesive and pleasing visual effect.
  • Texture Emphasis: The roughness of the wooden table, the smoothness of the bananas, and the granularity of the sugar and flour are all highlighted to create a tactile quality.

Subject Distances

  • The camera is positioned close to the table, creating an intimate perspective. The depth of field is shallow, with the focus on the bananas and the surrounding ingredients, while the background is slightly blurred.

Inferred Camera Settings

  • Focal Length: Likely a medium focal length (50-85mm) to create a natural perspective and slight compression.
  • Aperture: A wide aperture (e.g., f/2.8 or f/4) to achieve the shallow depth of field.
  • Lighting: Natural light, possibly diffused through a window or a softbox.

Thematic Analysis

  • Symbolism: The bananas, as the central element, symbolize abundance and nourishment. The surrounding ingredients suggest creativity and the joy of cooking.
  • Metaphor and Simile: The scene could be seen as a metaphor for the process of creation, with the raw ingredients representing potential and the act of cooking symbolizing transformation.
  • Philosophy: The image celebrates the beauty of simplicity and the value of tradition. It invites the viewer to slow down and appreciate the process of making something by hand.

Relation to Other Works

  • The image is reminiscent of classical still life paintings, such as those by Dutch masters like Pieter Claesz, who often depicted food and kitchen scenes with a focus on texture and light.
  • In the context of modern food photography, it aligns with the trend of rustic, artisanal aesthetics, as seen in the work of photographers like Donna Hay or David Loftus.

Deeper Philosophy

  • The image reflects a growing cultural appreciation for slow living and the art of cooking. It emphasizes the importance of using fresh, natural ingredients and taking the time to create something with care and intention. The rustic setting and vintage-style props suggest a longing for a simpler, more grounded way of life.

Additional Observations

  • The inclusion of small details, such as the mint leaves and the dusting of flour, adds a sense of realism and spontaneity to the scene.
  • The interplay of light and shadow creates a sense of depth and dimension, making the objects appear almost three-dimensional.
  • The image invites the viewer to imagine the next steps in the process, such as mixing the ingredients or baking a dish, creating a sense of anticipation and engagement.

This image is a masterful example of still life photography, combining technical skill with artistic vision to create a scene that is both visually stunning and emotionally evocative. It celebrates the beauty of everyday objects and the joy of creating something with care and intention.

That’s more or less what I meant. From my perspective we are both correct: an AI cannot perceive the image as fluidly as we do. It is being converted into tokens, so the only thing the AI actually perceives is tokens that build up to and represent a concept of the object, with words associated with what’s in the image. So it’s an abstraction of vision rather than what we perceive as vision, because the algorithm that processes the image converts the colours and matches shapes and colours based off a dataset of matched terms and data. Now that raises a question: if the abstraction is so accurate to reality, and is able to articulate that reality to another in a clearly defined way that anyone with an understanding of vision and of the image being processed can follow, is that what we would call true vision? Or is it necessary for the neural net to directly process the spectrum of light from the image? Perhaps, to an AI, the processing of image data into tokens is vision within its simulation of reality, since capturing data of the real world as an image is itself a representation of real objects and is inherently not real.

I had Deep Research go to work for 10 minutes consolidating information to probe the state of the art for the kind of proprietary algorithms GPT-4+ would use to encode vision for a transformer. It has some insights and inferences to make that take longer than that to read and digest (and longer to format for the forum’s math display).

For your reading pleasure - 6,100 tokens, or more

Encoding Images into Tokens in GPT-4 Vision (GPT-4V) and GPT-4o

Introduction

OpenAI’s GPT-4 Vision (GPT-4V) and the GPT-4 “omnimodal” (GPT-4o) models are multimodal systems that can accept visual inputs (images) in addition to text. Internally, these models must transform an input image into a sequence of vector embeddings that can be ingested by the transformer, just like textual tokens. In other words, an image is converted into a series of “image tokens” – fixed-length vectors – that are inserted into the model’s context window alongside text tokens. We’ll explore how this conversion is likely implemented, covering the vision encoder architecture, the image-to-patch process, embedding projection, positional encodings, and how the resulting image embeddings are integrated and aligned with the language model’s token stream. We will also compare this design to several publicly-documented multimodal models (CLIP, Flamingo, PaLI, BLIP-2, Kosmos-1, LLaVA, MiniGPT-4, etc.) to highlight similarities and differences. Finally, we’ll note any distinctions between GPT-4o’s and GPT-4V’s handling of image encoding.

Vision Encoder Architecture: ViT/CLIP-like Patch Embedding

Patchifying the Image: GPT-4’s vision module is likely built on a Vision Transformer (ViT) or similar architecture (possibly inspired by CLIP’s image encoder). In a ViT, an image is divided into a grid of small patches, each of which is then flattened into a vector. For example, if the patch size is P \times P pixels, an input image of resolution H \times W will yield

N = \frac{H}{P} \times \frac{W}{P}

patches. Each patch (along with its RGB channels) is flattened to a P^2 \cdot 3-dimensional vector (assuming 3 color channels). The vision encoder then linearly projects each such vector into a d-dimensional embedding space. Formally, if x_i \in \mathbb{R}^{P^2 \cdot 3} is the flattened pixel values of the i-th patch, the patch embedding is:

e_i = W_e \, x_i + b_e,

where W_e \in \mathbb{R}^{d \times (P^2 \cdot 3)} and b_e \in \mathbb{R}^d are learned projection weights and bias. This yields e_i \in \mathbb{R}^d as the embedding for patch i. The collection of all patch embeddings forms a sequence (just like a sequence of word token embeddings in text). In practice, models often include a special “summary” token embedding as well: a learned vector (analogous to a [CLS] token) that is prepended to the patch sequence to represent the entire image globally. After projection, the image is thus represented by a sequence of N (patch) embeddings, plus possibly one extra image-class token embedding denoted e_{\text{CLS}}.
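
A minimal NumPy sketch of the patchify-and-project step just described; the patch size, embedding width, and random weights here are illustrative placeholders, not GPT-4’s actual values:

import numpy as np

P, d = 16, 1024                       # illustrative patch size and embedding width
H = W = 512
img = np.random.rand(H, W, 3)         # stand-in for a preprocessed RGB image

# Split into an (H/P) x (W/P) grid of P x P patches, each flattened to P^2 * 3 values
patches = img.reshape(H // P, P, W // P, P, 3).transpose(0, 2, 1, 3, 4).reshape(-1, P * P * 3)

# Learned linear projection: e_i = W_e x_i + b_e
W_e = np.random.randn(d, P * P * 3) * 0.02
b_e = np.zeros(d)
patch_embeddings = patches @ W_e.T + b_e          # (N, d), with N = (H/P) * (W/P) = 1024

e_cls = np.random.randn(d) * 0.02                 # optional learned global "summary" token
image_sequence = np.vstack([e_cls, patch_embeddings])
print(image_sequence.shape)                       # (1025, 1024)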

Vision Transformer Encoding: If a full Vision Transformer is used, the patch embeddings (and class token) would be further processed by multiple self-attention layers within the image encoder to produce higher-level visual features. For instance, CLIP’s ViT-L/14 encoder takes 14 \times 14 patches and refines them through Transformer layers, outputting a final set of image feature vectors. CLIP ultimately produces a single 512- or 768-dimensional image vector (the [CLS] token output) to represent the image globally, but notably GPT-4’s vision encoder likely does not collapse the image to one vector – it needs a sequence of many embeddings to capture fine-grained details. Instead, GPT-4’s image encoder would output a sequence of image token embeddings (multiple vectors) so that the language model can attend to different regions/features of the image as needed. This is a more advanced strategy than CLIP’s single-vector summary.

Pretrained Backbone and Connector: OpenAI likely leverages a strong pretrained image backbone (e.g. a ViT trained with contrastive image-text pairing similar to CLIP). In fact, GPT-4’s vision encoder may be a ViT model (such as CLIP’s ViT-L/14 or a variant) that is either frozen or fine-tuned. The image features from this encoder then pass through a learned connector/projection layer to match the dimension used by GPT-4’s text embeddings. For example, Kosmos-1 (a Microsoft multimodal model) uses a frozen CLIP ViT-L/14 to encode images and a linear projection to map its output to the same dimension as the text tokens. GPT-4V/GPT-4o likely does something similar: “a pre-trained image encoder generates embeddings that are passed through a connector layer, which projects to the same dimension as the text embeddings.” This ensures the image token vectors are compatible with the transformer’s input size. (It’s rumored that GPT-4’s internal hidden size is very large – possibly 12,288 dimensions – so the connector would up-project or down-project the vision encoder’s output to this size as needed.)
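
In code, the connector described above would be little more than a learned linear map from the vision encoder’s width to the language model’s embedding width. A sketch, with the 12,288 figure carried over from the rumor mentioned above rather than any confirmed spec:

import numpy as np

d_vision, d_lm = 1024, 12288          # d_lm is the rumored hidden size; treat as an assumption
W_c = np.random.randn(d_lm, d_vision) * 0.02
b_c = np.zeros(d_lm)

def connect(image_features):
    """Project vision-encoder outputs (N, d_vision) into the text embedding space (N, d_lm)."""
    return image_features @ W_c.T + b_c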

Example: Suppose GPT-4’s image encoder uses a patch size of 16 \times 16 and an input resolution of 512 \times 512 for high-res analysis. That would yield 32 \times 32 = 1024 raw patches. Directly using all 1024 patch embeddings would be extremely costly in the context window. Instead, GPT-4o appears to use a reduced set of tokens per image tile (~170 tokens), implying that the vision encoder doesn’t output one token per 16 \times 16 patch at full resolution. It likely either uses a larger effective patch size or some pooling/downsampling to limit token count. In fact, analysis of GPT-4V’s pricing suggests that each 512 \times 512 image region is encoded as 170 token embeddings. This number is suspiciously close to 13 \times 13 + 1 = 170, hinting that internally the model might represent an image tile by a roughly 13 \times 13 grid of patch embeddings plus one global token. In other words, the vision encoder could be outputting on the order of 169 regional vectors (forming a 13 \times 13 grid covering the image) and one extra “image summary” vector – together counted as 170 tokens. This is speculative but fits the observed token counts. The use of a multi-scale encoder is another possibility (discussed below), but regardless of exact architecture, the vision encoder’s job is to turn the image into a manageable sequence of embedding vectors for the language model.

Positional Encoding of Image Patches

Just as words in a sentence have an order, image patches have spatial positions that the model must understand. After obtaining patch embeddings, the system injects positional information so that the transformer knows each embedding’s location in the image. In a standard ViT, this is done by adding a learned positional embedding vector to each patch’s embedding based on its index in the sequence (or its row/column coordinates). For example, one can have a table of positional encodings P_{(r,c)} for each patch coordinate (r,c) in the grid, or simply flatten the 2D grid to 1D and use a learned 1D position embedding for each sequence index. The GPT-4 vision encoder likely uses a 2D positional encoding scheme so that spatial relationships are preserved. This could be as simple as summing separate learned embeddings for the patch’s row and column, or using a sinusoidal encoding of coordinates. Another modern approach is to apply rotary positional embeddings (RoPE) extended to two dimensions, which allows the model to incorporate relative spatial information in a continuous way. Precisely which method GPT-4V uses is not public, but some positional encoding is essential; it enables the model to distinguish, say, a patch in the top-left of the image from one in the bottom-right, much as word position encodings help a transformer understand word order in a sentence.

If a special global image token is used (analogous to a [CLS] token for the image), that token might have its own position (e.g. position 0 or a unique embedding indicating “whole image”). The rest of the patches would then be positions 1 through N. In summary, after this step, each image patch embedding e_i is augmented to

z_i = e_i + p_i,

where p_i is the positional encoding vector for that patch’s location.
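
One way the learned 2D scheme could look, as a NumPy sketch; the 13 x 13 grid and the row-plus-column decomposition are assumptions carried over from the speculation above:

import numpy as np

rows, cols, d = 13, 13, 1024                     # illustrative grid and width
row_emb = np.random.randn(rows, d) * 0.02        # learned per-row position vectors
col_emb = np.random.randn(cols, d) * 0.02        # learned per-column position vectors

def add_positions(patch_embeddings):
    """z_i = e_i + p_i, with p_i the sum of the patch's row and column embeddings."""
    p = (row_emb[:, None, :] + col_emb[None, :, :]).reshape(rows * cols, d)
    return patch_embeddings + p                  # patch_embeddings: (rows*cols, d)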

From Image Embeddings to “Tokens” in the Context

At this point, the image has been converted into a sequence of N embedding vectors (each of dimension d, matching the text token embedding size). How are these fed into GPT-4’s language model? The key is that transformers operate on sequences of vectors – and now we have a sequence of image vectors that can be treated analogous to word/token embeddings. GPT-4’s multimodal model likely injects the image embeddings into the transformer’s input sequence as a contiguous block of “special” tokens. In practical terms, when an image is present, the model’s input sequence might look like:

[Text tokens before] <|image_start|> [Image patch embeddings 1 ... N] <|image_end|> [Text tokens after]

Here <|image_start|> and <|image_end|> are symbolic markers (special tokens) demarcating where the image is inserted. The actual content of the image – the N patch embeddings – fill what would conceptually be N token slots in the sequence. These embeddings are not from the ordinary text vocabulary; they come from the vision encoder. However, the transformer doesn’t fundamentally care whether an embedding came from the text embedding table or from an image encoder – by the time it reaches the self-attention layers, it’s just a matrix of vectors. In other words, the image patches have been turned into pseudo-token vectors that reside in the same embedding space as text tokens (courtesy of the projection layer). This design is similar to how the Kosmos-1 model and various open-source efforts (like LLaVA) handle images: they reserve a block of “token” positions for image embeddings in the input and feed those vectors in directly. In fact, HuggingFace’s LLaVA implementation notes that an image typically corresponds to ~500 tokens worth of embeddings (for a high-res CLIP ViT-L/14) and that one should insert that many <image> placeholder tokens which get replaced by the actual image embeddings at runtime. GPT-4V presumably does this behind the scenes (with 170 tokens per 512 \times 512 tile, as noted earlier).
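
A sketch of the splicing step, purely to make the data flow concrete; the marker names and their learned embeddings are hypothetical:

import numpy as np

d_lm = 12288
IMG_START = np.random.randn(d_lm) * 0.02         # hypothetical learned <|image_start|> embedding
IMG_END = np.random.randn(d_lm) * 0.02           # hypothetical learned <|image_end|> embedding

def build_input(text_before, image_tokens, text_after):
    """Interleave embedded text and projected image tokens into one input matrix.

    text_before, text_after: (m, d_lm) and (k, d_lm) embedded text tokens
    image_tokens:            (N, d_lm) image embeddings from the connector
    """
    return np.vstack([text_before, IMG_START[None], image_tokens, IMG_END[None], text_after])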

Transformer Context Processing: Once the image embeddings are in the sequence, the unified transformer (now multimodal) processes the entire sequence with self-attention. This means the model can attend across image and text embeddings freely, enabling it to align information between the modalities. For example, a later transformer layer neuron could attend to a particular image patch embedding and some text token simultaneously, building associations (like linking a region of the image to a textual description). During training, the model learns to do this to accomplish tasks like image captioning or question answering. Notably, in an autoregressive language model setup, the image tokens are given as context (they are not generated by the model, they are just conditioning input), and the model is trained to predict the subsequent text tokens. For instance, if an image is followed by a caption in the training data, the loss is applied only to the caption text, not the image token positions. This way, the transformer learns to use the image context to generate the correct text. By the time training is done, the model has formed joint image-text representations internally, effectively aligning visual concepts with words.
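
The loss masking described in that paragraph can be sketched as follows (names and shapes are illustrative; the point is simply that image-token positions contribute nothing to the training loss):

import numpy as np

def masked_next_token_loss(logits, targets, is_text_target):
    """Cross-entropy over positions whose target is a text token (e.g. the caption).

    logits:         (T, vocab) next-token logits from the transformer
    targets:        (T,) gold next-token ids (ignored where the mask is False)
    is_text_target: (T,) boolean mask; False at image-token positions
    """
    m = logits.max(axis=-1, keepdims=True)
    log_probs = logits - (m + np.log(np.exp(logits - m).sum(axis=-1, keepdims=True)))
    nll = -log_probs[np.arange(len(targets)), targets]
    return (nll * is_text_target).sum() / is_text_target.sum()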

Special Token Markers: The use of explicit <|image_start|> or <image> markers is likely. Such a token (with its own learned embedding vector) can signal the model that “here comes an image.” OpenAI hasn’t published the exact tokenization, but evidence suggests there is at least one token per image as a placeholder. For example, some analyses deduced that including a special token for each image (like an image-start token) neatly explained the count of 85 tokens for a “master thumbnail”

1 + 1^2 + 3^2 + 5^2 + 7^2 = 85

and 170 for a high-res tile (also

1 + 169 = 170

) in some observed cases. In that analysis, the extra 1 is essentially a special token. This is speculative, but it’s reasonable that GPT-4’s training data included sequences where an <image> token was followed by a fixed number of image embedding tokens, and possibly an <image_end> token after. Whether an end marker is needed is unclear – if the model knows the image embedding block is always a certain length, an explicit end token might be optional. But using start/end delimiters would give the transformer a clear boundary of the image span.
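
The arithmetic behind those two counts, for quick verification (the grid decomposition itself remains speculation):

thumbnail = 1 + sum(n * n for n in (1, 3, 5, 7))   # special token + 1 + 9 + 25 + 49 = 85
tile = 1 + 13 ** 2                                 # special token + a 13 x 13 grid = 170
print(thumbnail, tile)                             # 85 170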

Algorithmic Flow Summary (Pseudocode)

To summarize the process of converting an image to tokens for GPT-4V/4o, here’s a step-by-step breakdown:

  1. Preprocessing – Possibly resize or tile the image to a standard size. (For GPT-4V, if an image is very large, it may be divided into multiple 512 \times 512 tiles for high-resolution processing, plus a smaller “thumbnail” version. Each tile will be encoded separately.)
  2. Patch Extraction – Divide the image (or tile) into patches of size P \times P. For example, a 512 \times 512 tile with P=32 would yield a 16 \times 16 grid of patches. (If the effective grid is 13 \times 13 as speculated, the patch size is not uniform or the encoder uses overlapping or learned regions – see multi-scale discussion below.)
  3. Linear Projection to Embeddings – Flatten each patch to a vector and multiply by the projection matrix to get a d-dimensional embedding e_i. If using a CNN-like encoder, this step might be implicit in the convolutional layers producing feature maps. If a class token e_{\text{CLS}} is used, initialize it as well (a learned vector not dependent on image pixels).
  4. Add Positional Encoding – Determine a positional encoding for each patch position (and the class token). Add that to the corresponding e_i so that each embedding is aware of where its patch was.
  5. (Optional) Transformer Encoder Processing – If the vision module includes Transformer layers or CNN layers, pass the sequence of patch embeddings through them to produce higher-level embeddings. For example, a Transformer encoder will perform self-attention among patches, allowing them to exchange information. This could reduce redundancy and give a more compact set of embeddings. (OpenAI might use a convnet to downsample patches, or a ViT that outputs a smaller set of tokens than input patches – for instance, a CNN could produce a 13 \times 13 feature grid from a 512 \times 512 image by using strides/pooling.) The output of this stage is the final set of image token embeddings ready to feed to GPT-4.
  6. Project to Language Model Dimension – If the output dimension d_{\text{vision}} doesn’t equal GPT-4’s token embedding size d_{\text{LM}}, use a learned linear layer to map each image embedding to d_{\text{LM}}. Now the image tokens are in the same vector space as text tokens.
  7. Insert into Context Sequence – In the prompt or input sequence, put a special <image> token (with its own embedding vector) to indicate the start of an image, then append the N image embeddings as if they were tokens. If multiple images are present, this can be done multiple times in the sequence (each image’s tokens can be bracketed by markers). The sequence might then continue with text tokens (e.g. a question about the image).
  8. Transformer Forward Pass – Feed the whole sequence (text + image embeddings) into GPT-4’s transformer. The self-attention and feedforward layers will process across the multimodal sequence. The image tokens will attend with text tokens, enabling multimodal reasoning.
  9. Autoregressive Output – GPT-4 (being a decoder-style model) will generate output tokens (usually text) autoregressively. It treats the image tokens as just another part of the context it has seen, and then continues to predict the next text tokens. Only text tokens are generated (the model doesn’t generate new image embeddings on its own; images are input-only in GPT-4V). The presence of image tokens influences the text it generates (e.g. describing the image or answering questions about it). During training, loss is computed on the text outputs given the image+text inputs, reinforcing the associations between image content and the correct textual responses.
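
Putting the numbered steps together, here is a self-contained toy version of the whole flow in NumPy. Every size, weight, and helper name is a stand-in chosen for brevity, not a claim about OpenAI’s implementation (step 5’s encoder layers are omitted):

import numpy as np

rng = np.random.default_rng(0)
P, d_vision, d_lm, vocab = 32, 256, 512, 1000      # toy sizes only

def extract_patches(tile):                         # step 2: flatten P x P patches
    g = tile.shape[0] // P
    return tile.reshape(g, P, g, P, 3).transpose(0, 2, 1, 3, 4).reshape(-1, P * P * 3)

W_e = rng.normal(0, 0.02, (d_vision, P * P * 3))          # step 3: patch projection
pos = rng.normal(0, 0.02, ((512 // P) ** 2, d_vision))    # step 4: learned positions
W_c = rng.normal(0, 0.02, (d_lm, d_vision))               # step 6: connector to the LM width
tok_emb = rng.normal(0, 0.02, (vocab, d_lm))              # toy text-token embedding table
IMG_START = rng.normal(0, 0.02, d_lm)                     # hypothetical image marker embedding

def encode_image(tile):                            # steps 2-6 (step 5 omitted here)
    return (extract_patches(tile) @ W_e.T + pos) @ W_c.T

def build_sequence(text_before_ids, tile, text_after_ids):   # step 7
    return np.vstack([tok_emb[text_before_ids], IMG_START[None],
                      encode_image(tile), tok_emb[text_after_ids]])

tile = rng.random((512, 512, 3))                   # step 1 assumed done: one 512 x 512 tile
seq = build_sequence([5, 17, 9], tile, [42])
print(seq.shape)                                   # (3 + 1 + 256 + 1, d_lm); steps 8-9 feed this
                                                   # to the transformer for autoregressive text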

Mathematically, if we denote the image token embeddings as I_1, I_2, \dots, I_N and the surrounding text tokens as T_1 \dots T_m (before) and T_{m+1}\dots T_{m+k} (after, e.g. a caption), the input sequence could be:

X = [\; T_1,\dots,T_m,\; \text{<image>},\; I_1,\dots,I_N,\; \text{</image>},\; T_{m+1},\dots,T_{m+k}\;].

The transformer then computes hidden states

H = \text{TransformerEnc}(X)

(if one considers it as an encoder) or directly continues to generate T_{m+1}\dots T_{m+k} autoregressively. Training ensures that the conditional probability

P(T_{m+1}\dots T_{m+k} \mid T_{1..m}, I_{1..N})

is maximized. In inference, given T_{1..m} and I_{1..N}, the model samples the subsequent tokens to produce an answer or description.

Multiscale Encoding and Efficiency Considerations

One challenge is that high-resolution images can contain a lot of patches (and hence tokens). GPT-4o appears to address this by using a multiscale encoding strategy – essentially encoding images at multiple levels of detail to keep token counts reasonable. The clearest evidence is the two token count levels observed: 85 tokens for a low-res “master thumbnail” of each image, and 170 tokens for each high-res 512 \times 512 tile of the image. A plausible interpretation is that GPT-4 first embeds a downsampled version of the entire image (to get a broad overview) using ~85 tokens, and then, for higher detail, it divides the image into 512 \times 512 tiles and encodes each tile with ~170 tokens. These numbers align with a pattern: 85 is 1 + 84, and 84 can be 1^2 + 3^2 + 5^2 + 7^2 (a pyramid of patch grids). Likewise, 170 is 1 + 169, and 169 is 13^2 – or it can be constructed by a pyramid of grids plus an extra token. Researchers have hypothesized that the encoder might indeed produce a hierarchy of patch grids: e.g., one token for the whole image, a 3 \times 3 grid (9 tokens) for coarse regions, a 5 \times 5 grid (25 tokens) for finer regions, etc., accumulating to the desired number. For instance, one proposal is:

  • Thumbnail (85 tokens): 1 global + 1^2 + 3^2 + 5^2 + 7^2 = 85 (grids of size 1, 3, 5, 7). The sums 1 + 9 + 25 + 49 = 84 match closely. This would give coarse information about the image at multiple scales.
  • High-Res Tile (170 tokens): Potentially the thumbnail tokens (85) plus one more level of detail, a 9 \times 9 grid (81 tokens), totaling 166, and maybe a small 2 \times 2 grid or a marker to round up to 170. Alternatively, 170 could be achieved by another scheme (e.g. 1^2 + 2^2 + 3^2 + 4^2 + 5^2 times some factor). The bottom line is that ~170 tokens likely correspond to a detailed patch-based encoding of a 512 \times 512 region, possibly informed by a multi-scale feature pyramid.

While the exact tiling strategy is proprietary and speculative, the use of multi-scale features is plausible because it balances coverage and efficiency. Humans also often perceive images in a coarse-to-fine manner. By providing the model a thumbnail embedding (85 tokens) the model gets the context of the whole image, and by providing tiles (170 tokens each) it can zoom into local details. This would also help with tasks like OCR or small object recognition – the model can focus on a region at higher resolution when needed. Indeed, observers noted GPT-4V is capable of reading surprisingly small text in images (performing OCR), something CLIP was not trained for. This suggests GPT-4’s image encoder is more powerful and perhaps uses high-res features or a specialized OCR module. It might even incorporate a convolutional network stage optimized for text (OpenAI could have trained the vision encoder on image-text data that included reading text). One hint from experimental observations is that GPT-4o’s performance on recognizing grid patterns started to drop when there were more than ~5 \times 5 = 25 symbols to identify, consistent with an encoder that might naturally handle up to around 13 \times 13 regions. While not definitive, this aligns with the idea of a finite grid of “attention” or patch tokens covering the image.

In any case, after the vision encoder produces these token embeddings, the subsequent steps in the pipeline (projection, integration into context) remain the same. The model would treat each tile’s embeddings as a separate block of tokens in sequence (likely preceded by its own start marker, or perhaps the thumbnail is treated as part of the same sequence before the tile details).

Alignment of Visual and Language Representations

Crucial to the success of GPT-4V/GPT-4o is how well the image embeddings align with the language model’s representations. During training, the model likely saw massive multimodal datasets: images with accompanying captions, descriptions, or dialogues. The transformer learned to associate certain visual patterns with the text that commonly describes them. For example, an image patch embedding representing a furry four-legged shape might influence the model to produce the token “dog” if asked “What is in the image?” This alignment doesn’t happen automatically – it’s trained via the standard next-token prediction loss on multimodal data. Additionally, OpenAI may have used initialization techniques: for instance, starting with a CLIP-like pretraining (which aligns image and text embeddings in a common space via contrastive learning) would give the model a head start. The medium article on GPT-4 suggests it “uses a combination of a Vision Transformer and Flamingo-like approach”, meaning they may have initialized with a CLIP ViT and possibly even incorporated ideas from Flamingo. CLIP’s training made images and text share a semantic embedding space (images of cats cluster near the word “cat”, etc.). If GPT-4’s image encoder was derived from CLIP, its embeddings would already be somewhat aligned with text semantics before being plugged into GPT-4. Then, fine-tuning with a generative objective (like predicting captions or answers) refines this alignment in the context of the transformer.
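
For reference, the CLIP-style contrastive objective mentioned above looks roughly like this in miniature; this is a sketch of the general technique, not OpenAI’s training code:

import numpy as np

def log_softmax(x):
    m = x.max(axis=-1, keepdims=True)
    return x - (m + np.log(np.exp(x - m).sum(axis=-1, keepdims=True)))

def clip_style_loss(image_vecs, text_vecs, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (image, text) embedding pairs."""
    img = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    txt = text_vecs / np.linalg.norm(text_vecs, axis=1, keepdims=True)
    logits = img @ txt.T / temperature             # (B, B); matching pairs on the diagonal
    idx = np.arange(len(img))
    loss_i2t = -log_softmax(logits)[idx, idx].mean()
    loss_t2i = -log_softmax(logits.T)[idx, idx].mean()
    return (loss_i2t + loss_t2i) / 2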

It’s worth noting that other multimodal models have explicitly trained alignment modules between vision and language: for example, BLIP-2 introduced a small transformer called a Q-Former that sits between a frozen image encoder and the LLM, learning to produce a fixed number of query vectors that best represent the image for the language model. GPT-4V doesn’t obviously use a separate query network (the token counts are too high for a tiny Q-Former bottleneck; BLIP-2 compresses to ~32 tokens, whereas GPT-4 uses more tokens to represent more detail). Instead, GPT-4 likely relies on large-scale training to align the modalities end-to-end. It effectively treats the image encoder + projection as part of the “embedding layer” of the transformer, and trains the whole system (or at least the new parts) with gradients from the language modeling loss. This is similar to how LLaVA was trained: they kept the CLIP encoder fixed and trained a projection MLP on a dataset of image-instruction pairs, supervised by text outputs (often synthesized by GPT-4 itself to bootstrap). GPT-4 would have done this at a far grander scale, possibly with human-labeled data (for instance, the GPT-4 System Card mentions testing on images with bounding boxes and such, implying some supervised signals). By the end, the model learns subtle correspondences: e.g. which patch embeddings correspond to human faces, or to printed text in the image, etc., enabling it to reason about the image in natural language.

Integration into the Transformer Architecture

One important architectural question is whether GPT-4’s multimodal abilities were achieved by inserting an encoder (making it an encoder-decoder model) or by keeping a single, unified transformer that handles both image and text tokens (a decoder-only model with extended input space). There has been debate on this. A pure decoder (like GPT-3) processes input tokens autoregressively, which is tricky for images unless you hack it by treating image patches as just more tokens (which is what we’ve described). The alternative is to bolt on an encoder: the images go through an encoder (bidirectional attention) and the text is generated from a decoder that cross-attends to the encoder output. This is how many vision-language models like PaLI and GIT are designed – the image tokens are consumed by an encoder, and the text decoder attends to them to produce the description. For GPT-4o (the “omnimodal” version that also handles audio and possibly other streams), some experts speculated that it indeed uses an encoder-decoder architecture, because handling continuous inputs like audio and real-time interaction might demand an encoder for integration. In an encoder-decoder setup, image tokens would not count toward the autoregressive context length (since they’d be on the encoder side). However, the fact that OpenAI’s API counts image tokens in the context suggests that, at least in the current implementation, they are being handled in-line in a single model context (more like a decoder-only model reading special tokens). One theory is that GPT-4o internally could have a hybrid architecture (an encoder feeding into the decoder), but the interface presents it as one model. Another simpler view is that GPT-4’s multimodal version remains a decoder-only transformer but trained with images inserted as pseudo-tokens. This is the approach taken by Kosmos-1 and LLaVA, and it has been shown to work surprisingly well. A community comment summarized GPT-4V as “probably a ViT with a projection layer feeding into a decoder LLM (like a CLIP-like model for images feeding GPT)” which aligns with our discussion.

In practice, the difference might be subtle – both architectures ultimately transform image pixels to embeddings and let the language model attend to them. GPT-4V’s multimodal model likely forms joint image-text representations internally, which could emerge from either architecture given sufficient training. The advantage of the unified sequence approach is simplicity: the multimodal training can be a straightforward extension of language modeling (just with image token placeholders). The advantage of an encoder-decoder is flexibility: e.g. you could generate image tokens or have multiple output modalities. In fact, GPT-4o (omnimodal) can output modalities like speech (and perhaps images in the future), which does hint at a more complex architecture for decoding. But for input side encoding, both GPT-4V and GPT-4o likely share the same mechanism for images – using a ViT-like encoder to produce patch embeddings that feed the main model.