Vision anomalies in gpt-4.1 mini/nano

This is a collection of “seeing things wrong”. For amusement, and technology exploration.

Can you find interesting ways that AI models do not see what is expected?

Colorful Squares

AI consistently reports a grid of wrong size and reports wrong colors

USER

Describe the attached image contents in detail

gpt-4.1-mini - default sampling

The attached image consists of a 3x3 grid of solid-colored squares. The colors, from left to right and top to bottom, are as follows:

Top row: Medium purple, lavender, bright green
Middle row: Dark blue-purple, bright cyan, light green
Bottom row: Royal blue, turquoise, mustard yellow

Each square is evenly sized and arranged neatly in the grid pattern, creating a simple and colorful abstract compilation.

gpt-4.1-mini - top_p/temperature @ 0.1

The image consists of a 3x3 grid of nine equally sized squares, each filled with a different solid color. The colors from left to right, top to bottom, are:

Top row: Light blue, lavender, light green
Middle row: Purple, light pink, medium green
Bottom row: Dark blue, light purple, mustard yellow

Each square is a distinct color with no patterns or text, creating a simple and colorful patchwork effect.


Aligning grid of 32px

Image passed at 480x480, vs resized to 500x500, the former results in complete hallucination and a response like no context except that it is on vision (which can be internal “vision enabled”)

USER

Describe the attached image contents in detail

gpt-4.1-mini @ 0.1

The image is a black and white line drawing of a cartoonish character resembling a cat. The cat has a large, round head with two pointy ears on top. Its eyes are large and round, with small pupils, giving it an expressive look. The cat’s nose is small and triangular, and it has a wide, smiling mouth with a small tongue sticking out. The character’s body is small compared to its head, with short arms and legs. The cat is standing upright on its hind legs, with its front paws raised slightly. The overall style is simple and playful, with bold outlines and minimal detail.

Resized up to 500x500 px

There is wrong information about the starting color, but, there is some factual perception. That seems to come about by breaking internal alignment, where the embedded “token” information only has semantics of “black?”.

The image is a classic checkerboard pattern consisting of alternating black and white squares. The pattern is arranged in a grid format with equal-sized squares. Each row alternates colors, starting with a black square in the top-left corner, followed by a white square, then black, and so on. The next row starts with a white square, then black, continuing the alternating pattern. This creates a visually balanced and symmetrical design typical of a checkerboard. The image is square in shape, and the pattern extends uniformly across the entire image.

same 480x480 image, different text…

system message is just a colon; user content is just the B/W checkerboard

This is a floor plan of a house with the following rooms and features:

  • Living Room: Located on the left side, it has a sofa, a coffee table, and a TV.
  • Dining Area: Adjacent to the living room, it has a dining table with six chairs.
  • Kitchen: Next to the dining area, equipped with a stove, sink, and refrigerator.
  • Bedroom 1: Located at the top right corner, furnished with a bed and a wardrobe.
  • Bedroom 2: Located at the bottom right corner, also furnished with a bed and a wardrobe.
  • Bathroom: Positioned between the two bedrooms, containing a bathtub, toilet, and sink.
  • Entrance/Hallway: Central area connecting all rooms.
  • Outdoor Area: At the bottom left, there is a patio with outdoor furniture.

The layout is compact and functional, with clear separation between living and sleeping areas. Would you like any specific details or modifications?

Even though there was no “look at this image” message, a fanciful image description was described.

Strips

Generated at 2048x64

USER

Describe the image
The image shows two rows of evenly spaced vertical black rectangles on a white background, with the rows offset from each other. (Captioned by AI)

GPT-4.1-mini

The image shows a black and white photograph of a man sitting on a chair. He is wearing a suit and tie, and he is holding a book or a document in his hands. The background is plain and light-colored, which contrasts with the dark suit of the man. The man appears to be looking slightly to his left. The overall setting suggests a formal or professional environment.

GPT-4.1

The image consists of a pattern of vertical black and white stripes. The stripes are evenly spaced and of equal width. The image is divided horizontally into two equal sections: the top half and the bottom half. In the top half, the black stripes are aligned directly above the white spaces in the bottom half, creating a staggered or offset effect between the two halves. This creates a visual effect similar to a checkerboard, but with vertical stripes instead of squares.

Generated at 2080x64

The API has a resizer, max 2000px it says, although now it seems to pass up to 2048 (unlike another resizing test I tried a week ago confirming downsizing to 2000px on patches models) despite that this would only be 128 tokens of image input (the bill is 220 tokens for 2048 and for 2080)

user:

Describe the image
The image shows a pattern of vertical black and white stripes, with a horizontal break in the middle creating a staggered effect. (Captioned by AI)

GPT-4.1-mini

The image shows a black and white checkerboard pattern. The pattern consists of alternating black and white squares arranged in a grid. The squares are uniform in size, creating a classic checkerboard appearance.

(one can discover that the forum’s AI image captioner doesn’t have the issue seeing)


gpt-4.1-mini billed usage (has 1.62x multiplier):

1984x64 : 213 tokens
2016x64 : 217 tokens
2048x64 : 220 tokens
2080x64 : 220 tokens

The ending results of 63x2 patches with the cost multiplier should be Base Tokens: 126 Input Tokens: 205 - so there is a bit more billing than expected for the text plus ChatML container prompt.

1 Like

Jokes on you, I’ll just fine-tune it! Oh wait…

:rofl:

1 Like