From the docs it seems these models do not support image input. Are there any plans to support multimodal input, specifically images? This is kind of critical for most modern agent work.
Vision can be a separate layer, and the app can stay smaller.
Google Gemma 3n: 8B parameters, 2B/4B effective weights; text in, audio in (trained on 30 s clips), image in (native vision up to 768x768), video in (as 3 fps frames); text out.
Or get fancy and refine the E4B weights with slicing profiles.
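That layered approach is easy to prototype: caption with a small vision model, then hand the caption to the text-only model. A rough sketch, where the ports, model names, and prompts are all my own assumptions (any OpenAI-compatible server such as vLLM or Ollama would do):

```python
# Sketch: "vision as a layer" in front of a text-only model.
# Assumptions: a vision-capable model served at localhost:8000 and a
# text-only model at localhost:8001; model names are placeholders.
import base64
from openai import OpenAI

vision = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
text = OpenAI(base_url="http://localhost:8001/v1", api_key="none")

def describe(image_path: str) -> str:
    """Ask the vision model for a dense description of the image."""
    b64 = base64.b64encode(open(image_path, "rb").read()).decode()
    resp = vision.chat.completions.create(
        model="gemma-3-12b-it",  # placeholder model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in detail."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

def answer(image_path: str, question: str) -> str:
    """Let the text-only model reason over the caption instead of pixels."""
    caption = describe(image_path)
    resp = text.chat.completions.create(
        model="gpt-oss-20b",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "You are given an image description produced by a separate vision model."},
            {"role": "user",
             "content": f"Image description:\n{caption}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

print(answer("sign.jpg", "What's amusing in this image?"))
```

Of course the text model only ever sees the caption, never the pixels, which is the trade-off discussed further down.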
Vision task trial: Gemma 3 12B, open weights
What’s amusing or humourous in the story depicted by this image?
Here’s what’s amusing and humorous about the image:
- The Typo: The most obvious source of humor is the typo on the sign. It reads “OPENAI LIVE TREAM TODAY” instead of “OPENAI LIVE STREAM TODAY.” This is a relatable mistake – everyone has made a typo at some point, especially when rushing to put up a sign.
- The Woman’s Expression and Posture: The woman’s face is a picture of dismay and frustration. Her posture, kneeling with her hands gesturing in a helpless way, amplifies the comedic effect. She looks like she’s just realized a major blunder and is desperately trying to figure out how to fix it.
- The Scattered Letters: The box of letters scattered on the floor reinforces the idea that she was in the process of creating the sign and made a mistake. It adds to the visual chaos and the sense of a moment of panic.
- The Contrast: The image sets up a contrast between the serious, professional setting (a corporate hallway, a woman in a business suit) and the silly, relatable mistake. This juxtaposition makes the situation even funnier.
Essentially, the humor comes from the unexpected and human error in a context that’s meant to be polished and professional.
Vision task trial: GPT-4.1-mini
What’s amusing or humourous in the story depicted by this image?
The humorous element in the image comes from the woman’s confused and slightly frustrated expression while holding a number “5” piece, apparently trying to figure out what number or letter to place on the board that already reads “OPENAI LIVE STREAM TODAY.” The box on the floor full of jumbled letters and numbers adds to the comedy, suggesting she is overwhelmed or struggling to complete a simple announcement sign. The contrast between the professional setting and the absurdity of someone laboring over letter placement like a puzzle evokes a lighthearted and amusing scenario.
Vision task trial: GPT-5, reasoning minimal, verbosity high
What’s amusing or humourous in the story depicted by this image?
A few playful things are going on:
- The sign is supposed to announce something important, but it’s riddled with typos: “OPENAI LIVE TREAM TODAY” instead of “OpenAI Live Stream Today.”
- The woman looks exasperated while holding a letter, surrounded by a box of spare letters—as if she has everything she needs to fix it but still can’t get it right.
- The scene exaggerates the mismatch between the formal setting (business suit, official-looking hallway sign) and the very basic mistake, which creates a sitcom-like moment of low-stakes chaos.
- It’s a visual pun on “stream” vs. “tream,” highlighting how one missing or swapped letter can completely derail an otherwise polished announcement.
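For reference, a trial with those settings can be configured roughly like this through the Responses API; the file name and image handling below are assumptions on my part:

```python
# Sketch of the GPT-5 trial above: minimal reasoning, high verbosity,
# image passed as a base64 data URL. The image path is a placeholder.
import base64
from openai import OpenAI

client = OpenAI()
b64 = base64.b64encode(open("sign.jpg", "rb").read()).decode()

resp = client.responses.create(
    model="gpt-5",
    reasoning={"effort": "minimal"},   # "reasoning minimal"
    text={"verbosity": "high"},        # "verbosity high"
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text",
             "text": "What's amusing or humourous in the story depicted by this image?"},
            {"type": "input_image",
             "image_url": f"data:image/jpeg;base64,{b64}"},
        ],
    }],
)
print(resp.output_text)
```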
I need the reasoning too. Maybe I could try freezing something from E4B and gpt-oss, then training a smaller model to match them.
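In case it helps, the "freeze the teachers, train a smaller model to match" part is essentially logit distillation. A minimal sketch with one frozen teacher, and a big caveat: it assumes the teacher and student share a tokenizer, which gpt-oss and Gemma 3n E4B do not, so in practice you would distil on generated text (sequence level) instead. Model names are placeholders:

```python
# Sketch of logit distillation: frozen teacher, trainable student.
# Assumption: teacher and student share a tokenizer/vocabulary.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher = AutoModelForCausalLM.from_pretrained("teacher-model")  # placeholder
student = AutoModelForCausalLM.from_pretrained("student-model")  # placeholder
tok = AutoTokenizer.from_pretrained("teacher-model")
if tok.pad_token is None:
    tok.pad_token = tok.eos_token

teacher.eval()
for p in teacher.parameters():   # freeze the teacher
    p.requires_grad_(False)

opt = torch.optim.AdamW(student.parameters(), lr=1e-5)
T = 2.0                          # softening temperature

def distill_step(texts: list[str]) -> float:
    """One optimisation step: match the student's token distribution to the teacher's."""
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        t_logits = teacher(**batch).logits
    s_logits = student(**batch).logits
    # KL divergence between softened teacher and student distributions
    loss = F.kl_div(
        F.log_softmax(s_logits / T, dim=-1),
        F.softmax(t_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```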
Maybe a vision tool?
You get to provide your own tool handler. We can see in the gpt-oss reasoning that the model understands the custom function format I provide, yet it has overfitted during post-training on internal tools and still dumps the call to an OpenAI internal recipient with a tool-call token:
The user asks: "What species is this bird I snapshot in San Francisco?" and provides an image attachment with ID 1. We need to use VisionBUDDY tool to get description. Use format: @@@VisionBUDDY[1]@@@.... The prompt to VisionBUDDY: ask for species identification. So we send: @@@VisionBUDDY[1]@@@What species is this bird in the photo?@@@. Then await result.

assistant commentary to=functions.VisionBUDDY json
{
  "image_ids": [1],
  "prompt": "What species is this bird in the photo?"
}
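For what it's worth, if you register the tool through the standard function-calling interface instead of an inline marker format, a call like the one above is exactly what your handler receives and can dispatch. A rough sketch, where the endpoint, model name, and the run_vision_model helper are my own assumptions (VisionBUDDY just mirrors the name from the trace):

```python
# Sketch: catching the tool call above with a local handler via the
# standard function-calling interface. Endpoint and model name assumed.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="none")

def run_vision_model(image_ids: list[int], prompt: str) -> str:
    """Hypothetical helper: fetch the attachments and query any local VLM."""
    return f"(vision model answer for images {image_ids}: {prompt!r})"

tools = [{
    "type": "function",
    "function": {
        "name": "VisionBUDDY",
        "description": "Answer a question about attached images.",
        "parameters": {
            "type": "object",
            "properties": {
                "image_ids": {"type": "array", "items": {"type": "integer"}},
                "prompt": {"type": "string"},
            },
            "required": ["image_ids", "prompt"],
        },
    },
}]

messages = [{"role": "user",
             "content": "What species is this bird I snapshot in San Francisco? (attachment ID 1)"}]

while True:
    resp = client.chat.completions.create(
        model="gpt-oss-20b",  # placeholder name
        messages=messages,
        tools=tools,
    )
    msg = resp.choices[0].message
    if not msg.tool_calls:        # final answer, no more tool calls
        print(msg.content)
        break
    messages.append(msg)
    for call in msg.tool_calls:   # dispatch each VisionBUDDY call locally
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": run_vision_model(args["image_ids"], args["prompt"]),
        })
```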
That is a good workaround, but not the same thing. The model cannot reason about the image.
