Does gpt-4o-mini (via API) support image inputs?

So I am using the gpt-4o-mini model (API call from my Python app) and it works fine.
I was just wondering whether this specific model also supports images as input (and at the same cost), or whether I need to use a different model for that?

Another question: can gpt-4o-mini (via API) visit URLs, fetch their content, and generate a rewritten version of it? And if not, is there another OpenAI model that can handle such a task?

gpt-4o-mini supports image inputs for vision as user message parts … at TWICE the monetary cost of the same image sent to gpt-4o. Token billing for images is multiplied by 33.3x for this model.

Instead, you might investigate the vision skill of gpt-4.1-mini for your application. It uses different technology and billing, and is especially cheaper for smaller images.
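For reference, the request shape is the same for both models; here is a minimal sketch with the official Python SDK (the image URL and prompt are placeholders):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # or "gpt-4.1-mini" for the cheaper image billing
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```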

I added a comparison mode to my image pricing estimator. Instead of calculating and totaling different images as if for one API model call, you can add the same image a few times and switch the model on individual images to see pricing side by side.


Regarding internet access: the only built-in tool on offer, available via the Responses API endpoint rather than Chat Completions, is web search.

It doesn’t act the way you describe: it searches the web using a query the AI writes, explores several sites that you cannot dictate, and produces a search-style answer when the tool is used. It will not recite individual pages.
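If a search-style answer is good enough, enabling the built-in tool looks roughly like this (a sketch; the tool type string follows the Responses API docs at the time of writing and may change):

```python
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4o-mini",
    tools=[{"type": "web_search_preview"}],  # built-in web search tool
    input="Summarize this week's news about the James Webb telescope.",
)
print(response.output_text)  # answer citing sites the model chose itself
```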

You would need to write your own web page scraper, exposed as a function call, if you want to offer a “look at this page for me” feature. (Modern dynamic web pages are hard to explore and read, being built on JavaScript; they need a utility like Selenium that drives a real web browser to get the text out.)
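A rough sketch of that pattern: a hypothetical fetch_page helper exposed as a function tool, using plain requests and BeautifulSoup (static HTML only; JavaScript-heavy pages would still need Selenium or similar):

```python
import json

import requests
from bs4 import BeautifulSoup
from openai import OpenAI

client = OpenAI()

def fetch_page(url: str) -> str:
    """Fetch a page and return its visible text (no JavaScript rendering)."""
    html = requests.get(url, timeout=10).text
    return BeautifulSoup(html, "html.parser").get_text(separator="\n", strip=True)

tools = [{
    "type": "function",
    "function": {
        "name": "fetch_page",
        "description": "Fetch a web page and return its text content.",
        "parameters": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"],
        },
    },
}]

messages = [{"role": "user",
             "content": "Rewrite https://example.com in a friendlier tone."}]
first = client.chat.completions.create(
    model="gpt-4o-mini", messages=messages, tools=tools
)

# Assumes the model chose to call the tool; production code should check.
call = first.choices[0].message.tool_calls[0]
page_text = fetch_page(json.loads(call.function.arguments)["url"])

# Feed the tool result back so the model can produce the rewrite.
messages.append(first.choices[0].message)
messages.append({"role": "tool", "tool_call_id": call.id,
                 "content": page_text[:20000]})  # crude length cap
final = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(final.choices[0].message.content)
```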

I’ve had success prompting it not to use Wikipedia :joy: but of course it isn’t guaranteed.

IIRC there are third-party APIs for this. Attempting to rebuild web search sounds impractical; I couldn’t recommend it.

Unless it’s a hobby project. Anything to save a few bucks for fun!

Can confirm that J’s pricing calculator is pretty good. As for specific models, gpt-4.1-nano is good for cheap, high-volume vision inference. It’s a very dumb model, though, so I could only recommend it if you’re doing a ton of these. You’ll have to experiment and see what works best for your use case. Just be aware that gpt-4.1-mini and gpt-4.1-nano use a wacky “patches” technique that sometimes causes them to hallucinate on unusual inputs; gpt-4.1 doesn’t use it.

For my real estate app, with some good prompting, I was able to get great results out of nano, so please don’t discard it by default for being dumb (who knows, maybe it will take it personally? :joy: )

I attempted to build a simple classifier with it, where it needed to answer three yes-or-no questions about a given noun, and it appeared to simply choose at random on every run. I’d run an 8B-parameter model on my own computer before using nano, but I’m glad (and mildly surprised) to learn you’ve had success with it.

It depends on the prompt and the task. For classification, it might be done in several steps (sketched after the list):

  1. Vision (gpt-4.1-nano): clear (and simple to follow) instructions covering the context of the app, the AI persona profile, the current task, how to approach the image description, and what to focus on, plus some response templates (not examples!). The goal is a detailed, precise description of the image.
  2. Distillation (gpt-4.1-mini): convert the long description into focused, classifiable text for the next step.
  3. Classification (gpt-4.1-mini or gpt-4.1-nano).
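A rough sketch of those three steps with the Python SDK (the system prompts, image URL, and yes/no questions are all placeholders to adapt):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(model: str, system: str, user_content) -> str:
    """One chat completion with a system prompt; returns the text reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user_content},
        ],
    )
    return response.choices[0].message.content

# Step 1 - vision: detailed description from the cheap model.
description = ask(
    "gpt-4.1-nano",
    "You describe photos for a real-estate app in exhaustive detail: "
    "rooms, materials, furniture, condition.",
    [
        {"type": "text", "text": "Describe this photo."},
        {"type": "image_url",
         "image_url": {"url": "https://example.com/listing.jpg"}},
    ],
)

# Step 2 - distillation: reduce to classifiable statements.
facts = ask(
    "gpt-4.1-mini",
    "Reduce the description to short factual statements useful for classification.",
    description,
)

# Step 3 - classification: answer the actual questions from the facts.
verdict = ask(
    "gpt-4.1-mini",
    "Answer each question with only 'yes' or 'no'.",
    f"Facts:\n{facts}\n\nQuestions:\n1. Indoors?\n2. Furnished?\n3. Renovated?",
)
print(verdict)
```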

Also, please consider image embeddings for fairly simple classifications: find which of the text vectors is closest to the image vector.
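Note that OpenAI’s embeddings endpoint is text-only, so matching an image against text labels needs a multimodal embedding model such as CLIP; here is a sketch with the open-source sentence-transformers CLIP checkpoint (the model name, image file, and labels are assumptions):

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# CLIP places images and text in the same vector space.
model = SentenceTransformer("clip-ViT-B-32")

labels = ["a kitchen", "a bathroom", "a bedroom"]  # your classes, phrased as text
label_vecs = model.encode(labels, convert_to_tensor=True)
image_vec = model.encode(Image.open("listing.jpg"), convert_to_tensor=True)

scores = util.cos_sim(image_vec, label_vecs)[0]
print(labels[int(scores.argmax())])  # closest text vector wins
```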

Another approach would be to embed the criteria and match them against embeddings of either the full or the distilled descriptions of your images.
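That text-to-text variant works directly with OpenAI’s embeddings endpoint; a sketch with text-embedding-3-small (the criteria and description strings are placeholders):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

criteria = ["the room is furnished", "the room is empty"]
description = "A bright living room with a sofa, coffee table, and bookshelves."

criteria_vecs = embed(criteria)
desc_vec = embed([description])[0]

# These embeddings come back unit-normalized, so a dot product
# is already the cosine similarity.
scores = criteria_vecs @ desc_vec
print(criteria[int(scores.argmax())])
```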

But without knowing the specifics of your application, it’s hard to say anything more concrete.