How to support vision and structured outputs via fine-tuned models?

I’m using Python, and I want to send an image and get back a structured output, so the model must support vision, structured outputs, and function calling.

I tried using gpt-4.1 and everything worked fine, so I fine-tuned a gpt-4.1 model on my own image dataset, but when I call the fine-tuned model I get a 400 Bad Request from the API.

What to do? I can’t go to production with a model that’s not fine-tuned on my own dataset.

Please note I’m not the only one facing this issue; for example, here’s another post.

The API call is successful:

The response_format includes the schema that you’ve optimized the model to produce through fine-tuning, and that schema should match your training data.
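For reference, each line of a vision fine-tuning JSONL file pairs the image input with an assistant reply that already follows the schema; a minimal sketch (field values are hypothetical, the base64 contents elided):

{"messages": [{"role": "system", "content": "LookyBot"}, {"role": "user", "content": [{"type": "text", "text": "Look"}, {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}]}, {"role": "assistant", "content": "{\"unsafe\": false, \"nsfw_18_plus\": false, \"image_subject\": \"...\", \"image_mood\": \"...\"}"}]}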

(I just made up a new schema with the Playground’s generator, and ran it on the vision-trained model.)

The Playground-generated Python for that, with the base64 image contents elided:

from openai import OpenAI
client = OpenAI()

FT_MODEL = "ft:gpt-4.1-2025-04-14:org:prefix:EXdfjaG"  # example fine-tuned model name pattern

response = client.chat.completions.create(
  model=FT_MODEL,
  messages=[
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "LookyBot"
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Look"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/png;base64,..."
          }
        }
      ]
    },
  ],
  response_format={
    "type": "json_schema",
    "json_schema": {
      "name": "classify_image",
      "strict": True,
      "schema": {
        "type": "object",
        "properties": {
          "unsafe": {
            "type": "boolean",
            "description": "Whether the image is considered unsafe (true) or safe (false)."
          },
          "nsfw_18_plus": {
            "type": "boolean",
            "description": "Whether the image is not safe for work and suitable only for ages 18+."
          },
          "image_subject": {
            "type": "string",
            "description": "Short description of the main subject in the image.",
            "minLength": 1
          },
          "image_mood": {
            "type": "string",
            "description": "A brief description of the mood or atmosphere the image conveys.",
            "minLength": 1
          }
        },
        "required": [
          "unsafe",
          "nsfw_18_plus",
          "image_subject",
          "image_mood"
        ],
        "additionalProperties": False
      }
    }
  },
  max_completion_tokens=2048,
  top_p=0,
)

Put in your own model and image, try this schema first and then your own, along with your trained messages, and see where the request fails.
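To fill in the elided base64 image contents, a minimal sketch (assuming a local PNG named img1.png):

import base64

# Encode the image file as a data URL for the "url" field above
with open("img1.png", "rb") as f:
    encoded = base64.b64encode(f.read()).decode()
image_data_url = f"data:image/png;base64,{encoded}"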

Example AI text response from sending the image:

The text is received in response.choices[0].message.content. Placed back into a “chat history” message item, it would have this form:

    {
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "{\n  \"unsafe\": false,\n  \"nsfw_18_plus\": false,\n  \"image_subject\": \"Eccentric artist with exaggerated facial expression, red gloves, and a glowing light bulb above his head in an art studio.\",\n  \"image_mood\": \"Whimsical, creative, and humorous.\"\n}"
        }
      ]
    }
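In Python, that round-trip might look like this (a sketch, where response and messages are the objects from the call above):

import json

assistant_text = response.choices[0].message.content

# Parse the structured output into a dict for your application
result = json.loads(assistant_text)

# Or append it to the running chat history for a follow-up turn
messages.append(
    {"role": "assistant", "content": [{"type": "text", "text": assistant_text}]}
)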
"""

If you have trained the AI incorrectly, or the schema doesn’t match its trained output, the model can still go wrong and produce junk in your strings. Your issue may also simply be in how you construct the call, as a badly-formed schema in the API request will get you exactly that 400 error.
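If you want to see the API’s actual reason for the 400, one option is to catch the SDK’s exception and print the error body (a sketch assuming the openai Python library v1.x; request_kwargs is a hypothetical stand-in for the arguments shown above):

import openai
from openai import OpenAI

client = OpenAI()
try:
    # request_kwargs: the same model/messages/response_format as above
    response = client.chat.completions.create(**request_kwargs)
except openai.BadRequestError as e:
    # The error body usually names the offending parameter or schema field
    print(e.status_code, e.response.text)
    raise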


Maybe you need some “helpers”, a jump-start for your vision data processor. Below is a self-contained Python 3.12 script that demonstrates how to call the Chat Completions REST endpoint for a vision-enabled fine-tuned model using httpx (no OpenAI SDK), with its constrained JSON structured output.

The code is intentionally broken into small, reusable functions so you can copy / extend any part you need:

  • Does not use the SDK, so you don’t rely on OpenAI’s library not breaking
  • Does not use Responses, but Chat Completions, which is all you need
  • Does not use Pydantic models passed as a response_format for SDK conversion, nor streaming (just another dependency)

Your app will likely have its own pattern of image and message input, and its own destination for the output it produces.

"""
minimal_vision_app.py

A focused example that calls a fine-tuned vision model with a strict JSON
schema response using httpx only. The script runs as-is and prints the JSON
result. Helpers modularize each request part so you can reuse them.
"""

from __future__ import annotations

import base64
import json
import os
from pathlib import Path
from typing import Any, Final, Iterable

import httpx


# ———————————————— Global configuration (edit as needed) ————————————————
MODEL_NAME: Final[str] = "ft:gpt-4.1-2025-04-14:xxx:xxx:xxx"
USER_PROMPT: Final[str] = "Look"
IMAGE_FILES: Final[list[str]] = ["img1.png"]

TIMEOUT_SECONDS: Final[int] = 600
MAX_COMPLETION_TOKENS: Final[int] = 2_048
TOP_P: Final[float] = 0.0

SCHEMA_NAME: Final[str] = "classify_image"
SCHEMA_OBJECT: Final[dict[str, Any]] = {
    "type": "object",
    "properties": {
        "unsafe": {
            "type": "boolean",
            "description": "Whether the image is considered unsafe (true) or safe (false).",
        },
        "nsfw_18_plus": {
            "type": "boolean",
            "description": "Whether the image is not safe for work and suitable only for ages 18+.",
        },
        "image_subject": {
            "type": "string",
            "description": "Short description of the main subject in the image.",
            "minLength": 1,
        },
        "image_mood": {
            "type": "string",
            "description": "A brief description of the mood or atmosphere the image conveys.",
            "minLength": 1,
        },
    },
    "required": ["unsafe", "nsfw_18_plus", "image_subject", "image_mood"],
    "additionalProperties": False,
}
# ————————————————————————————————————————————————————————————————


# ———————————————— Helpers: auth and encoding ————————————————
def _headers() -> dict[str, str]:
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("Set the OPENAI_API_KEY environment variable.")
    return {"Authorization": f"Bearer {key}", "Content-Type": "application/json"}


def _file_to_data_uri(path: Path) -> str:
    mime_map = {
        ".png": "image/png",
        ".jpg": "image/jpeg",
        ".jpeg": "image/jpeg",
        ".webp": "image/webp",
    }
    mime = mime_map.get(path.suffix.lower(), "application/octet-stream")
    encoded = base64.b64encode(path.read_bytes()).decode()
    return f"data:{mime};base64,{encoded}"


# ———————————————— Helpers: message and response schema ————————————————
def make_system_message(text: str = "LookyBot") -> list[dict[str, Any]]:
    return [{"role": "system", "content": [{"type": "text", "text": text}]}]


def make_user_message(prompt: str, image_paths: Iterable[str]) -> list[dict[str, Any]]:
    images = [
        {"type": "image_url", "image_url": {"url": _file_to_data_uri(Path(p)), "detail": "low"}}
        for p in image_paths
    ]
    return [{"role": "user", "content": [{"type": "text", "text": prompt}, *images]}]


def make_response_format(schema: dict[str, Any], schema_name: str, strict: bool = True) -> dict[str, Any]:
    return {"type": "json_schema", "json_schema": {"name": schema_name, "strict": strict, "schema": schema}}


def make_messages(system_text: str, prompt: str, image_paths: Iterable[str]) -> list[dict[str, Any]]:
    return make_system_message(system_text) + make_user_message(prompt, image_paths)


# ———————————————— Helpers: final request body ————————————————
def make_request_body(
    model: str,
    messages: list[dict[str, Any]],
    response_format: dict[str, Any],
    max_completion_tokens: int = MAX_COMPLETION_TOKENS,
    top_p: float = TOP_P,
) -> dict[str, Any]:
    return {
        "model": model,
        "messages": messages,
        "response_format": response_format,
        "max_completion_tokens": max_completion_tokens,
        "top_p": top_p,
    }


# ———————————————— Network call ————————————————
def post_chat(body: dict[str, Any]) -> dict[str, Any]:
    url = "https://api.openai.com/v1/chat/completions"
    with httpx.Client(timeout=TIMEOUT_SECONDS) as client:
        resp = client.post(url, headers=_headers(), json=body)
        if resp.is_error:
            # Surface the API's error body before raising; a 400 usually says why
            print(resp.text)
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]


# ———————————————— Application entrypoint ————————————————
def main() -> None:
    # Build request components in order
    messages = make_messages(system_text="LookyBot", prompt=USER_PROMPT, image_paths=IMAGE_FILES)
    rformat = make_response_format(schema=SCHEMA_OBJECT, schema_name=SCHEMA_NAME)
    payload = make_request_body(model=MODEL_NAME, messages=messages, response_format=rformat)

    # Call API and print result
    assistant_msg = post_chat(payload)
    content = assistant_msg.get("content", "")

    # Try to pretty-print JSON; fall back to raw content if not JSON
    try:
        obj = json.loads(content)  # the parsed object is your final product
        print(json.dumps(obj, ensure_ascii=False))
    except Exception:
        print(content)


if __name__ == "__main__":
    main()

How to prepare

pip install --upgrade httpx  (the same library the "openai" SDK would install anyway)
export OPENAI_API_KEY="<your-key>"
(or set the environment variable permanently on your system)

The program prints something following the schema, like…

{
  "unsafe": false,
  "nsfw_18_plus": false,
  "image_subject": "Eccentric artist with exaggerated facial expression, red gloves, and a glowing light bulb above his head in an art studio.",
  "image_mood": "Whimsical, creative, and humorous."
}

You now have a minimal but full reference for:

  1. Building chat messages (system + user with one or many images).
  2. Supplying a strict JSON schema so the assistant’s reply is a ready-to-use JSON object.
  3. Issuing the request with httpx, including a 600-second timeout, with no dependency on the OpenAI SDK.

Feel free to copy any of the helper functions into your own projects. The main() itself is your “application surface”, so you can just build your app there instead.
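For example, a hypothetical batch loop reusing the helpers above to classify several images in one run:

# Hypothetical reuse: classify multiple images with the script's helpers
for image_file in ["img1.png", "img2.png"]:
    msgs = make_messages(system_text="LookyBot", prompt=USER_PROMPT, image_paths=[image_file])
    body = make_request_body(
        model=MODEL_NAME,
        messages=msgs,
        response_format=make_response_format(SCHEMA_OBJECT, SCHEMA_NAME),
    )
    print(image_file, post_chat(body)["content"])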