Degraded Performance on gpt-4.1-mini ft

Getting absolutely jumbled responses back that make no sense. Using structured outputs, and it's returning things not even close to the specified schema.

Anyone else having these issues?

Check the temperature being sent. A top_p and a temperature under 0.5 are good for controlling sampling. Not “2”.
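For example, a minimal sketch (the fine-tune model name is just a placeholder):

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="ft:gpt-4.1-mini:my-org::abc123",  # placeholder for your fine-tuned model
    messages=[{"role": "user", "content": "Say hello."}],
    temperature=0.2,  # low temperature keeps sampling tight
    top_p=0.5,        # restrict nucleus sampling to the top half of probability mass
)
print(response.choices[0].message.content)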

I have a `ft:gpt-4.1-nano` that is acting as expected in writing typical text.

Then again, I have no clue whether you are asking about a new fine-tuned model, a new schema, a schema you've trained the AI to produce within examples, or whether you ever had it working.

1 Like

A fine-tune. I’ve had it for a year now and haven’t changed anything. Temp 0.

Every six months or so I’ll get a day like this where the output is completely incomprehensible. But now my startup is live, so it’s a bit more critical.

Tested it on old inputs as well to validate it wasn’t a change in the user_prompt - still not working.

Trained up a new model, learning rate multiplier 0.6x. Ran at top_p: 0.8. Structured output it wasn’t trained on. Seems to write just fine.

Without a system message:

{
  "raven_output_text": "Picture a lone raven perched on a desolate branch, eyes gleaming with a deep, unspoken understanding of the world. My vibe is one of quiet contemplation, an unshakable connection to the shadows that others shy away from. I see the beauty in decay, the poetry in solitude, and the profound truths hidden beneath the surface of everyday life. My outlook is not one of despair, but of a deep, abiding acceptance of the darker aspects of existence. I am drawn to the mysteries that linger in the corners of the mind, to the subtle energies that pulse beneath the visible world. I am a witness to the intricate dance of life and death, a participant in the eternal cycle of transformation. My presence is a reminder that there is depth and richness in the places others fear to explore, that the shadows hold their own kind of beauty.",
  "raven_mood": "serene, contemplative, mysterious"
}

With an identity-triggering system message:

{
  "raven_output_text": "I am the darkness in the light, the whisper in the silence. I am the part of you that you try to hide, the shadow that follows you everywhere. I am not afraid to be different, to be misunderstood. I embrace the darkness within me, and I find beauty in the macabre and the mysterious.",
  "raven_mood": "dark, introspective, slightly eerie"
}

2 Likes

Interesting. Mine is trained on JSON strings, and then I use structured outputs during inference.

I will basically just get cases of completely garbled JSON, or it will repeat the same key over and over again. Or worst of all (because it bypasses my JSON check), it gives back valid JSON that is also just completely wrong.

The above simple schema style for an output described within a system message was tested, as well as the actual use of json_schema.

If it is a failure to place keys in the correct nesting because the AI is able to fabricate anything it wants as output, you might upgrade the response_format enforcement to a strict structured schema using text.format (on Responses). Then the AI cannot deviate from the basic JSON form; it should be impossible with “strict” and no optional keys.

Also, document which endpoint is giving you grief if it is a failure on only one of Chat Completions or Responses.

This is Chat Completions. Can I even use ft:gpt-4.1-mini with Responses??? Thought it was recommended not to do that.

You can use a fine-tuned model with either endpoint. You should be able to create identical token inputs (except in the case of the currently broken ft:gpt-4.1 vision billing) and get similar output, to verify whether only one endpoint is symptomatic (like how gpt-3.5-turbo was, a while back, somehow broken on only Chat Completions, making multiple useless function calls that were impossible to stop).

There is no good justification for Responses with a fine-tuned model, though, as you cannot train on any of that endpoint’s hosted tools or on internal tool iterations.

1 Like

Well, wouldn’t there be justification b/c it wouldn’t mess up the JSON schema?

1 Like

A structured output specification, provided as a JSON schema, can be supplied to either endpoint.

  • Chat Completions: "response_format"
  • Responses: "text.format"

They also take slightly different shapes, with some of the container keys moved around. I’ve got some scripts to make the API switch less of a hassle, or even to fix up user confusion about what root level is needed for this parameter.
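Roughly, the same schema wrapped for each endpoint looks like this (a sketch from memory, so double-check against “get code”; the schema body here is a stand-in):

# stand-in for the real JSON schema object
schema = {"type": "object", "properties": {}, "required": [], "additionalProperties": False}

# Chat Completions: the schema container lives under "response_format" -> "json_schema"
chat_completions_kwargs = {
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "my_schema", "strict": True, "schema": schema},
    }
}

# Responses: the same pieces are flattened under "text" -> "format"
responses_kwargs = {
    "text": {
        "format": {"type": "json_schema", "name": "my_schema", "strict": True, "schema": schema}
    }
}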

You can place your schema into the prompts playground as if to run it. Then switch the endpoint from Chat Completions to Responses, and use “get code” to see what it does to your request.

1 Like

Oh, I meant using the Responses endpoint, not Chat Completions. I’m pretty sure it’s recommended not to do that for fine-tunes.

For context, I am using this with Chat Completions:

'response_format': {
    'type': 'json_schema',
    'json_schema': {
        'name': 'angels_extraction',
        'schema': {
            'type': 'object',
            'properties': {
                'investors': {
                    'type': 'array',
                    'items': {
                        'type': 'object',
                        'properties': {
                            'angel_investor_name': {
                                'type': 'string',
                                'description': 'The name of the angel investor that is investing'},
                            'lead_investor': {
                                'type': 'boolean',
                                'description': 'True if the angel investor was a lead investor in the deal. False otherwise'},
                            'financing_id': {
                                'type': 'string',
                                'description': 'The financing id that matches the angel investor to the financing'}},
                        'required': ['angel_investor_name', 'lead_investor', 'financing_id'],
                        'additionalProperties': False}}},
            'required': ['investors'],
            'additionalProperties': False}}},

Getting responses like this that take up the whole token limit

{“investors”: }\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n ’

From the playground too.

1 Like

Important to note:

  • JSON mode and JSON schema mode highly encourage “pretty” multi-line output. This was a poor decision by OpenAI, has no “off” switch under “strict” enforcement, and the tabs and linefeeds trained in this way are a direct cause of this persistent looping symptom, seen since gpt-4-turbo.
  • OpenAI should not mess with live models or AI inference. This shows continued disrespect for developers.
  • The fault in model inference is severe here. You’ve shown the AI continuing after a closed JSON, and also not even writing a key’s value but instead emitting linefeeds within a JSON.
  • Sampling fixes you can apply to work around this fault in inference, such as stop sequences and logit_bias, are also missing in Responses. Rejecting that poor endpoint is wise. (A quick sketch of the logit_bias approach follows below.)
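A quick sketch of the logit_bias part on Chat Completions, computing the newline-run token IDs at runtime rather than hard-coding them (this assumes gpt-4.1 uses the o200k_base tokenizer; the fine-tune name is a placeholder):

import tiktoken
from openai import OpenAI

client = OpenAI()

# assumption: gpt-4.1 models tokenize with o200k_base
enc = tiktoken.get_encoding("o200k_base")

# heavily penalize the tokens behind runs of blank lines; a blunt instrument,
# since it also blocks legitimate pretty-printed blank lines
bias = {}
for run in ("\n\n", "\n\n\n", "\n\n\n\n"):
    for token_id in enc.encode(run):
        bias[token_id] = -100

response = client.chat.completions.create(
    model="ft:gpt-4.1-mini:my-org::abc123",  # placeholder fine-tune name
    messages=[{"role": "user", "content": "Extract the investors from ..."}],
    temperature=0,
    logit_bias=bias,
)
print(response.choices[0].message.content)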

Let’s imagine your application used prompting instead of training (and training should have a good system prompt anyway, to depart from typical AI behavior).

Your json_schema object upgraded to “strict”, with formatting undamaged by the forum (chat playground format):

{
  "name": "angels_extraction",
  "strict": true,
  "schema": {
    "type": "object",
    "properties": {
      "investors": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "angel_investor_name": {
              "type": "string",
              "description": "The name of the angel investor that is investing"
            },
            "lead_investor": {
              "type": "boolean",
              "description": "True if the angel investor was a lead investor in the deal. False otherwise"
            },
            "financing_id": {
              "type": "string",
              "description": "The financing id that matches the angel investor to the financing"
            }
          },
          "required": [
            "angel_investor_name",
            "lead_investor",
            "financing_id"
          ],
          "additionalProperties": false
        }
      }
    },
    "required": [
      "investors"
    ],
    "additionalProperties": false
  }
}

Output:

Mitigations:

  • retraining on the JSON format that is enforced by a schema
    • ensure you use newlines and tabs exactly as in the responses you get from strict structured output
  • retraining on the JSON for when there are no investors
    • reinforcement learning on empty JSON cases may help the model produce and close the empty array correctly, as well as avoid hallucination and fabrication
  • an anyOf subschema for when there is no output
    • you can receive a different trigger with a simple key/string when the AI wants to report entity extraction failure (a rough sketch follows after this list)
  • stop sequences
    • you can halt on an immediately closed investors array, and take the “stop” finish reason as no results
    • you can halt on excessive linefeeds, or even two in a row, indicating the damage; you can then strip() to see whether there is a useful JSON string within, or close it yourself (a sketch is at the end of this post)
  • migrate
    • find a provider where your applications are respected, or use open-weight models under your own control.
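For the anyOf idea, roughly something like this, as a Python dict in the same shape as the json_schema container (the "result" wrapper and "no_results" key are names I'm inventing here; adapt as needed):

# per-investor object, abbreviated from the schema above
investor_item = {
    "type": "object",
    "properties": {
        "angel_investor_name": {"type": "string"},
        "lead_investor": {"type": "boolean"},
        "financing_id": {"type": "string"},
    },
    "required": ["angel_investor_name", "lead_investor", "financing_id"],
    "additionalProperties": False,
}

# hypothetical wrapper: the single "result" key is either an investors list
# or an explicit no-results report, so the model has a legal way to say "nothing found"
angels_extraction_anyof = {
    "name": "angels_extraction",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "result": {
                "anyOf": [
                    {
                        "type": "object",
                        "properties": {"investors": {"type": "array", "items": investor_item}},
                        "required": ["investors"],
                        "additionalProperties": False,
                    },
                    {
                        "type": "object",
                        "properties": {
                            "no_results": {
                                "type": "string",
                                "description": "Short reason why no angel investors were found",
                            }
                        },
                        "required": ["no_results"],
                        "additionalProperties": False,
                    },
                ]
            }
        },
        "required": ["result"],
        "additionalProperties": False,
    },
}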
Python for a 'strict' response_format field
response_format = {
    "type": "json_schema",
    "strict": True,
    "json_schema": {
        "name": "angels_extraction",
        "schema": {
            "type": "object",
            "properties": {
                "investors": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "angel_investor_name": {
                                "type": "string",
                                "description": "The name of the angel investor that is investing",
                            },
                            "lead_investor": {
                                "type": "boolean",
                                "description": (
                                    "True if the angel investor was a lead investor in the deal. "
                                    "False otherwise"
                                ),
                            },
                            "financing_id": {
                                "type": "string",
                                "description": (
                                    "The financing id that matches the angel investor to the financing"
                                ),
                            },
                        },
                        "required": [
                            "angel_investor_name",
                            "lead_investor",
                            "financing_id",
                        ],
                        "additionalProperties": False,
                    },
                },
            },
            "required": ["investors"],
            "additionalProperties": False,
        },
    },
}
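And a rough sketch of the stop-sequence plus cleanup idea on Chat Completions, reusing the response_format dict from the block above (the exact stop strings are guesses at how your fine-tune spaces the array, so treat them as placeholders):

import json
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="ft:gpt-4.1-mini:my-org::abc123",   # placeholder fine-tune name
    messages=[{"role": "user", "content": "Extract the investors from ..."}],
    temperature=0,
    response_format=response_format,  # the strict dict defined above
    # halt on an immediately closed (empty) investors array, or once blank-line looping starts
    stop=['"investors": []', "\n\n\n"],
)

choice = resp.choices[0]
text = choice.message.content.strip()  # trim stray whitespace damage

try:
    data = json.loads(text)
except json.JSONDecodeError:
    # a stop sequence cut the object short; treat that as "no results" here
    data = {"investors": []}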

Thanks for the super in-depth reply, will look into it more. What I’m more worried about is that this was working just fine for months and suddenly stopped working…

I’ve trained on empty results as well. I’ve got a bunch of retries baked in if the JSON doesn’t parse, so it catches most of them. But where there is smoke there is fire, so I’m somewhat worried about the model quality on entries where there are investors.

Do you really think strict will help? I did a bunch of tests and the issue doesn’t really go away if I do that.

  • stop sequences
    • you can halt on an immediately closed investors array, and take the “stop” reason as no results

This is interesting, might try it.

I’m honestly probably just going to implement an output parser that validates the response format, then retries if it fails. Which is kind of stupid b/c that’s what response_format is supposed to do.

A “strict”: true response format is logit enforcement; that should make it impossible to write anything but valid JSON with the ordered keys, even if you train against it.

However, it can still go haywire inside strings. OpenAI also doesn’t seem to enforce a closing token where it is needed, hence the loopity-loops of strings AFTER the object is complete, or repeated JSONs.

“json_object” just turns on a post-training technique that OpenAI has put in place to make JSON output more likely. It doesn’t enforce anything, and you must prompt the AI extremely well (or fine-tune). Their idea of JSON in the weights may directly collide with the single-line output seen in your fine-tuning.

Try:
Drop the structured output parameter completely. See whether your model writes properly again without the weighted corruption caused by JSON mode, or by JSON schema enforcement that runs counter to its training.

I work at an AI company, and we have been using 4.1-mini since its release. In recent weeks, we have had several errors in the logs when trying to parse the AI responses as JSON. This was extremely rare before, but now we have errors every minute. We use response_format exactly as in the documentation and have not changed anything in recent months except for prompts. Basically, the model has been hallucinating when creating structured JSON, responding with an endless series of “}” or “\n”, etc. Has anyone found a solution for this? I would not like to migrate to 5-mini; our customers did not like it.

1 Like

The only thing I can think of is to add a retry wrapper and have it call again if the JSON doesn’t parse properly.

You can have a last-resort call with response_format = text and see if that works if all the retries fail. This catches 99.99% of my errors.
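Roughly what I mean, as a sketch (names are illustrative; it assumes the Python client and a response_format dict like the ones earlier in the thread):

import json
from openai import OpenAI

client = OpenAI()

def extract_with_retries(messages, response_format, model, max_retries=3):
    """Retry on unparseable JSON; fall back to plain text on the final attempt."""
    for _ in range(max_retries):
        resp = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=0,
            response_format=response_format,
        )
        try:
            return json.loads(resp.choices[0].message.content.strip())
        except json.JSONDecodeError:
            continue  # garbled output: try again

    # last resort: drop schema enforcement entirely and hope the model writes clean JSON
    resp = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,
        response_format={"type": "text"},
    )
    return json.loads(resp.choices[0].message.content.strip())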