Degraded Performance on gpt-4.1-mini ft

Getting absolutely jumbled responses back that make no sense. Using structured outputs, and it's returning things not even close to the specified schema.

Anyone else having these issues?

Check the temperature being sent. A top_p and a temperature under 0.5 are good for controlling sampling. Not “2”.
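For example, a minimal sketch (the fine-tune model name is just a placeholder):

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="ft:gpt-4.1-mini:my-org::abc123",  # placeholder for your fine-tuned model
    messages=[{"role": "user", "content": "Say hello."}],
    temperature=0.2,  # low temperature keeps sampling tight
    top_p=0.5,        # restrict nucleus sampling to the top half of probability mass
)
print(response.choices[0].message.content)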

I have a `ft:gpt-4.1-nano` that is acting as expected in writing typical text.

Then again, I have no clue whether you are asking about a new fine-tuned model, a new schema, a schema you've trained the AI to produce within examples, or whether you ever had it working.

1 Like

A fine-tune. I’ve had it for a year now and haven’t changed anything. Temp 0.

Every six months or so I’ll get a day like this where the output is completely incomprehensible. But now my startup is live, so it’s a bit more critical.

Tested it on old inputs as well to validate it wasn’t a change in the user_prompt - still not working.

Trained up a new model, learning rate multiplier 0.6x. Ran at top_p: 0.8. Structured output it wasn’t trained on. Seems to write just fine.

Without a system message:

{
  "raven_output_text": "Picture a lone raven perched on a desolate branch, eyes gleaming with a deep, unspoken understanding of the world. My vibe is one of quiet contemplation, an unshakable connection to the shadows that others shy away from. I see the beauty in decay, the poetry in solitude, and the profound truths hidden beneath the surface of everyday life. My outlook is not one of despair, but of a deep, abiding acceptance of the darker aspects of existence. I am drawn to the mysteries that linger in the corners of the mind, to the subtle energies that pulse beneath the visible world. I am a witness to the intricate dance of life and death, a participant in the eternal cycle of transformation. My presence is a reminder that there is depth and richness in the places others fear to explore, that the shadows hold their own kind of beauty.",
  "raven_mood": "serene, contemplative, mysterious"
}

With an identity-triggering system message:

{
  "raven_output_text": "I am the darkness in the light, the whisper in the silence. I am the part of you that you try to hide, the shadow that follows you everywhere. I am not afraid to be different, to be misunderstood. I embrace the darkness within me, and I find beauty in the macabre and the mysterious.",
  "raven_mood": "dark, introspective, slightly eerie"
}

2 Likes

Interesting. Mine is trained on JSON strings, and then I use structured outputs during inference.

I will basically just get cases of completely garbled JSON, or it will repeat the same key over and over again. Or worst of all (because it bypasses my JSON check), it gives back valid JSON that is also just completely wrong.

The above simple schema style for an output described within a system message was tested, as well as the actual use of json_schema.

If it is a failure to place keys in the correct nesting because the AI is able to fabricate anything it wants as output, you might upgrade the response_format enforcement to a strict structured schema using text.format (on Responses). Then the AI cannot deviate from the basic JSON form; it should be impossible with “strict” and no optional keys.

Also, document which endpoint is giving you grief if it is a failure on only one of Chat Completions or Responses.

This is Chat Completions. Can I even use ft:gpt-4.1-mini with Responses??? Thought it was recommended not to do that.

You can use a fine-tuned model with either endpoint. You should be able to create identical token inputs (except in the case of the currently broken ft:gpt-4.1 vision billing) and get similar output, to verify whether only one endpoint is symptomatic (like how gpt-3.5-turbo was, a while back, somehow broken on only Chat Completions, making multiple useless function calls that were impossible to stop).

There is no good justification for Responses with a fine-tuned model, though, as you cannot train on any of that endpoint’s hosted tools or on internal tool iterations.

1 Like

Well, wouldn’t there be justification b/c it wouldn’t mess up the JSON schema?

1 Like

A structured output specification, provided as a JSON schema, can be supplied to either endpoint.

  • Chat Completions: "response_format"
  • Responses: "text.format"

They also take slightly different shapes, with some of the container keys moved around. I’ve got some scripts to make the API switch less of a hassle, or even to fix up user confusion about what root level is needed for this parameter.
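Roughly, the same schema wrapped for each endpoint looks like this (a sketch from memory, so double-check against “get code”; the schema body here is a stand-in):

# stand-in for the real JSON schema object
schema = {"type": "object", "properties": {}, "required": [], "additionalProperties": False}

# Chat Completions: the schema container lives under "response_format" -> "json_schema"
chat_completions_kwargs = {
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "my_schema", "strict": True, "schema": schema},
    }
}

# Responses: the same pieces are flattened under "text" -> "format"
responses_kwargs = {
    "text": {
        "format": {"type": "json_schema", "name": "my_schema", "strict": True, "schema": schema}
    }
}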

You can place your schema into the prompts playground as if to run it. Then switch the endpoint from Chat Completions to Responses, and use “get code” to see what it does to your request.

1 Like

Oh, I meant using the Responses endpoint, not Chat Completions. I’m pretty sure it’s recommended not to do that for fine-tunes.

For context, I am using this with Chat Completions:

'response_format': {
    'type': 'json_schema',
    'json_schema': {
        'name': 'angels_extraction',
        'schema': {
            'type': 'object',
            'properties': {
                'investors': {
                    'type': 'array',
                    'items': {
                        'type': 'object',
                        'properties': {
                            'angel_investor_name': {
                                'type': 'string',
                                'description': 'The name of the angel investor that is investing'},
                            'lead_investor': {
                                'type': 'boolean',
                                'description': 'True if the angel investor was a lead investor in the deal. False otherwise'},
                            'financing_id': {
                                'type': 'string',
                                'description': 'The financing id that matches the angel investor to the financing'}},
                        'required': ['angel_investor_name', 'lead_investor', 'financing_id'],
                        'additionalProperties': False}}},
            'required': ['investors'],
            'additionalProperties': False}}},

Getting responses like this that take up the whole token limit

{“investors”: }\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n ’

From the playground too.

1 Like

Important to note:

  • JSON mode and JSON schema mode highly encourage “pretty” multi-line output. This was a poor decision by OpenAI, has no “off” switch under “strict” enforcement, and the tabs and linefeeds trained in this way are a direct cause of this persistent looping symptom, seen since gpt-4-turbo.
  • OpenAI should not mess with live models or AI inference. This shows continued disrespect for developers.
  • The fault in model inference is severe here. You’ve shown the AI continuing after a closed JSON, and also not even writing a key’s value but instead emitting linefeeds within a JSON.
  • Sampling fixes you can apply to work around this fault in inference, such as stop sequences and logit_bias, are also missing in Responses. Rejecting that poor endpoint is wise. (A quick sketch of the logit_bias approach follows below.)
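A quick sketch of the logit_bias part on Chat Completions, computing the newline-run token IDs at runtime rather than hard-coding them (this assumes gpt-4.1 uses the o200k_base tokenizer; the fine-tune name is a placeholder):

import tiktoken
from openai import OpenAI

client = OpenAI()

# assumption: gpt-4.1 models tokenize with o200k_base
enc = tiktoken.get_encoding("o200k_base")

# heavily penalize the tokens behind runs of blank lines; a blunt instrument,
# since it also blocks legitimate pretty-printed blank lines
bias = {}
for run in ("\n\n", "\n\n\n", "\n\n\n\n"):
    for token_id in enc.encode(run):
        bias[token_id] = -100

response = client.chat.completions.create(
    model="ft:gpt-4.1-mini:my-org::abc123",  # placeholder fine-tune name
    messages=[{"role": "user", "content": "Extract the investors from ..."}],
    temperature=0,
    logit_bias=bias,
)
print(response.choices[0].message.content)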

Let’s imagine your application used prompting instead of training (and training should have a good system prompt anyway, to depart from typical AI behavior).

Your json_schema object upgraded to “strict”, with formatting undamaged by the forum (chat playground format):

{
  "name": "angels_extraction",
  "strict": true,
  "schema": {
    "type": "object",
    "properties": {
      "investors": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "angel_investor_name": {
              "type": "string",
              "description": "The name of the angel investor that is investing"
            },
            "lead_investor": {
              "type": "boolean",
              "description": "True if the angel investor was a lead investor in the deal. False otherwise"
            },
            "financing_id": {
              "type": "string",
              "description": "The financing id that matches the angel investor to the financing"
            }
          },
          "required": [
            "angel_investor_name",
            "lead_investor",
            "financing_id"
          ],
          "additionalProperties": false
        }
      }
    },
    "required": [
      "investors"
    ],
    "additionalProperties": false
  }
}

Output:

Mitigations:

  • retraining on the JSON format that is enforced by a schema
    • ensure you use newlines and tabs exactly as in the responses you get from strict structured output
  • retraining on the JSON for when there are no investors
    • reinforcement learning on empty JSON cases may help the model produce and close the empty array correctly, as well as avoid hallucination and fabrication
  • an anyOf subschema for when there is no output
    • you can receive a different trigger with a simple key/string when the AI wants to report entity extraction failure (a rough sketch follows after this list)
  • stop sequences
    • you can halt on an immediately closed investors array, and take the “stop” finish reason as no results
    • you can halt on excessive linefeeds, or even two in a row, indicating the damage; you can then strip() to see whether there is a useful JSON string within, or close it yourself (a sketch is at the end of this post)
  • migrate
    • find a provider where your applications are respected, or use open-weight models under your own control.
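For the anyOf idea, roughly something like this, as a Python dict in the same shape as the json_schema container (the "result" wrapper and "no_results" key are names I'm inventing here; adapt as needed):

# per-investor object, abbreviated from the schema above
investor_item = {
    "type": "object",
    "properties": {
        "angel_investor_name": {"type": "string"},
        "lead_investor": {"type": "boolean"},
        "financing_id": {"type": "string"},
    },
    "required": ["angel_investor_name", "lead_investor", "financing_id"],
    "additionalProperties": False,
}

# hypothetical wrapper: the single "result" key is either an investors list
# or an explicit no-results report, so the model has a legal way to say "nothing found"
angels_extraction_anyof = {
    "name": "angels_extraction",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "result": {
                "anyOf": [
                    {
                        "type": "object",
                        "properties": {"investors": {"type": "array", "items": investor_item}},
                        "required": ["investors"],
                        "additionalProperties": False,
                    },
                    {
                        "type": "object",
                        "properties": {
                            "no_results": {
                                "type": "string",
                                "description": "Short reason why no angel investors were found",
                            }
                        },
                        "required": ["no_results"],
                        "additionalProperties": False,
                    },
                ]
            }
        },
        "required": ["result"],
        "additionalProperties": False,
    },
}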
Python for a 'strict' response_format field
response_format = {
    "type": "json_schema",
    "strict": True,
    "json_schema": {
        "name": "angels_extraction",
        "schema": {
            "type": "object",
            "properties": {
                "investors": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "angel_investor_name": {
                                "type": "string",
                                "description": "The name of the angel investor that is investing",
                            },
                            "lead_investor": {
                                "type": "boolean",
                                "description": (
                                    "True if the angel investor was a lead investor in the deal. "
                                    "False otherwise"
                                ),
                            },
                            "financing_id": {
                                "type": "string",
                                "description": (
                                    "The financing id that matches the angel investor to the financing"
                                ),
                            },
                        },
                        "required": [
                            "angel_investor_name",
                            "lead_investor",
                            "financing_id",
                        ],
                        "additionalProperties": False,
                    },
                },
            },
            "required": ["investors"],
            "additionalProperties": False,
        },
    },
}
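And a rough sketch of the stop-sequence plus cleanup idea on Chat Completions, reusing the response_format dict from the block above (the exact stop strings are guesses at how your fine-tune spaces the array, so treat them as placeholders):

import json
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="ft:gpt-4.1-mini:my-org::abc123",   # placeholder fine-tune name
    messages=[{"role": "user", "content": "Extract the investors from ..."}],
    temperature=0,
    response_format=response_format,  # the strict dict defined above
    # halt on an immediately closed (empty) investors array, or once blank-line looping starts
    stop=['"investors": []', "\n\n\n"],
)

choice = resp.choices[0]
text = choice.message.content.strip()  # trim stray whitespace damage

try:
    data = json.loads(text)
except json.JSONDecodeError:
    # a stop sequence cut the object short; treat that as "no results" here
    data = {"investors": []}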

Thanks for the super in-depth reply, will look into it more. What I’m more worried about is that this was working just fine for months and suddenly stopped working…

I’ve trained on empty results as well. I’ve got a bunch of retries baked in if the JSON doesn’t parse, so it catches most of them. But where there is smoke there is fire, so I’m somewhat worried about the model quality on entries where there are investors.

Do you really think strict will help? I did a bunch of tests and the issue doesn’t really go away if I do that.

  • stop sequences
    • you can halt on an immediately closed investors array, and take the “stop” reason as no results

This is interesting, might try it.

I’m honestly probably just going to implement an output parser that validates the response format, then retries if it fails. Which is kind of stupid b/c that’s what response_format is supposed to do.

A “strict”: true response format is logit enforcement; that should make it impossible to write anything but valid JSON with the ordered keys, even if you train against it.

However, it can still go haywire inside strings. OpenAI also doesn’t seem to enforce a closing token where it is needed, hence the loopity-loops of strings AFTER the object is complete, or repeated JSONs.

“json_object” just turns on a post-training technique that OpenAI has put in place to make JSON output more likely. It doesn’t enforce anything, and you must prompt the AI extremely well (or fine-tune). Their idea of JSON in the weights may directly collide with the single-line output seen in your fine-tuning.

Try:
Drop the structured output parameter completely. See whether your model writes properly again without the weighted corruption caused by JSON mode, or by JSON schema enforcement that runs counter to its training.

I work at an AI company, and we have been using 4.1-mini since its release. In recent weeks, we have had several errors in the logs when trying to parse the AI responses as JSON. This was extremely rare before, but now we have errors every minute. We use response_format exactly as in the documentation and have not changed anything in recent months except for prompts. Basically, the model has been hallucinating when creating structured JSON, responding with an endless series of “}” or “\n”, etc. Has anyone found a solution for this? I would not like to migrate to 5-mini; our customers did not like it.

1 Like

The only thing I can think of is to add a retry wrapper and have it call again if the JSON doesn’t parse properly.

You can have a last-resort call with response_format = text and see if that works if all the retries fail. This catches 99.99% of my errors.
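Roughly what I mean, as a sketch (names are illustrative; it assumes the Python client and a response_format dict like the ones earlier in the thread):

import json
from openai import OpenAI

client = OpenAI()

def extract_with_retries(messages, response_format, model, max_retries=3):
    """Retry on unparseable JSON; fall back to plain text on the final attempt."""
    for _ in range(max_retries):
        resp = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=0,
            response_format=response_format,
        )
        try:
            return json.loads(resp.choices[0].message.content.strip())
        except json.JSONDecodeError:
            continue  # garbled output: try again

    # last resort: drop schema enforcement entirely and hope the model writes clean JSON
    resp = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,
        response_format={"type": "text"},
    )
    return json.loads(resp.choices[0].message.content.strip())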