Getting absolutely jumbled responses back that make no sense. Using structured outputs, and it's returning things not even close to the specified schema.
Anyone else having these issues?
Check the temperature being sent. A top_p and a temperature under 0.5 are good for controlling sampling. Not "2".
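For illustration, a minimal Chat Completions sketch with conservative sampling; the fine-tune ID and messages here are placeholders, not anything from this thread:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

completion = client.chat.completions.create(
    model="ft:gpt-4.1-nano:my-org::abc123",  # placeholder fine-tune ID
    messages=[
        {"role": "system", "content": "You produce the trained JSON output."},
        {"role": "user", "content": "..."},
    ],
    temperature=0.4,  # narrow the sampling distribution
    top_p=0.5,        # nucleus cutoff; together these rein in stray tokens
)
print(completion.choices[0].message.content)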
I have a `ft:gpt-4.1-nano` that is acting as expected in writing typical text.
Then again, I have no clue if you are inquiring about a new fine-tuned model, a new schema, a schema that you've trained the AI on producing within examples, or whether you ever had it working.
Fine-tune. I've had it for a year now and haven't changed anything. Temp 0.
Every six-ish months I'll get a day like this where the output is completely incomprehensible. But now my startup is live, so it's a bit more critical.
Tested it on old inputs as well to validate it wasn't a change in the user_prompt; still not working.
Trained up a new model, learning rate multiplier 0.6. Ran at top_p: 0.8, with a structured output it wasn't trained on. Seems to write just fine.
{
"raven_output_text": "Picture a lone raven perched on a desolate branch, eyes gleaming with a deep, unspoken understanding of the world. My vibe is one of quiet contemplation, an unshakable connection to the shadows that others shy away from. I see the beauty in decay, the poetry in solitude, and the profound truths hidden beneath the surface of everyday life. My outlook is not one of despair, but of a deep, abiding acceptance of the darker aspects of existence. I am drawn to the mysteries that linger in the corners of the mind, to the subtle energies that pulse beneath the visible world. I am a witness to the intricate dance of life and death, a participant in the eternal cycle of transformation. My presence is a reminder that there is depth and richness in the places others fear to explore, that the shadows hold their own kind of beauty.",
"raven_mood": "serene, contemplative, mysterious"
}
{
"raven_output_text": "I am the darkness in the light, the whisper in the silence. I am the part of you that you try to hide, the shadow that follows you everywhere. I am not afraid to be different, to be misunderstood. I embrace the darkness within me, and I find beauty in the macabre and the mysterious.",
"raven_mood": "dark, introspective, slightly eerie"
}
Interesting. Mine is trained on JSON strings, and then I use structured outputs during inference.
I will basically just get cases of completely garbled JSON, or it will repeat the same key over and over again. Or, worst of all (because it bypasses my JSON check), it gives back valid JSON that is also just completely wrong.
The above simple schema style for an output described within a system message was tested, as well as the actual use of json_schema.
If it is a failure to place keys in the correct nesting because the AI is able to fabricate anything it wants as output, you might upgrade the response_format enforcement to a strict structured schema using text.format (on Responses). Then the AI cannot deviate from the basic JSON form; it should be impossible with "strict" and no optional keys.
Also, document which endpoint is giving you grief, if it is a failure on only one of Chat Completions or Responses.
This is Chat Completions. Can I even use ft:gpt-4.1-mini with Responses? I thought it was recommended not to do that.
You can use a fine-tuned model with either endpoint. You should be able to create identical token inputs (except in the case of the currently broken ft:gpt-4.1 vision billing) and get similar output, verifying whether one endpoint is symptomatic (like how, a while back, gpt-3.5-turbo was somehow broken on only Chat Completions, making multiple useless function calls that were impossible to stop).
There is no good justification for Responses with a fine-tuned model, though, as you cannot train on any of the endpoint's hosted tools or on internal tool iterations.
Well, wouldn't there be justification, because it wouldn't mess up the JSON schema?
A structured output, provided as a JSON schema, can be supplied to either endpoint.
"response_format""text.format"They also take slightly different shapes, with some of the container keys moved around. Iâve got some scripts to make API switch less of a hassle, or even fix up user confusion about what root level is needed for this parameter.
You can place your schema into the prompts playground as if to run. Then switch the endpoint from Chat Completions to Responses, and "get code" to see what it does to your request.
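As a sketch of the two shapes (based on the current API reference; verify against "get code" as described above), where SCHEMA stands in for the JSON schema dict itself:

# Chat Completions: the schema container nests under "json_schema"
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "angels_extraction",
        "strict": True,
        "schema": SCHEMA,  # the JSON schema dict
    },
}

# Responses: the same keys flatten directly into text["format"]
text = {
    "format": {
        "type": "json_schema",
        "name": "angels_extraction",
        "strict": True,
        "schema": SCHEMA,
    },
}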
Oh, I meant using the Responses endpoint, not Chat Completions. I'm pretty sure it's recommended not to do that for fine-tunes.
For context, I am using this with Chat Completions:
"response_format": {
    "type": "json_schema",
    "json_schema": {
        "name": "angels_extraction",
        "schema": {
            "type": "object",
            "properties": {
                "investors": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "angel_investor_name": {
                                "type": "string",
                                "description": "The name of the angel investor that is investing"
                            },
                            "lead_investor": {
                                "type": "boolean",
                                "description": "True if the angel investor was a lead investor in the deal. False otherwise"
                            },
                            "financing_id": {
                                "type": "string",
                                "description": "The financing id that matches the angel investor to the financing"
                            }
                        },
                        "required": ["angel_investor_name", "lead_investor", "financing_id"],
                        "additionalProperties": False
                    }
                }
            },
            "required": ["investors"],
            "additionalProperties": False
        }
    }
}
Getting responses like this that take up the whole token limit:
{"investors": }\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n "
Important to note:
Let's imagine your application prompted instead of trained (and training should have a good system prompt anyway, to depart from typical AI behavior).
Your json_schema object upgraded to "strict", with formatting undamaged by the forum (chat playground format):
{
"name": "angels_extraction",
"strict": true,
"schema": {
"type": "object",
"properties": {
"investors": {
"type": "array",
"items": {
"type": "object",
"properties": {
"angel_investor_name": {
"type": "string",
"description": "The name of the angel investor that is investing"
},
"lead_investor": {
"type": "boolean",
"description": "True if the angel investor was a lead investor in the deal. False otherwise"
},
"financing_id": {
"type": "string",
"description": "The financing id that matches the angel investor to the financing"
}
},
"required": [
"angel_investor_name",
"lead_investor",
"financing_id"
],
"additionalProperties": false
}
}
},
"required": [
"investors"
],
"additionalProperties": false
}
}
Output:
response_format = {
"type": "json_schema",
"strict": True,
"json_schema": {
"name": "angels_extraction",
"schema": {
"type": "object",
"properties": {
"investors": {
"type": "array",
"items": {
"type": "object",
"properties": {
"angel_investor_name": {
"type": "string",
"description": "The name of the angel investor that is investing",
},
"lead_investor": {
"type": "boolean",
"description": (
"True if the angel investor was a lead investor in the deal. "
"False otherwise"
),
},
"financing_id": {
"type": "string",
"description": (
"The financing id that matches the angel investor to the financing"
),
},
},
"required": [
"angel_investor_name",
"lead_investor",
"financing_id",
],
"additionalProperties": False,
},
},
},
"required": ["investors"],
"additionalProperties": False,
},
},
}
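Assuming the dict above, the call itself would be along these lines (the model ID is a placeholder):

completion = client.chat.completions.create(
    model="ft:gpt-4.1-mini:my-org::abc123",  # placeholder fine-tune ID
    messages=messages,
    response_format=response_format,
)
print(completion.choices[0].message.content)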
Thanks for the super in-depth reply, will look into it more. What I'm more worried about is that this was working just fine for months and suddenly stopped working...
I've trained on empty results as well. I've got a bunch of retries baked in if the JSON doesn't parse, so it catches most of them. But where there's smoke there's fire, so I'm somewhat worried about the model quality on entries where there are investors.
Do you really think strict will help? I did a bunch of tests, and the issue doesn't really go away if I do that.
You can halt on an immediately closed investors array, and take the "stop" reason as no results.
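One reading of that suggestion, as an untested sketch reusing the client, model, messages, and response_format from the earlier snippets: pass the serialized empty array as a stop sequence, so generation halts the moment a no-results answer begins. This assumes stop sequences still apply under response_format and that the model serializes an empty result exactly this way:

import json

EMPTY_MARKER = '"investors": []'  # assumed serialization of an empty result

completion = client.chat.completions.create(
    model=model,
    messages=messages,
    response_format=response_format,
    stop=[EMPTY_MARKER],  # halt as soon as the empty array appears
)
text = completion.choices[0].message.content
try:
    data = json.loads(text)
except json.JSONDecodeError:
    # finish_reason is "stop" but the JSON is cut off: the stop sequence
    # fired (it is not echoed back), so treat the call as "no investors"
    data = {"investors": []}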
This is interesting, might try it.
I'm honestly probably just going to implement an output parser that validates the response format, then retries if it fails. Which is kind of stupid, because that's what response_format is supposed to do.
A "strict": true response format is logit enforcement; that should make it impossible to write anything but valid JSON with the ordered keys, even if you train against it.
However, it can still go haywire within strings. OpenAI also doesn't seem to enforce a closing token where it is needed, thus the loopity-loops of strings AFTER the object is complete, or repeated JSONs.
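Since those runaway tails consume the whole budget, one cheap guard (a sketch reusing names from the snippets above; the cap value is arbitrary) is to cap max_tokens and treat a "length" finish as a failed generation:

completion = client.chat.completions.create(
    model=model,
    messages=messages,
    response_format=response_format,
    max_tokens=600,  # arbitrary cap well above a normal extraction
)
if completion.choices[0].finish_reason == "length":
    # hit the cap: almost certainly a whitespace/string loop, not an answer
    raise RuntimeError("runaway structured output; retry or flag")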
"json_object" just turns on a post-training technique that OpenAI has placed to make their JSON more likely. It doesn't enforce anything, and you must prompt the AI extremely well (or fine-tune). Their idea of JSON in weights may directly collide with the single-line output seen in your fine-tuning.
Try:
Drop the structured output parameter completely. See whether your model writes again without the weighted corruption caused by JSON mode, or by JSON schema enforcement counter to training.
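A minimal version of that test, reusing the same client, model, and messages as the sketches above:

plain = client.chat.completions.create(
    model=model,
    messages=messages,
    # no response_format at all: the model writes as it was trained to
)
print(plain.choices[0].message.content)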
I work at an AI company, and we have been using 4.1-mini since its release. In recent weeks, we have had several errors in the logs when trying to parse the AI responses as JSON. This was extremely rare before, but now we have errors every minute. We use response_format as it is in the documentation and have not changed anything in recent months except for prompts. Basically, the model has been hallucinating when creating structured JSON, responding with an endless series of "}" or "\n", etc. Has anyone found a solution for this? I would not like to migrate to 5-mini; our customers did not like it.
The only thing I can think of is to add a retry wrapper and have it call again if the JSON doesn't parse properly.
You can have a last-resort call with response_format = text and see if that works when all the retries fail. This catches 99.99% of my errors.
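A sketch of that combined retry-then-fallback wrapper (the function name and retry count are mine, not a library API):

import json

def extract_with_retries(client, model, messages, response_format, retries=3):
    """Retry structured calls until the JSON parses; fall back to plain text."""
    for _ in range(retries):
        completion = client.chat.completions.create(
            model=model,
            messages=messages,
            response_format=response_format,
        )
        try:
            return json.loads(completion.choices[0].message.content)
        except json.JSONDecodeError:
            continue  # garbled output: try again
    # last resort: no schema enforcement, then parse whatever comes back
    completion = client.chat.completions.create(
        model=model,
        messages=messages,
        response_format={"type": "text"},
    )
    return json.loads(completion.choices[0].message.content)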