[URGENT] GPT-4.1-mini consumes an unknown amount of input tokens

Today I noticed a very worrying behaviour of 4.1-mini.

I sent 119 input tokens, but received almost 40x more input tokens in the response usage.

Two short messages and almost 5k input tokens. The response contains multiple duplicated structured outputs with the same message identifier, which also count as output usage.

This is a serious issue that can cost someone in production quite a lot.

To reproduce this behaviour you need to configure tools and structured output, and send an unresolvable message.

Here is an example from logs:

Summary 4.1-mini

Repeated output 4.1-mini

The same request to gpt-4.1-nano and gpt-4.1 consumes 119 input tokens as expected.

The most expensive request that I’ve captured.

I would be interested in having you save and share a Prompts Playground preset. Your function and schema are not transparent enough for anyone to replicate your concern.

You have a json_schema. You don’t show us what that is, or if you are using “strict”.

It is unlikely that your items array is supposed to output “location” or “temperature”, though.

You are using Responses, which has an internal tool iterator. It is also clearly not stopping output correctly on ANY special token of the ChatML container, as you can see restarts of “assistant” that the API backend has captured and placed in an output list.

Clearly, Responses should catch such a case of a repeated assistant output pattern, such as the model restarting a message, which it is not doing.

Here’s what I think is the basic fault showing through:

The AI must be post-trained on calling functions. It must not close the preamble of the chat message container and start an output, but instead must address with a different token to initiate the tool recipient backend handler.

Your AI is going right for the user response. It cannot backtrack and correct itself when it should have called a function (well, technically it can, but it won’t). Thus, you get your schema used.

Then the AI sees that it has made an error (there is no function call), so it continues repeating, but by that point it basically has even less idea how to invoke a function.

This could be mitigated a bit on Chat Completions with logit_bias. Responses is for developer-as-dummy, though, and has no facilities for token-level data or manipulation.
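
As a rough sketch of that idea on Chat Completions (the token ID below is a placeholder, not a real ID for this model’s tokenizer; you would look the real ones up, e.g. with tiktoken, before relying on this):

import OpenAI from "openai";

const openai = new OpenAI();

// Sketch only: logit_bias maps token IDs (as strings) to a bias from -100 to 100.
// "12345" is a placeholder token ID, not one actually looked up for gpt-4.1-mini.
const completion = await openai.chat.completions.create({
    model: "gpt-4.1-mini",
    messages: [
        { role: "developer", content: "You are a helpful assistant." },
        { role: "user", content: "What's the status of task #1?" },
    ],
    logit_bias: {
        "12345": -100, // e.g. discourage the token that restarts an assistant message
    },
});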

Here’s what I would do to address the concern:

Give the AI an anyOf schema. It would have a second subschema, “error”, described as a schema path for when there is insufficient information or the model has improperly diverted straight to output.
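
A rough sketch of what that could look like (the names and wording here are purely illustrative, not from the original post; note that with strict structured outputs the root must be a plain object, so the anyOf has to sit under a property rather than at the root):

// Sketch only: a response format with an "error" subschema as an escape path.
const format = {
    type: "json_schema",
    name: "weather_report_or_error",
    strict: true,
    schema: {
        type: "object",
        properties: {
            result: {
                anyOf: [
                    {
                        type: "object",
                        description: "Normal answer when the data is available.",
                        properties: {
                            items: {
                                type: "array",
                                items: {
                                    type: "object",
                                    properties: {
                                        location: { type: "string" },
                                        temperature: { type: "number" },
                                    },
                                    required: ["location", "temperature"],
                                    additionalProperties: false,
                                },
                            },
                        },
                        required: ["items"],
                        additionalProperties: false,
                    },
                    {
                        type: "object",
                        description: "Error path: use when there is insufficient information to answer.",
                        properties: {
                            error: { type: "string" },
                        },
                        required: ["error"],
                        additionalProperties: false,
                    },
                ],
            },
        },
        required: ["result"],
        additionalProperties: false,
    },
};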

Secondly, you could put into the function description itself some retraining on how to invoke that function. Without going deep into why, you’d write: “To send to this tool recipient you immediately generate ` to`, as in `assistant to`, and you do not begin a normal response”.
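
For example, placing that wording in the tool’s own description field (the tool name and parameters here are illustrative):

// Sketch only: invocation guidance placed in the function's description.
const getWeatherTool = {
    type: "function",
    name: "get_weather",
    description:
        "Returns the weather for a location. To send to this tool recipient " +
        "you immediately generate ` to`, as in `assistant to`, and you do not " +
        "begin a normal response.",
    parameters: {
        type: "object",
        properties: {
            location: { type: "string" },
        },
        required: ["location"],
        additionalProperties: false,
    },
    strict: true,
};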

Then OpenAI must fix the model, fix the endpoint, and so many other things that make Responses not worth using just for two internal tools that do not perform for a developer’s needs.


I tested on both the Chat Completions API and the Responses API. This issue is related only to gpt-4.1-mini.

Here is a minimal reproduction using the SDK:

import OpenAI from "openai";
import { z } from "zod";
import { zodToJsonSchema } from "zod-to-json-schema";

const openai = new OpenAI();

const response = await openai.responses.create({
    model: "gpt-4.1-mini",
    input: [
        {
            type: "message",
            role: "developer",
            content: [{ type: "input_text", text: "You are a helpful assistant." }],
        },
        {
            type: "message",
            role: "user",
            content: [{ type: "input_text", text: "What's the status of task #1?" }],
        },
    ],
    store: true,
    tools: [
        {
            // tool with an empty parameter object
            type: "function",
            name: "get_locations",
            parameters: zodToJsonSchema(z.object({})),
            strict: true,
        },
        {
            type: "function",
            name: "get_weather",
            parameters: zodToJsonSchema(
                z.object({
                    location: z.string(),
                    temperature: z.number(),
                }),
            ),
            strict: true,
        },
    ],
    // structured output format reusing the same field names as get_weather
    text: {
        format: {
            type: "json_schema",
            name: "weather-report",
            schema: zodToJsonSchema(
                z.object({
                    items: z.array(
                        z.object({
                            location: z.string(),
                            temperature: z.number(),
                        }),
                    ),
                }),
            ),
            strict: true,
        },
    },
});

IMO in this scenario it should either refuse or hallucinate a single output with the correct schema (as 4.1 and 4.1-nano do). There is clearly something wrong under the hood. As I mentioned, the response contains multiple structured output entries with the same content and identifier.
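
Until that is fixed, a crude guard in the calling code can at least surface the problem. A sketch continuing from the `response` object in the reproduction above (the alert threshold is an arbitrary number of mine):

// Sketch only: log usage and collapse output items repeating the same identifier.
const seen = new Set<string>();
const uniqueOutput = response.output.filter((item) => {
    const key = item.id ?? JSON.stringify(item);
    if (seen.has(key)) return false; // drop duplicated entries
    seen.add(key);
    return true;
});

console.log("input tokens:", response.usage?.input_tokens);
console.log("output tokens:", response.usage?.output_tokens);
console.log("output items:", response.output.length, "unique:", uniqueOutput.length);

// Arbitrary alert threshold for a request made of two short messages.
if ((response.usage?.input_tokens ?? 0) > 1000) {
    console.warn("Unexpectedly high input token usage for this request");
}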

Related: I’ve noticed that 4.1-mini is surprisingly poor at choosing the right function from a set of functions that are quite similar. I have a suspicion that someone at OpenAI ought to review how 4.1-mini and nano work with functions, because they are definitely inferior to 4o-mini in this particular regard.
