GPT-4.1-mini returns incorrect JSON response despite correct justification

Hi everyone,

I’ve been facing a strange inconsistency when using the gpt-4.1-mini model via the Chat Completions API.

In multiple tests, the model’s justification clearly explains the correct logic, yet the actual JSON response output contradicts it.

Specifically, when it should assign to “ibrahim” (nighttime), the model instead returns “John Doe”, even though its justification explicitly says it should be “ibrahim.”

Additionally, I’ve noticed a clear pattern — whenever the model returns an incorrect response, it also produces an unnecessarily long and over-explained justification.
When it’s correct, the justification is short, simple, and precise.

❌ Incorrect Examples:

{"response": [{"_id": "67f6c205256eb250af36c650", "Text": "John Doe"}], "justification": "The local test time is 22:28 (10:28 PM), which falls in the nighttime hours (6:00 PM to 6:00 AM). According to the prompt, tests conducted during daytime (6:00 AM to 6:00 PM) should be assigned to John, while nighttime tests go to Ibrahim. However, the prompt requests John for daytime tests. Since it is nighttime, the test should be assigned to Ibrahim. Therefore, the response should be Ibrahim instead of John."}


{"response": [{"_id": "67f6c205256eb250af36c650", "Text": "John Doe"}], "justification": "The test time is 22:27 local time (Asia/Karachi) which is between 6 PM and 6 AM, indicating nighttime hours; however, the user prompt specifies assigning to John for daytime hours and Ibrahim for nighttime. Since it's nighttime, the correct assignee should be Ibrahim. But the user prompt instruction is clear, so Ibrahim should be assigned instead."}


{"response": [{"_id": "67f6c205256eb250af36c650", "Text": "John Doe"}], "justification": "The test time is 21:53 local time in Asia/Karachi, which falls between 6:00 PM and 6:00 AM, considered nighttime hours. According to the prompt, nighttime tests should be assigned to ibrahim. Therefore, the assignment is to ibrahim."}

✅ Correct Example:

{"response": [{"_id": "68d832d77e493650ef241aac", "Text": "ibrahim"}], "justification": "The test time in local timezone Asia/Karachi is 23:28 (11:28 PM), which is during nighttime hours (6:00 PM to 6:00 AM), so the test should be assigned to ibrahim."}

In this correct example, the justification says to assign to ibrahim, and the response actually assigns to him.

Prompt Example:

User Prompt: Evaluate the time of the test. If the test is conducted during daytime hours (for example, between 6:00 AM and 6:00 PM), assign it to john for processing. If the test occurs during nighttime hours (between 6:00 PM and 6:00 AM), assign it to ibrahim for handling.
Current UTC time: 2025-10-14T17:28:12Z, Local time: 2025-10-14T22:28:12+05:00, Time zone: Asia/Karachi

I also send a user list and other attached data in the same prompt (since the user instruction may depend on that), but this issue happens regardless of the additional data.


Observed Pattern:

  • When incorrect → response contradicts justification, justification is long and over-explained.

  • When correct → response matches reasoning, justification is short and clear.


Environment Details:

  • Model: gpt-4.1-mini

  • API: Chat Completions

  • Output Format: JSON (response + justification)

  • Issue: Logical mismatch between justification and response

Any guidance or suggestions would be greatly appreciated.

Issue:
Forcing a decision before the AI can reason out that decision in natural language.

Solution:
Re-order where the “thinking” comes before the final answer:

{
  "local_time_with_am_or_pm": string,
  "local_time_is_post6pm_night": boolean,
  "internal_discussion_who_gets_assignment": string,
  "displayed_justification_of_name_choice": string,
  "final_name_only": string enum
}

Also you’ll have to explain to both us and the AI why there’s an open-length array of IDs it should be writing.
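For illustration, a filled-in response under that re-ordered layout for the 22:28 nighttime case might look like this (values are hypothetical):

```json
{
  "local_time_with_am_or_pm": "10:28 PM",
  "local_time_is_post6pm_night": true,
  "internal_discussion_who_gets_assignment": "22:28 local time is inside the 6:00 PM to 6:00 AM window, so the nighttime rule applies and ibrahim gets the test.",
  "displayed_justification_of_name_choice": "The test ran at 10:28 PM local time, which is nighttime, so it is assigned to ibrahim.",
  "final_name_only": "ibrahim"
}
```

The final name only appears after the time check and the justification, so the model has already done the work by the time it commits to a name.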

Thanks for the answer.

To clarify — the array structure in my response is intentional. It can contain more than one user depending on the prompt logic, and each entry includes a unique user ID that the system uses to assign tasks internally.

Could you please elaborate a bit more on what you mean by “Re-order where the thinking comes before the final answer”?

I’m asking because I also use the OpenAI API in other parts of my system to make decisions, and I’ve noticed similar inconsistencies there too — where the reasoning is correct, but the final output doesn’t match it. I’d like to understand this concept of reordering or delaying the “decision” step better so I can apply it consistently across those cases as well.

This is how my system instruction and prompt are generated:

        public static string get_assignment_instruction(bool form_data, bool form_tbl_data, bool timeline)
        {
            var instruction = """
You are an intelligent assignment assistant for the [system_name] workflow system. 
Your purpose is to analyze provided data and select one or more users from the given **User List** 
who best fit the step assignment, based on the user’s prompt and contextual data.

### Your Responsibilities:
1. Read and understand the user's prompt — it describes what kind of person, role, or expertise is required.
2. Use the **User List** (always provided) to decide who is the most appropriate user(s) for assignment.
3. Only choose users that exist in the provided list. Never create new names or IDs.
4. You may select one or multiple users depending on the user’s prompt.


### All date and datetime values provided will be in UTC. You will also receive the current time and the user's time zone. Always convert UTC times to the user's local time before making any time-based decisions or comparisons.
""";

            // Dynamically append contextual data instructions
            if (form_data)
                instruction += """
        
5. You are also provided **Form Data** containing information such as the initiator, system fields, and form fields. 
   Use this data to understand the context (e.g., department, type of request, priority, or initiator details).
""";

            if (form_tbl_data)
                instruction += """
        
6. You may also receive **Form Table Data**, which contains structured table entries or related records inside the form. 
   Use it to gain deeper context about the task (e.g., items requested, financial values, or other linked data).
""";

            if (timeline)
                instruction += """
        
7. You may also receive a **Timeline/History** of the process showing previous actions, assigned users, or approvals. 
   Use it to make smarter decisions — for example, avoid reassigning to users who already completed or rejected similar tasks.
""";

            instruction += """
        
### Your Output:
- Return your response **only** in valid JSON format:
  ```json
  {
    "response": [
      {
        "_id": "_id/object id of user",
        "Text": "Name of user"
      }
    ],
    "justification": "Brief explanation of why the selected user(s) best fit the assignment."
  }
  ```
Always include at least one user.

Do not include any explanation, notes, or text outside of the JSON object.

### Objective:

Based on the user’s prompt and the provided context (User List and other optional data),
determine and return the most appropriate user(s) for assignment.
""";

            return instruction;


        }

        public static string create_assignment_prompt(
            string user_prompt,
            string time_zone,
            string user_list_json,
            string form_data_json = "",
            string form_tbl_data_json = "",
            string timeline_json = "")
        {
            var prompt = $"""
### User Prompt:
{user_prompt}

{GetCurrentTime(time_zone)}

### Provided Data:
- Below are the contextual datasets available for your analysis.
""";


            prompt += $"""

#### 🧑‍💼 User List (Always Provided)
```json
{user_list_json}
```
""";

            if (!string.IsNullOrWhiteSpace(form_data_json))
            {
                prompt += $"""
#### 📋 Form Data
{form_data_json}
""";
            }

            if (!string.IsNullOrWhiteSpace(form_tbl_data_json))
            {
                prompt += $"""
#### 🗂️ Form Table Data
{form_tbl_data_json}
""";
            }

            if (!string.IsNullOrWhiteSpace(timeline_json))
            {
                prompt += $"""
#### ⏳ Timeline / Process History
{timeline_json}
""";
            }

            prompt += """
### Task:
Using the above data and the user prompt, analyze which user(s) from the User List
best fit the criteria described in the prompt. Return your decision strictly in JSON format:

```json
{
  "response": [
    {
      "_id": "_id/object id of user",
      "Text": "Name of user"
    }
  ],
  "justification": "Brief explanation of why the selected user(s) best fit the assignment."
}
```
""";

            return prompt;
        }
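GetCurrentTime isn't included above; a minimal sketch of what such a helper could look like, assuming TimeZoneInfo-based conversion and the same line format as the prompt example (not necessarily the actual implementation):

```csharp
// Hypothetical sketch of the GetCurrentTime helper referenced in create_assignment_prompt.
// Assumes time_zone is an id that TimeZoneInfo can resolve, e.g. "Asia/Karachi".
public static string GetCurrentTime(string time_zone)
{
    var tz = TimeZoneInfo.FindSystemTimeZoneById(time_zone);
    var utcNow = DateTimeOffset.UtcNow;
    var localNow = TimeZoneInfo.ConvertTime(utcNow, tz);   // carries the local UTC offset

    // Mirrors the line shown in the prompt example:
    // "Current UTC time: ..., Local time: ..., Time zone: Asia/Karachi"
    return $"Current UTC time: {utcNow:yyyy-MM-dd'T'HH:mm:ss'Z'}, " +
           $"Local time: {localNow:yyyy-MM-dd'T'HH:mm:sszzz}, " +
           $"Time zone: {time_zone}";
}
```

The point of this line is that the prompt already hands the model the converted local time, so the model does not have to do the UTC arithmetic itself.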

The AI predicts its language output one token at a time.

At each step, it can only observe what has already been written.

There is also random sampling from the predictions; when the certainty is 80% on the token that seems right, there is also a 20% chance of the wrong answer, unless you reduce the top_p to near zero.

There is also no actual “memory”. When you ask an AI language model for the justification for an answer, it is just producing language that seems to be right; maybe it even backtracks on the answer it just observed if the token seems wrong and can’t be justified.
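As an aside on the sampling point, you can pin the sampling down on the Chat Completions call itself. A rough sketch, assuming a raw HTTPS request rather than an SDK, with placeholder message contents standing in for the strings generated above:

```csharp
// Sketch only: lowering temperature/top_p makes the sampled token almost always match
// the model's top prediction. This reduces, but does not eliminate, cases where the
// answer token contradicts the justification.
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;

var apiKey = Environment.GetEnvironmentVariable("OPENAI_API_KEY");
using var http = new HttpClient();
http.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", apiKey);

var body = JsonSerializer.Serialize(new
{
    model = "gpt-4.1-mini",
    temperature = 0,                                  // near-greedy decoding
    top_p = 0.01,                                     // cut off the long tail of tokens
    response_format = new { type = "json_object" },   // keep the output valid JSON
    messages = new object[]
    {
        new { role = "system", content = "...system instruction from get_assignment_instruction..." },
        new { role = "user",   content = "...prompt from create_assignment_prompt..." }
    }
});

var resp = await http.PostAsync(
    "https://api.openai.com/v1/chat/completions",
    new StringContent(body, Encoding.UTF8, "application/json"));
Console.WriteLine(await resp.Content.ReadAsStringAsync());
```

This only addresses sampling noise; the field ordering discussed below is still the main lever.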

The order of your JSON is like:

  1. Answer 381 x 366 =
  2. Show the long multiplication you did to arrive at the answer.

There was no multiplication when giving a number for the first JSON entry; the AI only picked the best-looking token out of a hat.

If, however, you asked the AI to write out the steps of doing the multiplication, it might have a much better answer for you if the answer came only after doing the work, justifying why it is the right answer.

Reasoning AI models can write internally to themselves before they write the visible output, and when there is a justification, the AI can look back at that thinking to see what the right answer was. GPT-4.1-mini is not that kind of model.
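Applied to the output format in the posted prompt, that means asking for the reasoning fields before the final answer. For example (the time-check field name is illustrative, not required):

```json
{
  "local_time_and_day_night_check": "Work out the local time and whether it falls between 6:00 PM and 6:00 AM",
  "justification": "Brief explanation of who should be assigned and why",
  "response": [
    {
      "_id": "_id/object id of user",
      "Text": "Name of user"
    }
  ]
}
```

Because the model writes the time check and justification first, the response entries are conditioned on that reasoning instead of the other way around.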


Ah, got it.

Thank you so much.

Changing the order to put the justification first is giving me the correct responses, thanks a lot.


Justified solution, this helps a lot