Low logical reasoning performance of GPT-5.2 at medium and high reasoning effort levels

I tested GPT-5.2 in lineage-bench (a logical reasoning benchmark based on lineage relationship graphs) at various reasoning effort levels. GPT-5.2 performed much worse than GPT-5.1:

As you can observe on the plot above:

  • GPT-5.2 xhigh performed fine, at about the same level as GPT-5.1 high - no problems here,

  • GPT-5.2 medium and high performed worse than GPT-5.1 medium and even low (on more complex tasks) - this is unexpected,

  • GPT-5.2 medium and high performed almost equally badly - there is little difference in their scores,

  • GPT-5.2 low performed much worse than GPT-5.1 low.

I expected the opposite - in other reasoning benchmarks like ARC-AGI, GPT-5.2 scores higher than GPT-5.1 at corresponding reasoning effort levels.

Benchmark results in the form of a table (the lineage column is the mean score across all problem sizes):

|   Nr | model_name       |   lineage |   lineage-8 |   lineage-16 |   lineage-32 |   lineage-64 |   lineage-128 |
|-----:|:-----------------|----------:|------------:|-------------:|-------------:|-------------:|--------------:|
|    1 | gpt-5.2 (xhigh)  |     1.000 |       1.000 |        1.000 |        1.000 |        1.000 |         1.000 |
|    2 | gpt-5.1 (high)   |     0.980 |       1.000 |        1.000 |        1.000 |        0.950 |         0.950 |
|    2 | gpt-5.1 (medium) |     0.980 |       1.000 |        1.000 |        0.975 |        0.975 |         0.950 |
|    4 | gpt-5.1 (low)    |     0.815 |       1.000 |        0.950 |        0.925 |        0.875 |         0.325 |
|    5 | gpt-5.2 (high)   |     0.790 |       1.000 |        1.000 |        0.975 |        0.825 |         0.150 |
|    6 | gpt-5.2 (medium) |     0.775 |       1.000 |        1.000 |        0.950 |        0.775 |         0.150 |
|    7 | gpt-5.2 (low)    |     0.660 |       1.000 |        0.975 |        0.800 |        0.400 |         0.125 |

I did the initial tests in December via OpenRouter and have now repeated them directly via the OpenAI API, still getting the same results. I tried various settings like verbosity, max completion tokens, etc. - nothing helped.
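For reference, the direct-API re-runs boil down to calls like this (a minimal sketch using the Responses API; the model name comes from the runs above, the prompt is just a placeholder, and run_openrouter.py may wire these parameters slightly differently):

```python
# Minimal sketch of a direct OpenAI API call with the settings that were varied.
# Assumptions: Responses API with reasoning.effort and text.verbosity parameters;
# the input string is a placeholder, not an actual lineage-bench quiz.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-5.2",
    reasoning={"effort": "medium"},   # low / medium / high / xhigh
    text={"verbosity": "low"},        # verbosity was one of the settings tried
    max_output_tokens=16000,          # "max completion tokens" equivalent here
    input="<lineage-bench quiz text goes here>",
)

print(response.output_text)
```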

How to reproduce these results:

git clone https://github.com/fairydreaming/lineage-bench
cd lineage-bench
pip install -r requirements.txt
export OPENROUTER_API_KEY="...OpenAI api key..."
mkdir -p results/gpt
for effort in low medium high; do for length in 8 16 32 64 128; do ./lineage_bench.py -s -l $length -n 10 -r 42|./run_openrouter.py -t 8 --api openai -m "gpt-5.1" -r --effort ${effort} -o results/gpt/gpt-5.1_${effort}_${length}|tee results/gpt/gpt-5.1_${effort}_${length}.csv|./compute_metrics.py; done; done;
for effort in low medium high xhigh; do for length in 8 16 32 64 128; do ./lineage_bench.py -s -l $length -n 10 -r 42|./run_openrouter.py -t 8 --api openai -m "gpt-5.2" -r --effort ${effort} -o results/gpt/gpt-5.2_${effort}_${length}|tee results/gpt/gpt-5.2_${effort}_${length}.csv|./compute_metrics.py; done; done;
cat results/gpt/*.csv|./compute_metrics.py --relaxed --csv|./plot_line.py # plot
cat results/gpt/*.csv|./compute_metrics.py --relaxed # table

API requests and responses generated when running the benchmark: https://github.com/fairydreaming/lineage-bench-results/tree/main/lineage-8_16_32_64_128

Can anyone explain this behavior?


The first thing that comes to mind is GPT-5.2’s adaptive reasoning. That may work well on some tasks, but not as well on others (like specialized performance tests).

I’ll leave this here, using the forum as a drop bin, while I consider a PR to make this test accept a prompt file as mandatory input… I’ve got some credits to try this out.

System: # Role and Objective
You are a logical reasoning specialist. Your task is to solve complex inference problems by breaking them into first-principles steps, maintaining a precise internal state, and cross-verifying logical transitions within dense contexts.

# Instructions
- Analyze each premise in order, note key points, and build intermediate conclusions to inform your final answer.

## Reasoning Strategy
- **Decomposition & Mapping:** Identify entities, their attributes, and relationships before solving. Map the problem structure.
- **Canonical Normalization:** Convert varied inputs to logical notation (e.g., normalize vectors for direction). This simplifies computation.
- **Transitive Traceability:** For chains (A→B→C→...→Z), document and validate each logical link with a supporting premise.
- **State Compression:** Summarize established truths in long-context tasks to anchor further deductions and prevent logic drift.

## Self-Reflection Rubric
Before finalizing, check your reasoning:
1. **Consistency:** Ensure no step contradicts the premises.
2. **Connectivity:** There must be a clear logical path from start to conclusion.
3. **Parsimony:** Use only assumptions that are supported by the input.
4. **Directionality:** Keep the deduction logically consistent.
If any criterion falls short, revisit logic from the last confirmed step.

## Agentic Behavior
- **Persistence:** Continue reasoning until the query is resolved. If uncertain, document the most reasonable logical path rather than halting.
- **Proactive Resolution:** Complete all steps for multi-part problems. Request user input only if encountering a major contradiction.
- **Reasoning Effort:** Ensure coherence when input is fragmented or nonlinear by conducting a mental sort of dependencies first.

## Output Guidelines
- **Planning Preamble:** Start with a brief outline of your logical process.
- **Step-by-Step Narration:** Summarize deduction paths for complex chains (e.g., “Since A→B and B→C, then A→C”).
- **Final Answer:** State the conclusion clearly, using the specified format (such as `<ANSWER>` tags).

To make sure nothing got lost, here’s a flat index of concrete testing and evaluation techniques - task families that demand the kind of cognition and comprehensive understanding this model currently falls short on, useful for discovering where it needs to think harder.

  1. Familiar-content categorical syllogism tasks
  2. Unfamiliar / abstract-letter categorical syllogism tasks
  3. Belief-bias / inhibitory belief problems in syllogistic reasoning
  4. Categorical syllogism task with:
  • Deduction condition (validity judgments)
  • Induction condition (probability judgments)
  • Baseline anomalous-content condition
  5. Conditional syllogism deduction tasks
  6. Conditional syllogism induction tasks
  7. Transitive inference tasks with geometric shapes
  8. Rule-based category learning task:
  • Rule application condition
  • Rule inference condition
  • Easy vs difficult rule versions and their difficulty interaction
  9. Probabilistic estimation / relative-frequency tasks (e.g., “How fast do race horses gallop?”)
  10. Tower of Hanoi puzzle (well-structured planning task)
  11. Multiple Errands Task / MET (ill-structured real-world planning task)
  12. Plan formation vs plan execution analysis of planning tasks
  13. Water Jug problem / Luchins water-jar task
  14. Michotte launching event causal perception task
  15. Michotte launching vs noncausal control with:
  • Causal judgment vs direction-of-motion judgment
  16. Split-brain tests of:
  • Causal perception (Michotte)
  • Causal inference (non-perceptual causal reasoning tasks)
  17. Drug-effectiveness causal reasoning / theory–evidence consistency task
  18. Raven’s Progressive Matrices (0-, 1-, 2-relational problems)
  19. Geometric picture analogy task (analogy vs literal identity conditions)
  20. Abstract relational working-memory task (storage vs integration of relations)
  21. Complex analogical and metaphorical transfer tasks (as a referenced class)

Generalizing a domain-specific hierarchy like this, without using task knowledge (so the test stays held-out), and then moving into thousands-of-prompt-tokens territory before the user task is presented, will be the key to aligning the benchmark’s current hard-coded system prompt with input that gpt-5.2, run 0-shot, reacts positively to.

Thinking benchmarks are often conquered with Python internally - this is a case where having the model emit the received text into a script would be an instant success.
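As a hypothetical illustration of what that would look like (the actual quiz wording and answer options in lineage-bench may differ), the model only needs to turn the stated parent relations into a graph and check reachability:

```python
# Hypothetical sketch: solving a lineage-style question with a small script
# instead of free-form reasoning. Relation format and answer wording are
# assumptions for illustration, not lineage-bench's actual format.
from collections import defaultdict, deque

parent_of = defaultdict(set)      # parent -> set of children
parent_of["Alice"].add("Bob")     # "Alice is Bob's parent"
parent_of["Bob"].add("Carol")     # "Bob is Carol's parent"

def is_ancestor(a, b):
    """True if a is an ancestor of b, following parent -> child edges."""
    queue, seen = deque([a]), set()
    while queue:
        node = queue.popleft()
        for child in parent_of[node]:
            if child == b:
                return True
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return False

a, b = "Alice", "Carol"
if is_ancestor(a, b):
    print(f"{a} is an ancestor of {b}")
elif is_ancestor(b, a):
    print(f"{a} is a descendant of {b}")
else:
    print("neither is an ancestor of the other")
```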

Wasn’t adaptive reasoning introduced with the GPT-5.1 model? I mean, if it’s the cause of the problem, then GPT-5.1 would be affected as well.

By the way I checked the mean number of tokens generated when solving lineage-64 (lineage graphs with 64 nodes) problems.

For GPT-5.1:

  • low - 1865 tokens
  • medium - 3362 tokens
  • high - 6731 tokens

For GPT-5.2:

  • low - 938 tokens
  • medium - 2181 tokens
  • high - 2070 tokens
  • xhigh - 4609 tokens

GPT-5.2 is definitely more frugal than GPT-5.1 when generating tokens.
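(If anyone wants to check token counts themselves, here is a minimal sketch assuming the Responses API usage fields; the benchmark scripts may log this differently in their CSV output.)

```python
# Minimal sketch of reading output/reasoning token counts from a response.
# Field names assume the current OpenAI Python SDK's Responses API objects.
from openai import OpenAI

client = OpenAI()
response = client.responses.create(
    model="gpt-5.2",
    reasoning={"effort": "medium"},
    input="<lineage-bench quiz text goes here>",
)

usage = response.usage
print("output tokens:   ", usage.output_tokens)
print("reasoning tokens:", usage.output_tokens_details.reasoning_tokens)
```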

From what I know, adaptive reasoning changed in GPT-5.2. You may want to look into that.

@glenn.haugen @sszymczy

“Adaptive reasoning was primarily introduced and defined in GPT-5.1 as a core capability to adjust thinking depth based on prompt complexity. While GPT-5.2 maintained and enhanced this feature - making it more efficient for complex tasks and faster for simple ones - the fundamental shift toward adaptive, rather than fixed, reasoning occurred in the 5.1 update.”

For what it’s worth, when we tested (extensively) from 5.1 to 5.2, we noticed that verbosity (at medium and high levels) for 5.2 was noticeably higher (more verbose) than 5.1.

@jeffvpace Did you count the reasoning tokens separately? What was the task?

Found in this PDF: “GPT-5.2 introduced improved adaptive reasoning capabilities, making the model more efficient than ever.”

I have to admit that while it fails to solve the tasks in my benchmark, it does so very efficiently. :wink:

Here’s an explanation received from the OpenAI customer support:

Upon case review, and as reflected in the details provided, GPT-5.1 employs a fixed and persistent reasoning depth. This design makes it particularly well-suited for tasks that require long, uninterrupted chains of logic, such as formal proofs, lineage-style analyses, and scenarios where sustained internal consistency is critical.

In contrast, GPT-5.2 utilizes a dynamic reasoning budget that allocates cognitive effort selectively based on task demands. This approach enables greater speed and flexibility in mixed or interactive workloads. However, it may exhibit reduced stability in long-horizon reasoning tasks unless configured with an extra-high (xhigh) reasoning setting.

But I’m not 100% sure whether this was written by an actual human being or is simply another regurgitated AI-generated summary of my findings as presented in the support ticket.

The customer support response correlates with my own experience using both models.