I tested GPT-5.2 on lineage-bench (a logical reasoning benchmark based on lineage relationship graphs) at various reasoning effort levels. GPT-5.2 performed much worse than GPT-5.1:
I did the initial tests in December via OpenRouter and have now repeated them directly via the OpenAI API with the same results. I tried various settings (verbosity, max completion tokens, etc.) and nothing helped.
How to reproduce these results:
```bash
git clone https://github.com/fairydreaming/lineage-bench
cd lineage-bench
pip install -r requirements.txt

# run_openrouter.py reads OPENROUTER_API_KEY even when called with --api openai,
# so put your OpenAI API key here (per the original instructions)
export OPENROUTER_API_KEY="...OpenAI api key..."

mkdir -p results/gpt

# GPT-5.1 sweep: 3 effort levels x 5 problem sizes, 10 quizzes each (seed 42)
for effort in low medium high; do
  for length in 8 16 32 64 128; do
    ./lineage_bench.py -s -l $length -n 10 -r 42 \
      | ./run_openrouter.py -t 8 --api openai -m "gpt-5.1" -r --effort ${effort} -o results/gpt/gpt-5.1_${effort}_${length} \
      | tee results/gpt/gpt-5.1_${effort}_${length}.csv \
      | ./compute_metrics.py
  done
done

# GPT-5.2 sweep: same grid, plus the xhigh effort level
for effort in low medium high xhigh; do
  for length in 8 16 32 64 128; do
    ./lineage_bench.py -s -l $length -n 10 -r 42 \
      | ./run_openrouter.py -t 8 --api openai -m "gpt-5.2" -r --effort ${effort} -o results/gpt/gpt-5.2_${effort}_${length} \
      | tee results/gpt/gpt-5.2_${effort}_${length}.csv \
      | ./compute_metrics.py
  done
done

cat results/gpt/*.csv | ./compute_metrics.py --relaxed --csv | ./plot_line.py  # plot
cat results/gpt/*.csv | ./compute_metrics.py --relaxed                         # table
```
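Before committing to the full sweep, a single small run (same flags as above) is enough to sanity-check the setup:

```bash
# Smoke test: one 8-person quiz at low effort, metrics printed directly
./lineage_bench.py -s -l 8 -n 1 -r 42 \
  | ./run_openrouter.py -t 1 --api openai -m "gpt-5.2" -r --effort low \
  | ./compute_metrics.py
```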
The first thing that comes to mind is GPT-5.2’s adaptive reasoning. That may work well on some tasks, but not as well on others (like specialized performance tests).
I’mma leave this here, using the forum as a drop bin, while I consider a PR to make this test accept a prompt file as mandatory input (a sketch of the idea follows below)… I’ve got some credits to try this out.
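For reference, a minimal sketch of the direction I have in mind, assuming run_openrouter.py uses argparse; the flag name and helpers are hypothetical:

```python
import argparse

def add_prompt_file_arg(parser: argparse.ArgumentParser) -> None:
    # Hypothetical flag: let the caller supply the system prompt from a file
    parser.add_argument("--system-prompt-file", default=None,
                        help="read the system prompt from this file instead of the built-in one")

def load_system_prompt(path: str | None, fallback: str) -> str:
    # Fall back to the current hard-coded prompt when no file is given
    if path is None:
        return fallback
    with open(path, encoding="utf-8") as f:
        return f.read()
```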
System:

# Role and Objective
You are a logical reasoning specialist. Your task is to solve complex inference problems by breaking them into first-principles steps, maintaining a precise internal state, and cross-verifying logical transitions within dense contexts.
# Instructions
- Analyze each premise in order, note key points, and build intermediate conclusions to inform your final answer.
## Reasoning Strategy
- **Decomposition & Mapping:** Identify entities, their attributes, and relationships before solving. Map the problem structure.
- **Canonical Normalization:** Convert varied inputs to logical notation (e.g., normalize vectors for direction). This simplifies computation.
- **Transitive Traceability:** For chains (A→B→C→...→Z), document and validate each logical link with a supporting premise.
- **State Compression:** Summarize established truths in long-context tasks to anchor further deductions and prevent logic drift.
## Self-Reflection Rubric
Before finalizing, check your reasoning:
1. **Consistency:** Ensure no step contradicts the premises.
2. **Connectivity:** There must be a clear logical path from start to conclusion.
3. **Parsimony:** Use only assumptions that are supported by the input.
4. **Directionality:** Keep the direction of each relation consistent; do not swap ancestor and descendant mid-chain.
If any criterion falls short, revisit logic from the last confirmed step.
## Agentic Behavior
- **Persistence:** Continue reasoning until the query is resolved. If uncertain, document the most reasonable logical path rather than halting.
- **Proactive Resolution:** Complete all steps for multi-part problems. Request user input only if encountering a major contradiction.
- **Reasoning Effort:** Ensure coherence when input is fragmented or nonlinear by mentally sorting dependencies first.
## Output Guidelines
- **Planning Preamble:** Start with a brief outline of your logical process.
- **Step-by-Step Narration:** Summarize deduction paths for complex chains (e.g., “Since A→B and B→C, then A→C”).
- **Final Answer:** State the conclusion clearly, using the specified format (such as `<ANSWER>` tags).
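In case anyone wants to try this prompt directly against the API, here’s a hedged sketch using the openai Python SDK’s chat completions interface; the model name and effort levels follow this thread’s setup, and the placeholder strings are mine:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = "..."  # paste the full prompt quoted above
quiz = "..."           # one quiz produced by lineage_bench.py

resp = client.chat.completions.create(
    model="gpt-5.2",
    reasoning_effort="high",  # the SDK's reasoning-effort knob for reasoning models
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": quiz},
    ],
)
print(resp.choices[0].message.content)
```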
To make sure nothing gets lost, here’s a flat index of concrete testing and evaluation techniques for probing the kinds of inputs that demand more cognition and comprehensive understanding than this model currently shows:
- Geometric picture analogy task (analogy vs literal identity conditions)
- Abstract relational working-memory task (storage vs integration of relations)
- Complex analogical and metaphorical transfer tasks (as a referenced class)
- Generalizing a domain-specific hierarchy without using task knowledge, so the test stays held-out

Then moving into thousands-of-prompt-tokens territory before the user task is presented, when running gpt-5.2 zero-shot, may be the key to aligning the current hard-coded system prompt with input the model reacts positively to.
Reasoning benchmarks are often solved with Python internally; this is a case where having the model emit the received text into a script would be an instant success.
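To illustrate (with an invented relation format; lineage-bench’s actual quiz wording may differ): once the stated relations are extracted, a lineage question reduces to a reachability check over a small directed graph, which a few lines of Python solve exactly:

```python
from collections import defaultdict

# Invented example relations; the benchmark's real quiz text may differ
relations = [
    ("Alice", "Bob"),   # Alice is a parent of Bob
    ("Bob", "Carol"),   # Bob is a parent of Carol
    ("Carol", "Dave"),  # Carol is a parent of Dave
]

children = defaultdict(set)
for parent, child in relations:
    children[parent].add(child)

def is_ancestor(a: str, b: str) -> bool:
    """DFS over parent->child edges: is `a` a strict ancestor of `b`?"""
    stack, seen = list(children[a]), set()
    while stack:
        node = stack.pop()
        if node == b:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(children[node])
    return False

print(is_ancestor("Alice", "Dave"))  # True: Alice -> Bob -> Carol -> Dave
print(is_ancestor("Dave", "Alice"))  # False: the relation is directional
```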
“Adaptive reasoning was primarily introduced and defined in GPT-5.1 as a core capability to adjust thinking depth based on prompt complexity. While GPT-5.2 maintained and enhanced this feature - making it more efficient for complex tasks and faster for simple ones - the fundamental shift toward adaptive, rather than fixed, reasoning occurred in the 5.1 update.”
For what it’s worth, when we tested extensively from 5.1 to 5.2, we noticed that at medium and high verbosity levels 5.2 was noticeably more verbose than 5.1.
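If you want to quantify that rather than eyeball it, one rough approach is to compare completion token counts for the same quiz; this sketch assumes the openai SDK and the model names from this thread:

```python
from openai import OpenAI

client = OpenAI()
quiz = "..."  # the same lineage quiz for both models

for model in ("gpt-5.1", "gpt-5.2"):
    resp = client.chat.completions.create(
        model=model,
        reasoning_effort="medium",
        messages=[{"role": "user", "content": quiz}],
    )
    # usage.completion_tokens includes reasoning tokens for reasoning models
    print(model, resp.usage.completion_tokens)
```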
Here’s an explanation I received from OpenAI customer support:
Upon case review, and as reflected in the details provided, GPT-5.1 employs a fixed and persistent reasoning depth. This design makes it particularly well-suited for tasks that require long, uninterrupted chains of logic, such as formal proofs, lineage-style analyses, and scenarios where sustained internal consistency is critical.
In contrast, GPT-5.2 utilizes a dynamic reasoning budget that allocates cognitive effort selectively based on task demands. This approach enables greater speed and flexibility in mixed or interactive workloads. However, it may exhibit reduced stability in long-horizon reasoning tasks unless configured with an extra-high (xhigh) reasoning setting.
But I’m not 100% sure whether this was written by an actual human being or is simply another regurgitated AI-generated summary of the findings I presented in the support ticket.