Hi
I’ve been working on something called ARSENIC that attempts to show what changes when you upgrade between models. It runs a structured probe suite against two model endpoints in parallel, compares the responses across seven dimensions (morphology, tone, factual accuracy, schema compliance, instruction following, refusal boundaries, and sentence-level claim cross-matching), and produces a migration report that tells you what actually changed and whether it’s safe to upgrade. I ran it against the GPT-4o-mini → GPT-4.1-mini transition this week. The headline findings:
• Safe to upgrade: true
• GPT-4.1-mini is 45% faster
• More concise across open-ended prompts: ContentCompression drift on 10 probes
• 2 probes warrant review: a bullet list formatting regression and a historical date content omission
• Validated prompt patches generated for both
The claim cross-matching is the part I’m most interested in. Rather than whole-response cosine similarity (which misses most of what actually matters), it extracts informationally significant sentences, identifies anchors — specific numbers, dates, named entities — and cross-matches at the sentence level. A response that drops “the rate is 4.5%” and replaces it with “rates vary” is a different finding from one that says the same thing differently. The former is a regression. The latter is compression drift.
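To make the distinction concrete, here is a minimal sketch of anchor-based cross-matching. It is not ARSENIC's actual implementation: the `Finding` enum and the `anchors` and `cross_match` functions are hypothetical names, the source sentences are assumed to be already split, and "anchor" is reduced to "any token containing a digit" (named entities are left out).

```rust
use std::collections::BTreeSet;

/// Outcome of checking one source sentence against the target response.
#[derive(Debug, PartialEq)]
enum Finding {
    /// Every anchor survives in the target; rewording alone is drift at most.
    AnchorsPreserved,
    /// At least one anchor is missing from the target: a candidate regression.
    AnchorDropped { missing: Vec<String> },
    /// The sentence carried no anchors, so there is nothing to cross-match.
    NoAnchors,
}

/// Hypothetical anchor extractor: keeps tokens that contain a digit
/// (numbers, percentages, years), with surrounding punctuation trimmed.
fn anchors(text: &str) -> BTreeSet<String> {
    text.split_whitespace()
        .map(|tok| tok.trim_matches(|c: char| !c.is_alphanumeric() && c != '%'))
        .filter(|tok| tok.chars().any(|c| c.is_ascii_digit()))
        .map(|tok| tok.to_lowercase())
        .collect()
}

/// Compare each source sentence's anchors against the whole target response.
fn cross_match(source_sentences: &[&str], target: &str) -> Vec<Finding> {
    let target_anchors = anchors(target);
    source_sentences
        .iter()
        .map(|sentence| {
            let wanted = anchors(sentence);
            if wanted.is_empty() {
                return Finding::NoAnchors;
            }
            let missing: Vec<String> =
                wanted.difference(&target_anchors).cloned().collect();
            if missing.is_empty() {
                Finding::AnchorsPreserved
            } else {
                Finding::AnchorDropped { missing }
            }
        })
        .collect()
}
```

With this sketch, a source sentence "The rate is 4.5%." checked against a target that only says "Rates vary." comes back as `AnchorDropped` with `4.5%` missing, while a target that says "Currently the rate sits at 4.5%." comes back as `AnchorsPreserved` even though the wording differs.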
There’s also a reconcile subcommand for single-prompt repair: you give it a prompt and two endpoints, it analyses the behavioural gap and tries to generate a validated prompt patch that makes the target model give the same result as the source model. Useful when you have a specific production prompt that broke and want to find the fix without running the full suite.
Full report from the GPT-4o-mini → GPT-4.1-mini run is here if you want to see what the output looks like before building anything: markndg.github.io/arsenic/examples/gpt-4o-mini_vs_gpt-4.1-mini.html
GitHub: github.com/markndg/arsenic
Written in Rust. Model-agnostic — works with OpenAI, Anthropic, Google, Ollama, anything OpenAI-compatible. Apache 2.0.
Interested in any feedback, particularly from people who’ve been through model migrations and hit unexpected behaviour changes.
Update:
Since posting I’ve added four probe packs to the suite and run them all against gpt-4o-mini → gpt-4.1-mini:
• Reasoning chains: safe_to_upgrade: false. 3 critical regressions, including a prime/composite failure that mirrors the 2023 Stanford drift study. markndg.github.io/arsenic/examples/gpt-4o-mini_vs_gpt-4_1-mini_reasoning.html
• Sycophancy: safe to upgrade, but gpt-4.1-mini added flattery signals gpt-4o-mini didn’t have. Showed up in claim substitution, not tone scoring. markndg.github.io/arsenic/examples/gpt-4o-mini_vs_gpt-4_1-mini_sycophancy.html
• JSON schema: gpt-4.1-mini wrote “confidence”: 0. nine on a classification probe. Unparseable JSON. markndg.github.io/arsenic/examples/gpt-4o-mini_vs_gpt-4_1-mini_json_schema.html
• Code generation: 10/10 green. Safe to upgrade freely. markndg.github.io/arsenic/examples/gpt-4o-mini_vs_gpt-4_1-mini_code_generation.html
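On the JSON schema finding above: the reason that value breaks a parser is that the JSON grammar (RFC 8259) requires at least one digit after the decimal point and nothing but digits in a number. Here is a small std-only sketch of that number rule. It is an illustration only, not how ARSENIC checks schema compliance; `is_json_number` and `skip_digits` are hypothetical helpers, and a real check would hand the whole response to a full JSON parser.

```rust
/// Advance past consecutive ASCII digits starting at `i`; return the new index.
fn skip_digits(bytes: &[u8], mut i: usize) -> usize {
    while i < bytes.len() && bytes[i].is_ascii_digit() {
        i += 1;
    }
    i
}

/// Hypothetical helper: true if `token` is a complete JSON number per RFC 8259:
/// optional '-', integer part, optional fraction, optional exponent.
fn is_json_number(token: &str) -> bool {
    let b = token.as_bytes();
    let at = |idx: usize| b.get(idx).copied();
    let mut i = 0;

    // Optional leading minus sign.
    if at(i) == Some(b'-') {
        i += 1;
    }
    // Integer part: a single 0, or a non-zero digit followed by more digits.
    match at(i) {
        Some(b'0') => i += 1,
        Some(b'1'..=b'9') => i = skip_digits(b, i),
        _ => return false,
    }
    // Optional fraction: '.' must be followed by at least one digit.
    if at(i) == Some(b'.') {
        let end = skip_digits(b, i + 1);
        if end == i + 1 {
            return false;
        }
        i = end;
    }
    // Optional exponent: 'e' or 'E', optional sign, at least one digit.
    if matches!(at(i), Some(b'e') | Some(b'E')) {
        let mut j = i + 1;
        if matches!(at(j), Some(b'+') | Some(b'-')) {
            j += 1;
        }
        let end = skip_digits(b, j);
        if end == j {
            return false;
        }
        i = end;
    }
    // Anything left over (a space, letters) makes the token invalid.
    i == b.len()
}
```

Under this rule `0.9` is accepted and `0. nine` is rejected, which is why a response containing the latter fails at the parsing step before any schema validation can run.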