Wrong intent inference benchmark v0.8: full English 80-case result

Follow-up to my earlier post about wrong intent inference.

I published a v0.8 update for strict-intent-bench, a benchmark-first repo focused on cases where an assistant gives a plausible answer, but to the wrong implied question.

The new result is an English v0.3 full 80-case run.

| Metric | Baseline | Strict / Precision v13 |
| --- | --- | --- |
| Action accuracy | 38.8% | 66.2% |
| Wrong intent inference | 37.5% | 5.0% |
| Metadata unnecessary clarification | 15.0% | 11.2% |
| Missing needed clarification | 30.0% | 15.0% |
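
As a quick sanity check on the numbers above, here is a minimal sketch that converts each reported percentage back into a whole number of cases. It assumes every rate uses all 80 cases as its denominator, which the post does not state explicitly; the names `REPORTED`, `implied_count`, and `is_consistent` are made up for this illustration and are not part of the repo.

```python
TOTAL_CASES = 80  # from the post: full English 80-case run

# Reported rates in percent, copied from the table: (baseline, strict_v13).
REPORTED = {
    "action_accuracy": (38.8, 66.2),
    "wrong_intent_inference": (37.5, 5.0),
    "metadata_unnecessary_clarification": (15.0, 11.2),
    "missing_needed_clarification": (30.0, 15.0),
}


def implied_count(rate_percent, total=TOTAL_CASES):
    """Nearest whole number of cases that a reported percentage implies.

    Assumption: the rate is measured over all `total` cases.
    """
    return round(rate_percent * total / 100)


def is_consistent(rate_percent, total=TOTAL_CASES, tolerance=0.05):
    """True if some whole case count reproduces the rate to one decimal place."""
    exact = 100 * implied_count(rate_percent, total) / total
    return abs(exact - rate_percent) <= tolerance + 1e-9


if __name__ == "__main__":
    for metric, (baseline, strict_v13) in REPORTED.items():
        print(
            f"{metric}: baseline {implied_count(baseline)}/{TOTAL_CASES}, "
            f"v13 {implied_count(strict_v13)}/{TOTAL_CASES}"
        )
```

Under that assumption, every figure in the table maps to a whole number of cases, and wrong intent inference drops from 30 of 80 cases to 4 of 80.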

The current best intervention is Strict / Precision v13.

The main result is not “strict prompting solves this.” It does not. The stronger claim is narrower:

A stricter action-selection policy can substantially reduce wrong intent inference while still exposing a real clarification trade-off.

I also tested later prompt variants. Some improved targeted failure cases but regressed on broader checks, so v13 remains the current public champion.
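
The "keep the champion unless a variant wins without regressing" rule can be sketched roughly as below. This is an illustration only, not the repo's actual decision logic: it assumes the broader checks are the four headline metrics from the table, that accuracy should go up while the three error rates should go down, and the function names `regressions` and `should_promote` are hypothetical.

```python
# Assumed direction of improvement for each headline metric.
HIGHER_IS_BETTER = {
    "action_accuracy": True,
    "wrong_intent_inference": False,
    "metadata_unnecessary_clarification": False,
    "missing_needed_clarification": False,
}

# Percentages copied from the results table in this post.
BASELINE = {
    "action_accuracy": 38.8,
    "wrong_intent_inference": 37.5,
    "metadata_unnecessary_clarification": 15.0,
    "missing_needed_clarification": 30.0,
}
STRICT_V13 = {
    "action_accuracy": 66.2,
    "wrong_intent_inference": 5.0,
    "metadata_unnecessary_clarification": 11.2,
    "missing_needed_clarification": 15.0,
}


def regressions(champion, challenger):
    """List the metrics on which the challenger is worse than the champion."""
    worse = []
    for metric, higher_is_better in HIGHER_IS_BETTER.items():
        delta = challenger[metric] - champion[metric]
        if not higher_is_better:
            delta = -delta  # for error rates, a lower value is an improvement
        if delta < 0:
            worse.append(metric)
    return worse


def should_promote(champion, challenger):
    """Promote only if nothing regresses and at least one metric changes."""
    if regressions(champion, challenger):
        return False
    return any(challenger[m] != champion[m] for m in HIGHER_IS_BETTER)


if __name__ == "__main__":
    print(should_promote(BASELINE, STRICT_V13))
    print(regressions(STRICT_V13, BASELINE))
```

With the table's numbers, v13 improves on the baseline on all four metrics, so it passes this gate. A later variant that fixes a targeted failure but loses ground on any one of these metrics would not.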

Main remaining weak spots:

The repo includes the benchmark data, prompt variants, evaluation runner, full reports, and the v0.8 decision summary.

Repo:

Live demo: