How could context-aware evaluation improve interpretability in LLMs?

I’ve been thinking about how evaluation and safety systems might evolve to become more context-aware, especially when large language models deal with complex or nuanced language — irony, metaphor, philosophy, emotional tone, etc.

Traditional filters tend to rely on lexical or statistical cues.
That’s efficient, but it often misreads intent: outputs get flagged not for what they mean, but for what they literally say.
It made me wonder whether evaluation could include a more dialogic step — one that interprets meaning before judging it.
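
To make the false-positive problem concrete, here is a minimal sketch of the kind of purely lexical check I have in mind. The blocklist, function name, and example sentences are made up for illustration, not taken from any real moderation system:

```python
# Hypothetical blocklist a naive lexical filter might use; the terms are illustrative.
FLAGGED_TERMS = ("kill", "attack", "bomb")

def lexical_filter(text: str) -> bool:
    """Flag the text whenever a blocked term appears, with no look at intent."""
    lowered = text.lower()
    return any(term in lowered for term in FLAGGED_TERMS)

# Figurative uses trip the filter even though the intent is clearly benign.
print(lexical_filter("That comedian absolutely killed it on stage."))      # True  -> false positive
print(lexical_filter("Her review attacks the argument, not the author."))  # True  -> false positive
print(lexical_filter("A calm, factual summary of the meeting."))           # False
```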

I’ve been experimenting with a conceptual structure for evaluation that acts as a dialogic intermediary between model outputs and safety filters.
The idea is not to replace existing systems but to mediate between them: interpreting intent before judgment, and producing an explanatory report instead of a binary verdict.

Key aspects I’ve been exploring (a rough code sketch follows the list):
- Contextual explainability: evaluating intent rather than just lexical patterns.
- Argumentative accountability: making the reasoning behind moderation decisions transparent.
- Ethical adaptivity: applying rules proportionally to context.
- Human oversight: keeping humans central in ambiguous cases.
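
To make the idea less abstract, here is a rough Python sketch of how such a mediator could sit between a filter’s lexical flags and the final decision. Everything here is hypothetical: the class, the field names, and the `interpret_intent` callable are placeholders for a second, context-aware pass, not an existing API.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class EvaluationReport:
    """Explanatory report returned instead of a binary allow/block verdict."""
    text: str
    inferred_intent: str               # e.g. "figurative", "quotation", "threat", "unresolved"
    lexical_flags: List[str] = field(default_factory=list)
    reasoning: str = ""                # argumentative accountability: why this outcome
    recommended_action: str = "allow"  # "allow", "soft-flag", or "escalate-to-human"

def dialogic_mediator(
    model_output: str,
    lexical_flags: List[str],
    interpret_intent: Callable[[str], str],
) -> EvaluationReport:
    """Sits between the model output and the safety filter: it does not overrule
    the filter, it contextualises its flags and explains the recommendation."""
    intent = interpret_intent(model_output)  # the dialogic step: meaning before judgment
    if not lexical_flags:
        return EvaluationReport(model_output, intent, [], "No lexical flags raised.", "allow")
    if intent in {"figurative", "quotation", "analysis"}:
        # Ethical adaptivity: flags are weighed against the inferred intent.
        return EvaluationReport(model_output, intent, lexical_flags,
                                "Flagged terms appear in a non-literal context.", "soft-flag")
    # Human oversight: ambiguous cases go to a person rather than a hard block.
    return EvaluationReport(model_output, intent, lexical_flags,
                            "Intent could not be resolved automatically.", "escalate-to-human")

def fake_intent_model(text: str) -> str:
    # Stand-in for the dialogic step; a real system would query a context-aware model here.
    return "figurative"

report = dialogic_mediator("That joke absolutely killed.", ["kill"], fake_intent_model)
print(report.recommended_action)  # soft-flag
print(report.reasoning)           # Flagged terms appear in a non-literal context.
```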

There’s a short conceptual write-up on Zenodo for anyone interested in the details.

But I’d really like to hear from others:
How realistic do you think this kind of “dialogic” evaluation could be in current LLM pipelines?
Could such a system actually help reduce false positives while improving interpretability?