Approaches for monitoring quality of reasoning capabilities in production

Hi -

I am wondering how other developers are approaching the monitoring of models’ reasoning capabilities and the detection of degradation in a live production environment?

Context: I’ve built an application that ingests certain types of documents, automatically performs a series of analyses over them for 15–25 minutes, and then returns a structured report with the analysis outcomes for review by a human. There are already all sorts of technical controls built in, as well as various output validation techniques.

My remaining concern is transient degradations in the models’ reasoning capabilities that could impair analysis quality and, in the worst case, go undetected by the validation controls.

One idea I have is to put in place independent periodic performance checks: present the model with a set of advanced reasoning questions with known outcomes, automatically evaluate and score its responses against them, and, in the event of material deviations, temporarily place the application on hold until performance has been restored to a normal level.
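For what it's worth, here is a minimal sketch of what such a periodic "canary" check could look like. Everything here is illustrative: `CanaryItem`, `run_canary`, the `ask_model` callable, and the pass-rate threshold are all hypothetical names and values, and exact-match scoring is a deliberate simplification (real reasoning evals usually need fuzzier grading).

```python
# Sketch of a periodic "canary" reasoning check. All names and
# thresholds are hypothetical; `ask_model` stands in for your model client.
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class CanaryItem:
    prompt: str
    expected: str  # known-good answer to score against


def run_canary(
    items: Sequence[CanaryItem],
    ask_model: Callable[[str], str],
    min_pass_rate: float = 0.9,
) -> bool:
    """Return True if the model clears the pass-rate gate.

    A False result would be the signal to pause the pipeline until a
    later check passes again.
    """
    passed = sum(
        1
        for item in items
        if ask_model(item.prompt).strip().lower() == item.expected.strip().lower()
    )
    return passed / len(items) >= min_pass_rate


# Usage with a stub "model" (for illustration only):
items = [
    CanaryItem("What is 17 * 3?", "51"),
    CanaryItem(
        "If all bloops are razzies and all razzies are lazzies, "
        "are all bloops lazzies? Answer yes or no.",
        "yes",
    ),
]
healthy = run_canary(
    items, ask_model=lambda p: {"What is 17 * 3?": "51"}.get(p, "yes")
)
```

The key design point is that the gate returns a single boolean your scheduler can act on, so the hold/resume decision stays mechanical rather than judgment-based.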

Would be grateful to hear some additional perspectives on how others have approached, or would approach, this.



I think that approach is sound, with one caveat: the periodic performance checks should test for a minimum standard, not a fixed one, i.e. you should allow for the model to improve as well as degrade.

OpenAI has a tool in its Evals framework that may be of use for this; it can be found here:


Thank you! That point makes a lot of sense and I will definitely look into the referenced tool.