Hi -
I am wondering how other developers approach monitoring a model’s reasoning capability and detecting degradation in a live production environment?
Context: I’ve built an application that ingests certain types of documents as input, automatically performs a series of analyses over them for 15-25 minutes, and then returns a structured report with the analysis outcomes for review by a human. There are already all sorts of technical controls built in, as well as various output validation techniques.
My remaining concern is transient degradations in the model’s reasoning capability that could impair analysis quality and, in the worst case, go undetected by the validation controls.
One idea I have is to put in place independent periodic performance checks: present the model with a set of advanced reasoning questions with known outcomes, and automatically evaluate/score its responses against them. In the event of material deviations, the application would be temporarily placed on hold until performance is restored to a normal level.
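To make the idea concrete, here is a minimal sketch of such a periodic "canary" check in Python. Everything here is illustrative and hypothetical: the model is assumed to be any callable that maps a prompt string to a response string, the exact-match scoring is a placeholder for whatever evaluation you actually use, and names like `CanaryCase` and `run_canary_suite` are made up for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class CanaryCase:
    """A reasoning question with a known-good answer."""
    prompt: str
    expected: str

def run_canary_suite(model: Callable[[str], str],
                     cases: List[CanaryCase],
                     pass_threshold: float = 0.9) -> Tuple[float, bool]:
    """Score the model on known-answer cases.

    Returns (score, healthy); 'healthy' is False on a material deviation,
    i.e. when the fraction of correct answers drops below pass_threshold.
    Exact string matching stands in for a real evaluator here.
    """
    passed = sum(1 for c in cases if model(c.prompt).strip() == c.expected)
    score = passed / len(cases)
    return score, score >= pass_threshold

class PipelineBreaker:
    """Places the application on hold while canary checks are failing."""
    def __init__(self) -> None:
        self.on_hold = False

    def update(self, healthy: bool) -> None:
        self.on_hold = not healthy

# Example usage with stub "models" standing in for real API calls:
cases = [CanaryCase("What is 17 * 3?", "51"),
         CanaryCase("If all A are B and x is A, is x B?", "yes")]

healthy_model = lambda p: "51" if "17" in p else "yes"
degraded_model = lambda p: "51" if "17" in p else "no"

breaker = PipelineBreaker()
score, ok = run_canary_suite(healthy_model, cases)
breaker.update(ok)   # breaker.on_hold stays False

score, ok = run_canary_suite(degraded_model, cases)
breaker.update(ok)   # score 0.5 < 0.9, so breaker.on_hold becomes True
```

In a real deployment this would run on a schedule (e.g. cron or a background task), score with something more robust than exact matching, and compare against a rolling baseline rather than a fixed threshold, so normal response variation doesn't trip the breaker.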
Would be grateful to hear some additional perspectives on how others have approached, or would approach, this.
Thanks!