Approaches for monitoring quality of reasoning capabilities in production

Hi -

I am wondering how other developers approach monitoring models’ reasoning capabilities and detecting degradation in a live production environment.

Context: I’ve built an application that ingests certain types of documents, automatically performs a series of analyses on them over 15-25 minutes, and then returns a structured report with the analysis outcomes for review by a human. There are already all sorts of technical controls built in, as well as various output validation techniques.

My remaining concern is transient degradations in models’ reasoning capabilities that could impair analysis quality and, in the worst case, remain undetected by the validation controls.

One idea I have is to put in place independent periodic performance checks: present the model with a set of advanced reasoning questions with known outcomes, against which its responses would be automatically evaluated and scored. In the event of a material deviation, the application would be temporarily placed on hold until performance has returned to a normal level.
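For illustration, a minimal sketch of what I have in mind (the question set, `ask_model`, and the intake/alerting hooks are all hypothetical placeholders, not parts of the real application):

```python
import time

# Hypothetical canary suite: reasoning questions with known outcomes.
# In practice these would mirror the kinds of inference steps the
# document-analysis pipeline actually depends on.
CANARY_QUESTIONS = [
    {"prompt": "If all A are B and all B are C, are all A necessarily C? Answer yes or no.",
     "expected": "yes"},
    {"prompt": "Is a report dated 2021-03-15 older than one dated 2020-11-02? Answer yes or no.",
     "expected": "no"},
]

def ask_model(prompt: str) -> str:
    """Placeholder for the application's actual model call."""
    raise NotImplementedError

def pause_document_intake() -> None:
    """Placeholder hook: stop accepting new documents."""

def resume_document_intake() -> None:
    """Placeholder hook: resume normal operation."""

def alert_operators(score: float) -> None:
    """Placeholder hook: notify a human that the check failed."""

def run_canary_suite() -> float:
    """Return the fraction of canary questions answered correctly."""
    correct = sum(
        1 for item in CANARY_QUESTIONS
        if ask_model(item["prompt"]).strip().lower().startswith(item["expected"])
    )
    return correct / len(CANARY_QUESTIONS)

def monitor(check_interval_s: int = 900, baseline: float = 1.0, tolerance: float = 0.1) -> None:
    """Periodically score the model; place the app on hold on a material deviation."""
    while True:
        score = run_canary_suite()
        if baseline - score > tolerance:
            pause_document_intake()
            alert_operators(score)
        else:
            resume_document_intake()
        time.sleep(check_interval_s)
```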

I’d be grateful for additional perspectives on how others have approached, or would approach, this.

Thanks!


I think that approach is sound, but with one caveat: the periodic performance checks should be constructed to test for a minimum standard, not a fixed one, i.e. you should allow for the model to improve as well as degrade.
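To illustrate the distinction (a minimal sketch, reusing the hypothetical scoring function from the post above): gate on a floor rather than on deviation from a fixed baseline, so a model that improves never trips the hold.

```python
MIN_ACCEPTABLE_SCORE = 0.9  # the minimum standard; chosen per application

def should_hold_pipeline(score: float) -> bool:
    # Hold only when performance drops below the floor; scores above any
    # historical baseline (i.e. an improved model) never trigger a hold.
    return score < MIN_ACCEPTABLE_SCORE
```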

OpenAI has a tool in the Evals framework that may be of use for this; it can be found here:


Thank you! That point makes a lot of sense and I will definitely look into the referenced tool.
