This is very task-based, and heavily based on existing artifacts logged. Like the idea of “distillation” which was discarded, this has very little application to real-world chat and must have very narrow task focus to be employed.
Variables is important, or templating: it is where the changing part is inserted.
Then “judging”. There’s different graders you can write, from text similarity scoring to asking an AI to score. The ultimate output is a boolean or threshold.
But yes, you are asking an AI if an AI answered well. Like asking a schoolchild if another schoolchild answered about particle physics well.
It is a surface that upon investigation, immediately informs that you’d write your own implementation instead of investing learning time, besides that there are limited “smart” models to either generate or judge in a semi-deterministic fashion.