Yeah, it’s definitely not perfect, as _j points out:
which are legitimate flaws. However, what I really appreciate about this paper is that it provides a great starting point for more in-depth and comprehensive assessments, and could help generate richer qualitative data for training.
The biggest flaws, I think, are actually pretty fixable by resolving just this one:
human CS students as graders and not diverse domain experts
Looking into the paper, I could definitely tell this was not developed alongside domain experts in these fields of knowledge (ahem, linguists).
But what if it were tweaked to handle exactly that?
I could easily pick a more selective set of linguistic “skills” and vocabulary terms for a more honed-in assessment, one that could be graded by actual experts and used to better gauge language comprehension and reasoning. This could also eventually carry over to the jargon of other specialized domains as we get a better grasp of how these models appear to understand things.
As we saw with AlphaGeometry, training a model for a particular skill (using qualitative data produced by domain experts) lets it extrapolate that skill to problems/queries it hasn’t seen before. With this kind of evaluation technique, one could both assess and fine-tune for particular skills in domains one knows well but where training data is harder to produce, potentially improving both the model’s comprehension and its usefulness.
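To make that a bit more concrete, here’s a minimal sketch of what a narrow, expert-gradable linguistics rubric could look like as structured data. Everything in it is hypothetical: the SkillItem structure, the example items, and the expert_score helper are illustrations of the idea, not anything taken from the paper.

```python
# Purely hypothetical sketch of a narrow, expert-gradable linguistics rubric.
# SkillItem, LINGUISTICS_RUBRIC, and expert_score are made-up names for illustration.
from dataclasses import dataclass

@dataclass
class SkillItem:
    skill: str          # the linguistic skill being probed
    prompt: str         # what the model is asked to do
    grading_notes: str  # what a domain expert should look for

LINGUISTICS_RUBRIC = [
    SkillItem(
        skill="anaphora resolution",
        prompt="In 'The trophy didn't fit in the suitcase because it was too big,' what does 'it' refer to, and why?",
        grading_notes="Check that the referent is justified from the sentence, not just guessed.",
    ),
    SkillItem(
        skill="scope ambiguity",
        prompt="Give both readings of 'Every student read a book.'",
        grading_notes="Check that the wide-scope and narrow-scope readings are clearly distinguished.",
    ),
]

def expert_score(item: SkillItem, model_answer: str, grade: int) -> dict:
    """Record a domain expert's 0-5 grade for one rubric item."""
    return {"skill": item.skill, "answer": model_answer, "grade": grade}
```

The point is just that narrowing the skill set makes each item something a linguist can grade directly, which is exactly the fix to the grader problem quoted above.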
Effort to provide some further evaluation can be admired, but the target should be efficacy in obedience and problem-solving.
I think there’s a big jump here, and it helps explain why I find this kind of research, while imperfect, important.
Obedience and problem-solving can’t really be brought into question until we have reasonable evidence that an LLM can properly interpret the request we want it to obey, and that this interpretation (or emulation thereof) can be improved to the point where we can rule it out as the fundamental obstacle when the model disobeys or fails to solve a problem correctly.
Because it’s not: Utterance → Obey → Solve Problem,
It’s: Utterance → Interpret Request → Obey → Solve Problem
Having better evaluation approaches like this one lets us see exactly what the model is lacking, and how we might construct ways to improve it.
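To illustrate why that split matters, here’s a rough, purely hypothetical sketch of an evaluation harness that scores interpretation separately from the final answer, so a failure can be attributed to the right stage. The model object, its generate method, and the grading callbacks are all assumptions made up for the example, not a real API.

```python
# Hypothetical sketch: grade interpretation and problem-solving as separate stages,
# so a wrong answer can be traced to the stage that actually failed.
# The `model.generate` interface and the grading callbacks are assumed, not real APIs.

def interpret(model, utterance: str) -> str:
    """Ask the model to restate the request in its own words."""
    return model.generate(f"Restate this request precisely, without answering it: {utterance}")

def solve(model, utterance: str) -> str:
    """Ask the model to actually carry out the request."""
    return model.generate(utterance)

def evaluate(model, utterance: str, grade_interpretation, grade_solution) -> dict:
    restated = interpret(model, utterance)
    interpreted_ok = grade_interpretation(utterance, restated)  # expert or rubric check
    answer = solve(model, utterance)
    solved_ok = grade_solution(utterance, answer)
    # If interpretation already failed, a wrong answer says little about "obedience".
    return {"interpreted": interpreted_ok, "solved": solved_ok, "answer": answer}
```

Only once the interpretation column is consistently passing does it make sense to blame the later stages.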
If that’s partly what was meant, then my apologies. But this is my thinking on it, at least, which mostly boils down to this lol: