Interesting Research: Using Bipartite Graphs to prove GPT-4 can Understand Text

I came across this article on my feed today, and I thought our community would be pretty intrigued by this.

Extremely fascinating research, and it seems to help prove that yes, GPT-4 is more adept at language understanding than most people realize. That's something I've been saying for quite some time now. The nifty part is that these researchers were able to use math to measure the closed-source models by proxy and reach the conclusion that some of us power users have long noticed.

The theory in this paper was developed by a researcher from Princeton and a research scientist at Google DeepMind, which makes this even more interesting.

So now I have research I can finally point to instead of just claiming “experience” when using these systems :slightly_smiling_face:

(This may also be giving me a lot of ideas to conduct my own research)


Here’s the paper without meandering fluff and irrelevant links:

This work introduces SKILL-MIX, a new evaluation to measure the ability to combine skills. Using a list of N skills, the evaluator repeatedly picks random subsets of k skills and asks the LLM to produce text combining that subset of skills. Since the number of subsets grows like N^k, for even modest k this evaluation will, with high probability, require the LLM to produce text significantly different from any text in the training set. The paper develops a methodology for (a) designing and administering such an evaluation, and (b) automatic grading (plus spot-checking by humans) of the results, using GPT-4 as well as the open LLaMA-2 70B model.
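For intuition, the sampling procedure described above can be sketched like this. The skill list, topic, and prompt wording here are made up for illustration; the paper uses a large curated skill set, named topics, and specific grading rubrics:

```python
import math
import random

# Hypothetical skill list; the paper's actual set is much larger and curated.
skills = ["metaphor", "red herring", "spatial reasoning", "irony",
          "self-serving bias", "statistical syllogism", "folk etymology"]

def sample_skill_mix(skills, k, topic, rng=random):
    """Pick a random k-subset of skills and build an evaluation prompt."""
    subset = rng.sample(skills, k)
    prompt = (f"Produce a short piece of text about {topic} that "
              f"naturally combines the following skills: {', '.join(subset)}.")
    return subset, prompt

# The number of distinct k-subsets is C(N, k), which grows like N^k / k!.
N, k = len(skills), 3
print(math.comb(N, k))  # 35 even for this toy 7-skill list

subset, prompt = sample_skill_mix(skills, k, topic="sewing")
print(prompt)
```

Each sampled prompt is then sent to the model under test, and the response is graded (by GPT-4 or human spot-checkers in the paper) on whether every skill in the subset actually appears.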

Flaw 1: GPT-4 grades itself.
Flaw 2: human CS students as graders and not diverse domain experts.
Flaw 3: assuming that GPT-4 isn’t indeed a highly-supervised parrot of millions and millions of tuned instructions.
Flaw 4: thinking that “insert turns of phrase” into responses has value.

Summarized “skill” evaluation being done: Talk like a pirate while speaking ironically.

Effort to provide some further evaluation can be admired, but the target should be efficacy in obedience and problem-solving.


Interesting approach.

Skimmed the paper and they are using combinatorics to make a compelling argument that what they are asking of the LLM wasn’t present in the training data, and therefore must have “emerged” from within the model.
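To make that counting argument concrete, here is a toy version of it. The values of N (skills) and T (topics) below are hypothetical and not taken from the paper:

```python
import math

# Toy version of the paper's counting argument. With N skills and T topics,
# the number of distinct (k-skill-subset, topic) tasks is C(N, k) * T.
# Even modest N, T, and k push this past the size of any plausible corpus
# of relevant training examples, so most sampled tasks cannot simply have
# been memorized from training data.
N, T = 100, 100
for k in range(1, 6):
    tasks = math.comb(N, k) * T
    print(f"k={k}: {tasks:,} distinct tasks")
```

At k=5 with these toy numbers there are already billions of distinct tasks, which is the statistical backbone of the "couldn't have been in the training set" claim.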

Maybe this is a good way to benchmark LLMs beyond standard “trainable knowledge” benchmarks, and get into things that cannot be trained for, at least from a statistical perspective.

Right now the best way is to use human evals and subjective opinions. But this may light the way to a systematic approach to benchmarking models, beyond human opinions.

Spoiler alert, GPT-4 is still king. :rofl:


Yeah, it’s definitely not perfect, as _j points out:

which are legitimate flaws. However, what I really appreciate about this paper is that it does provide a great starting point towards more in-depth and comprehensive assessments, and could help provide more qualitative data for training.

The biggest flaws I think are actually pretty fixable by just resolving this:

human CS students as graders and not diverse domain experts

Looking into the paper, I could definitely tell this was not developed alongside domain experts in these fields of knowledge, ahem, linguistic experts.

But what if it was tweaked to handle exactly that?

I could easily pick a more selective set of linguistic “skills” and vocabulary terms that encompasses a more honed-in assessment that could be graded by actual experts and used to better determine language comprehension and reasoning. This could also eventually translate towards other jargon in other specified domains as we get a better grasp of how these models appear to understand things.

As we saw with AlphaGeometry as well, training a model for a particular skill (using qualitative data made by domain experts) allows it to extrapolate that skill to problems and queries it hasn’t seen before. With this particular evaluation technique, one could both assess and fine-tune for particular skills in domains one knows well that are otherwise hard to build training data for, potentially improving both the model’s comprehension and its usefulness.

Effort to provide some further evaluation can be admired, but the target should be efficacy in obedience and problem-solving.

I think there’s a big jump here that helps me to explain why I find this kind of research, while imperfect, important.

Obedience and problem-solving cannot be brought into question until we have reasonable evidence that LLMs can properly interpret the request we want them to obey, and until this interpretation (or emulation thereof) can be strengthened to the point where we can rule it out as a fundamental reason a model might disobey or fail to solve a problem correctly.

Because it’s not: Utterance → Obey → Solve Problem,
It’s: Utterance → Interpret request → Obey → Solve Problem

Having better evaluation approaches like this one allows us to see what exactly it is lacking, and how we could construct ways to improve it.

If that’s partly what was meant, then my apologies. But this is my thinking on this at least. Which mostly boils down to this lol: