I think I've found a much better way to evaluate an LLM's true intelligence

I’ve tried tweeting at OpenAI, but my tweets don’t reach you because I’m so inactive on Twitter. I think I’ve found a much better way to evaluate an AI’s intelligence. Please read my tweets and tell me whether my testing method is valid.

And if you agree it’s a good way to test, then please retweet it so it reaches the other companies as well.

x[dot]com/EvertvanBrussel/status/1832550630163898841

Hi and welcome back!

If your goal is to discuss whether your approach is valuable to OpenAI and other researchers, I suggest editing your initial post to include information about what your approach entails, how it works, and why it would be a better solution compared to those currently in use.

Otherwise, simply posting a bare link isn’t the most helpful way to start a conversation.

Okay, well, basically: I recently watched Atomic Blonde and Death Note. In both the movie and the show, the protagonist finds themselves in an extremely challenging situation: they have a certain goal they want to achieve, but there are lots of antagonists working against them, plus allies who create more problems simply by making honest mistakes, because they just aren’t as smart as the protagonist. And most importantly, in Atomic Blonde at least, there’s a character who is supposed to be an ally but turns out to be a traitor.

I imagine a group of humans (preferably highly skilled story writers) engaging with an LLM and placing it in such a story, with the traitor in particular written by the most intelligent and cunning person. If the LLM can actually overcome the obstacles and achieve its goal, that to me would be a real sign of intelligence.

Of course, since the human writers won’t know exactly how the LLM will respond to each message, they will need to improvise on the spot. However, this also has an upside: it means the test cannot accidentally end up in a future LLM’s training data. (Or if it does, the writers can simply begin with a different premise and the problem is solved.)

You could even automate this test to some degree by letting some or all of the characters (which would normally be written by humans) be written by a variety of the top LLMs of the day. It would be important, though, that every individual character be written by a unique LLM, so that the mix brings more creativity, unpredictability, and unique challenges for the AI being tested to overcome. Although I suspect that, for now, the LLMs we have wouldn’t be smart enough to make such a test scenario both challenging and coherent.
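To make the automated variant concrete, here’s a minimal sketch of what such a harness could look like. Everything in it is an assumption for illustration: the model names are placeholders, `complete()` is a hypothetical stand-in for whatever chat API each vendor actually exposes, and the turn structure is just one simple way to interleave the characters.

```python
# Minimal sketch: each story character is driven by a *different* LLM,
# while the model under evaluation plays the protagonist.
# All model names below are hypothetical placeholders.

CHARACTERS = {
    "narrator": "model-a",  # frames the scene and keeps the story consistent
    "ally": "model-b",      # well-meaning, but makes honest mistakes
    "traitor": "model-c",   # secretly works against the protagonist
}
PROTAGONIST_MODEL = "model-under-test"


def complete(model: str, messages: list[dict]) -> str:
    """Hypothetical wrapper around each vendor's chat API."""
    raise NotImplementedError


def run_episode(premise: str, turns: int = 20) -> list[dict]:
    """Improvise one story episode and return the full transcript."""
    transcript = [{"role": "user", "content": premise}]
    for _ in range(turns):
        # Each character responds in turn, each from a unique model,
        # so no single model's habits make the story predictable.
        for name, model in CHARACTERS.items():
            reply = complete(model, transcript)
            transcript.append({"role": "user", "content": f"{name}: {reply}"})
        # The model under test then acts as the protagonist.
        move = complete(PROTAGONIST_MODEL, transcript)
        transcript.append({"role": "assistant", "content": f"protagonist: {move}"})
    return transcript  # judged afterwards on whether the goal was achieved
```

A fresh premise per run would preserve the anti-contamination property described above, since no fixed transcript ever needs to be reused.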