GPT-4 and the ARC Challenge

The ARC Challenge has been running for some years and has proven quite difficult for both ML and symbolic AI approaches (achieving only 1% and 29% correct, respectively, on a hidden test set). I ran an evaluation with GPT-4 on a public test set, which is easier than the hidden test set. In the hardest condition (a 0-shot prompt with no explanation), GPT-4 got 7% correct. That is respectable given how hard the task is and the fact that some fine-tuned models have achieved only that level of success (my own fine-tuned models have reached up to 17% on the public 100-item test set after 10 months of work). Humans achieve around 80% accuracy, so LLMs have a long way to go to reach that level of performance.
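For readers curious how a grid puzzle can even be given to a text-only model, here is a minimal sketch of one plausible way to serialize an ARC task into a single prompt. The exact prompt format used in my evaluation isn't shown here, so treat the layout, the file path, and the `grid_to_text`/`task_to_prompt` helpers as hypothetical; only the ARC JSON structure ("train" and "test" lists of input/output grids) and the standard OpenAI Python client calls are taken as given.

```python
# Hypothetical sketch: serialize one ARC task as plain text and ask GPT-4 for the
# test output grid. The prompt wording and file path are assumptions, not the
# author's actual setup.
import json
from openai import OpenAI  # assumes the official openai Python package is installed


def grid_to_text(grid):
    """Render a 2-D list of ints as rows of space-separated digits."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)


def task_to_prompt(task):
    """Build one prompt from the task's demonstration pairs and its first test input."""
    parts = []
    for i, pair in enumerate(task["train"], start=1):
        parts.append(f"Example {i} input:\n{grid_to_text(pair['input'])}")
        parts.append(f"Example {i} output:\n{grid_to_text(pair['output'])}")
    parts.append(f"Test input:\n{grid_to_text(task['test'][0]['input'])}")
    parts.append("Test output:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    # ARC tasks are JSON files with "train" and "test" lists of input/output grids.
    with open("arc_task.json") as f:  # hypothetical path to one public task
        task = json.load(f)

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": task_to_prompt(task)}],
    )
    print(response.choices[0].message.content)
```

Scoring then comes down to parsing the returned grid and comparing it cell-for-cell against the task's expected test output.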

If you want to see GPT-4's responses, scroll to the bottom of this Jupyter notebook for the output. Don't try to run the notebook, because it won't work. You'll see each problem followed by a visualization of the correct answer, GPT-4's response, and the response of our best model. It will look something like this.

We are working on a more complete evaluation with GPT-4 and other models, but I thought I would share this initial finding.


This is great! It’s fascinating to see which ones it succeeded on versus the ones I thought it would succeed at. For instance, I thought the following was going to be a piece of cake:

But the answers from the models were nowhere close.
Have you seen any improvements in the last 8 months?

I haven’t tried GPT-4 lately. The vision version did not do well on ARC in a few tests I tried. My personal models have advanced a lot (up to 22% on Kaggle, which is state of the art for a single solution).

What about the OpenAI o1 model? Has anyone tried it?

Yes, they just tried it. It scored 21% on the ARC public leaderboard, tying with Claude 3.5 Sonnet. The private leaderboard appears to be harder, and we still hold the lead with 46%.
