The ARC Challenge has been running for some years and has proven quite difficult for both ML and symbolic AI approaches (which achieve only 1% and 29% correct, respectively, on a hidden test set). I ran an evaluation with GPT-4 on a public test set, which is easier than the hidden test set. In the hardest condition (a 0-shot prompt with no explanation), GPT-4 got 7% correct. That is respectable given how hard the task is and the fact that some fine-tuned models have achieved only that level of success (my own fine-tuned models have reached up to 17% on the public 100-item test set after 10 months of work). Humans achieve around 80% accuracy, so LLMs still have a long way to go to reach that level of performance.
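To give a rough idea of what a 0-shot setup could look like, here is a minimal sketch, not the prompt or code I actually used: it assumes the standard ARC JSON task format ("train" and "test" lists of input/output grids of digits 0-9) and uses illustrative names (`grid_to_text`, `build_prompt`, `arc_task.json`), with the training pairs serialized as plain text and no explanation of the task appended.

```python
import json
from openai import OpenAI


def grid_to_text(grid):
    """Render a grid (list of rows of ints 0-9) as lines of space-separated digits."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)


def build_prompt(task):
    """Build a plain 0-shot prompt: training pairs followed by the test input."""
    parts = ["Infer the transformation from the examples and apply it to the test input."]
    for i, pair in enumerate(task["train"], 1):
        parts.append(f"Example {i} input:\n{grid_to_text(pair['input'])}")
        parts.append(f"Example {i} output:\n{grid_to_text(pair['output'])}")
    parts.append(f"Test input:\n{grid_to_text(task['test'][0]['input'])}")
    parts.append("Test output:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    # ARC tasks are JSON files with "train" and "test" lists of input/output grids.
    with open("arc_task.json") as f:  # placeholder path to a single task file
        task = json.load(f)

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": build_prompt(task)}],
        temperature=0,
    )
    print(response.choices[0].message.content)
```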
If you want to see GPT-4's responses, scroll to the bottom of this Jupyter notebook to see the output. Don't try to run the notebook; it won't work. You'll see each problem followed by a visualization of the correct answer, GPT-4's response, and the response of our best model. It will look something like this.
We are working on a more complete evaluation of GPT-4 and other models, but I thought I would share this initial finding.