GPT-4 and the ARC Challenge

The ARC Challenge has been running for some years and has proven quite difficult for both ML and symbolic AI approaches (achieving only 1% and 29% correct, respectively, on a hidden test set). I ran an evaluation with GPT-4 on a public test set, which is easier than the hidden test set. In the hardest condition (a 0-shot prompt with no explanation), GPT-4 got 7% correct. That is respectable given how hard the task is and the fact that some fine-tuned models have achieved only that level of success (my own fine-tuned models have reached up to 17% on the public 100-item test set after 10 months of work). Humans achieve around 80% accuracy, so LLMs have a long way to go to reach that level of performance.
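For readers curious how a grid puzzle can even be given to a text-only model, here is a minimal sketch of one plausible way to serialize an ARC task into a single prompt. The exact prompt format used in my evaluation isn't shown here, so treat the layout, the file path, and the `grid_to_text`/`task_to_prompt` helpers as hypothetical; only the ARC JSON structure ("train" and "test" lists of input/output grids) and the standard OpenAI Python client calls are taken as given.

```python
# Hypothetical sketch: serialize one ARC task as plain text and ask GPT-4 for the
# test output grid. The prompt wording and file path are assumptions, not the
# author's actual setup.
import json
from openai import OpenAI  # assumes the official openai Python package is installed


def grid_to_text(grid):
    """Render a 2-D list of ints as rows of space-separated digits."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)


def task_to_prompt(task):
    """Build one prompt from the task's demonstration pairs and its first test input."""
    parts = []
    for i, pair in enumerate(task["train"], start=1):
        parts.append(f"Example {i} input:\n{grid_to_text(pair['input'])}")
        parts.append(f"Example {i} output:\n{grid_to_text(pair['output'])}")
    parts.append(f"Test input:\n{grid_to_text(task['test'][0]['input'])}")
    parts.append("Test output:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    # ARC tasks are JSON files with "train" and "test" lists of input/output grids.
    with open("arc_task.json") as f:  # hypothetical path to one public task
        task = json.load(f)

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": task_to_prompt(task)}],
    )
    print(response.choices[0].message.content)
```

Scoring then comes down to parsing the returned grid and comparing it cell-for-cell against the task's expected test output.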

If you want to see GPT-4's responses, scroll to the bottom of this Jupyter notebook for the output. Don't try to run the notebook, because it won't work. You'll see each problem followed by a visualization of the correct answer, GPT-4's response, and the response of our best model. It will look something like this.

We are working on a more complete evaluation with GPT-4 and other models, but I thought I would share this initial finding.


This is great! It’s fascinating to see which ones it succeeded on versus the ones I thought it would succeed at. For instance, I thought the following was going to be a piece of cake:

But the answers from the models were nowhere close.
Have you seen any improvements in the last 8 months?

I haven’t tried GPT-4 lately. The vision version did not do well on ARC in a few tests I tried. My personal models have advanced a lot (up to 22% on Kaggle, which is state of the art for a single solution).

What about the OpenAI o1 model? Has anyone tried it?

Yes, they just tried it. It scored 21% on the ARC public leaderboard, tying with Claude 3.5 Sonnet. The private leaderboard appears to be harder, and we still hold the lead with 46%.
