Self-made benchmark of 50 really hard riddles (Results)

For anyone interested, I put together 50 really hard riddles to test the step-by-step reasoning of the models.

Scoring mechanism: GPT-4o graded each response from 0 to 9, based on correctness and logical integrity. It didn’t know which model a response came from, and it was given the correct answer to the riddle during scoring.
I checked the first few gradings myself and they look pretty fair, although I cannot rule out the possibility of GPT-4o slightly preferring certain answers.
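
For anyone who wants to reproduce this kind of setup, here is a minimal sketch of a blind grading loop. It is not the script used for the numbers below. The function name `blind_scores`, the `grade` callable (a stand-in for whatever calls the grader model), and the choice to report the per-model mean are all assumptions made for illustration.

```python
import random
from collections import defaultdict
from statistics import mean


def blind_scores(responses, reference_answers, grade, seed=0):
    """Grade responses without exposing the model name to the grader.

    responses: list of (model_name, riddle_id, answer_text) tuples.
    reference_answers: dict mapping riddle_id -> correct answer text.
    grade: hypothetical callable (answer_text, reference) -> score in 0..9.
    Returns a dict mapping model_name -> mean score (assumed aggregation).
    """
    items = list(responses)
    # Shuffle so the grading order carries no hint about the model.
    random.Random(seed).shuffle(items)

    per_model = defaultdict(list)
    for model, riddle_id, text in items:
        # Only the answer text and the reference reach the grader.
        score = grade(text, reference_answers[riddle_id])
        if not 0 <= score <= 9:
            raise ValueError(f"score {score} is outside the 0-9 scale")
        per_model[model].append(score)

    return {model: round(mean(scores), 2) for model, scores in per_model.items()}
```

With a toy grader such as `lambda text, ref: 9 if text == ref else 0`, this returns one averaged score per model. A real grader would send the riddle, the reference answer, and the anonymous response to the grading model and parse the number it returns.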

Rank Model Score
1. GPT-4o 7.67
2. Llama3-70b 6.39
3. Gemini-1.5-Pro 6.16
4. Gemini-1.5-Flash 5.86
5. GPT-3.5-Turbo 5.04
6. Llama3-8b 4.04

Here are some example riddles so you can get a feel for the benchmark:

Riddle:
Abigail, Oliver, Rosa, and Blake all attend the same summer camp, where they can cook, kayak, rock climb, and zip-line. Each child has a different favorite activity.

Abigail’s favorite activity isn’t rock climbing.
Oliver is afraid of heights.
Rosa can’t do her favorite activity without a harness.
Blake likes to keep his feet on the ground at all times.
Can you figure out who likes what?

Reasoning: At first, the only sure thing is that Blake likes to cook, because none of the other activities keep your feet ‘on the ground at all times’. No other first reasoning step is possible. The second step has to be Oliver, who is afraid of heights; with cooking gone, that means he likes kayaking. Abigail likes to zip-line, because she doesn’t like rock climbing, and Rosa likes to rock climb (process of elimination).
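
The reasoning above can be double-checked by brute force. This is a small sketch that tries every assignment of activities to kids and keeps the ones that satisfy all four clues. How each clue is encoded (for example, treating rock climbing and zip-lining as the "heights" and "harness" activities, and cooking as the only one done with feet on the ground) is my reading of the riddle, not part of the benchmark.

```python
from itertools import permutations

KIDS = ("Abigail", "Oliver", "Rosa", "Blake")
ACTIVITIES = ("cook", "kayak", "rock climb", "zip-line")
# Assumed encoding: these two involve heights and require a harness.
HEIGHTS = {"rock climb", "zip-line"}


def solve_camp():
    """Return every assignment of favorite activities that fits the clues."""
    solutions = []
    for perm in permutations(ACTIVITIES):
        fav = dict(zip(KIDS, perm))
        if fav["Abigail"] == "rock climb":   # Abigail's favorite isn't rock climbing
            continue
        if fav["Oliver"] in HEIGHTS:         # Oliver is afraid of heights
            continue
        if fav["Rosa"] not in HEIGHTS:       # Rosa needs a harness
            continue
        if fav["Blake"] != "cook":           # Blake keeps his feet on the ground
            continue
        solutions.append(fav)
    return solutions


print(solve_camp())
```

Exactly one assignment survives, and it matches the step-by-step answer: Blake cooks, Oliver kayaks, Abigail zip-lines, Rosa rock climbs.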

Riddle:
Daniel, Emily, Marciano, and Christina are all wearing solid-colored shirts. Their shirts are red, yellow, green, and blue. Only the person wearing blue tells the truth, while the other three lie. They make the following statements:

Daniel: ‘Marciano is wearing red.’
Emily: ‘Daniel is not wearing yellow.’
Marciano: ‘Emily is wearing blue.’
Christina: ‘I will wear blue tomorrow.’

Can you determine each person’s shirt color, and whether we can expect to see Christina in blue tomorrow?
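
If you want to check your own answer to this one, a brute-force sketch is below. It tries every shirt assignment together with both possibilities for Christina's shirt tomorrow, and keeps the cases where exactly the person in blue makes a true statement. The names `solve_shirts` and `blue_tomorrow` are made up for this illustration.

```python
from itertools import permutations

PEOPLE = ("Daniel", "Emily", "Marciano", "Christina")
COLORS = ("red", "yellow", "green", "blue")


def solve_shirts():
    """Return all (shirt assignment, blue_tomorrow) pairs consistent with the riddle."""
    solutions = []
    for perm in permutations(COLORS):
        shirt = dict(zip(PEOPLE, perm))
        for blue_tomorrow in (True, False):
            # Truth value of each person's statement under this scenario.
            claims = {
                "Daniel": shirt["Marciano"] == "red",
                "Emily": shirt["Daniel"] != "yellow",
                "Marciano": shirt["Emily"] == "blue",
                "Christina": blue_tomorrow,
            }
            # A statement is true exactly when its speaker wears blue.
            if all(claims[p] == (shirt[p] == "blue") for p in PEOPLE):
                solutions.append((shirt, blue_tomorrow))
    return solutions


for shirt, blue_tomorrow in solve_shirts():
    print(shirt, "Christina in blue tomorrow:", blue_tomorrow)
```

Only one scenario passes the check, so the riddle has a unique answer, including the part about tomorrow.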