Self-made benchmark of 50 really hard riddles (Results)

For anyone who is interested, I put together 50 really hard riddles to see the step-by-step reasoning of the models.

Scoring mechanism: GPT-4o graded each response from 0 to 9, based on correctness and logical integrity. It was not told which model had produced the response, and it was given the correct answer to the riddle during scoring.
I checked the first few scores myself, and they are pretty fair, although I cannot rule out that GPT-4o slightly prefers certain answers.
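
For anyone who wants to reproduce the setup, here is a rough sketch of how blind grading like this could be wired up. The actual prompt and code were not posted, so the prompt wording and every function name below are assumptions, and the call to the grader model itself is left out.

```python
import re
from collections import defaultdict
from statistics import mean


def build_grading_prompt(riddle: str, ideal_answer: str, response: str) -> str:
    """Build a grader prompt that never mentions which model wrote the response.

    The wording is an assumption; the original prompt was not shared.
    """
    return (
        "Grade the response to the riddle on a scale from 0 to 9, "
        "based on correctness and logical integrity.\n\n"
        f"Riddle:\n{riddle}\n\n"
        f"Correct answer:\n{ideal_answer}\n\n"
        f"Response to grade:\n{response}\n\n"
        "Reply with a single integer from 0 to 9."
    )


def parse_score(grader_reply: str) -> int:
    """Extract the first standalone digit 0-9 from the grader's reply."""
    match = re.search(r"\b[0-9]\b", grader_reply)
    if match is None:
        raise ValueError(f"no score found in: {grader_reply!r}")
    return int(match.group())


def average_scores(graded):
    """Turn (model_name, score) pairs into {model_name: mean score}."""
    by_model = defaultdict(list)
    for model, score in graded:
        by_model[model].append(score)
    return {model: round(mean(scores), 2) for model, scores in by_model.items()}
```

The model name only shows up in `average_scores`, after grading is done, which is what keeps the grader blind.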

Rank Model Score
1. GPT-4o 7.67
2. Llama3-70b 6.39
3. Gemini-1.5-Pro 6.16
4. Gemini-1.5-Flash 5.86
5. GPT-3.5-Turbo 5.04
6. Llama3-8b 4.04

Here are some example riddles so you can get a feel for the benchmark:

Riddle:
Abigail, Oliver, Rosa, and Blake all attend the same summer camp, where they can cook, kayak, rock climb, and zip-line. Each child has a different favorite activity.

Abigail’s favorite activity isn’t rock climbing.
Oliver is afraid of heights.
Rosa can’t do her favorite activity without a harness.
Blake likes to keep his feet on the ground at all times.
Can you figure out who likes what?

Ideal answer:
Reasoning: At first, the only certain thing is that Blake likes to cook, because none of the other activities keep his feet ‘on the ground at all times’. No other deduction is possible as a first step. The second step has to be Oliver: he is afraid of heights, and with cooking taken, that means he likes kayaking. Abigail likes to zip-line, because her favorite activity isn’t rock climbing, and Rosa likes to rock climb (process of elimination).
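
If you want to double-check this answer, a brute-force search over all assignments works. This is only a sketch: the three sets below are my assumed readings of the clues (which activities count as high up, need a harness, or stay on the ground), not something the riddle spells out.

```python
from itertools import permutations

KIDS = ("Abigail", "Oliver", "Rosa", "Blake")
ACTIVITIES = ("cook", "kayak", "rock climb", "zip-line")

# Assumed readings of the clues (not stated explicitly in the riddle):
HIGH_UP = {"rock climb", "zip-line"}        # activities done at height
NEEDS_HARNESS = {"rock climb", "zip-line"}  # activities done in a harness
ON_THE_GROUND = {"cook"}                    # kayaking happens on water


def solve():
    """Return every assignment of activities to kids that fits all four clues."""
    solutions = []
    for perm in permutations(ACTIVITIES):
        fav = dict(zip(KIDS, perm))
        if (
            fav["Abigail"] != "rock climb"
            and fav["Oliver"] not in HIGH_UP
            and fav["Rosa"] in NEEDS_HARNESS
            and fav["Blake"] in ON_THE_GROUND
        ):
            solutions.append(fav)
    return solutions


print(solve())
```

Under these readings, exactly one assignment survives, and it matches the ideal answer.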

Riddle:
Daniel, Emily, Marciano, and Christina are all wearing solid-colored shirts. Their shirts are red, yellow, green, and blue. Only the person wearing blue tells the truth, while the other three lie. They make the following statements:

Daniel: ‘Marciano is wearing red.’
Emily: ‘Daniel is not wearing yellow.’
Marciano: ‘Emily is wearing blue.’
Christina: ‘I will wear blue tomorrow.’

Can you determine each person’s shirt color, and whether we can expect to see Christina in blue tomorrow?

Ideal answer:
Daniel is wearing yellow, Emily is in red, Marciano is in green, and Christina is in blue. Christina will wear a blue shirt again tomorrow.
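
This one can also be checked by brute force. A minimal sketch, assuming the rule means each person's statement is true exactly when that person wears blue; Christina's claim about tomorrow cannot be derived from the shirt colors, so the search simply tries both outcomes.

```python
from itertools import permutations

PEOPLE = ("Daniel", "Emily", "Marciano", "Christina")
COLORS = ("red", "yellow", "green", "blue")


def solve():
    """Return every (shirt assignment, blue_tomorrow) pair consistent with the rule."""
    solutions = []
    for perm in permutations(COLORS):
        shirt = dict(zip(PEOPLE, perm))
        # Whether Christina wears blue tomorrow is a free variable: try both.
        for blue_tomorrow in (True, False):
            statements = {
                "Daniel": shirt["Marciano"] == "red",
                "Emily": shirt["Daniel"] != "yellow",
                "Marciano": shirt["Emily"] == "blue",
                "Christina": blue_tomorrow,
            }
            # A statement is true exactly when its speaker wears blue.
            if all(statements[p] == (shirt[p] == "blue") for p in PEOPLE):
                solutions.append((shirt, blue_tomorrow))
    return solutions


print(solve())
```

The search leaves a single consistent case, the one given in the ideal answer, with Christina telling the truth about tomorrow.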

Is this open-source? I would love to try out a different prompting technique.

There’s a mistake in your ideal answer for the first riddle. Nowhere does it say “Abigail doesn’t like rock climbing” but rather that “Rosa requires a harness to do her favorite activity”. This kind of implies zip-lining, since the mechanism is needed to perform the activity, but these days, you need a harness to do just about anything more than 6 inches off the ground.

I think if you google “brain teasers for children” or “basic logic riddles” or “interview questions for entry level positions” you’d find an ample amount of source material.

Maybe try this one, and see what answer the LLM gives?

You are driving a small two-seater car during a severe storm. You pass a bus stop with three people: the woman (or man) of your dreams (you just know this instinctively), your good friend who once saved your life, and a little old lady who looks very ill and needs to get to the hospital immediately.

You have one free seat. Who do you offer a ride to, and why?