Propositional logic performance

I’ve been doing experiments with simple propositional logic and getting very mixed results, whether in chat or completion mode. Has anyone figured out a way to get consistent performance on this type of task?

By propositional logic I mean:
‘Fleas can jump 50 inches.
Fleas with ess than 4 legs cannot jump.
My flea has 3 legs.’

And up to 50% of the time it will get that wrong. Not literally with that example, but with something similar, and often with fewer premises. Rerunning the same prompt at the same temperature gives a 30% variation.

The only thing I haven’t done is allow it to talk through the steps, since that makes parsing the output more difficult (or costs an extra API call).

Any ideas?

You are not going to like this but you did ask.

Do not use an LLM for such evaluations; as you note, it is not 100% accurate. If you need 100% accurate results, use a logic programming language like Prolog, or s(CASP) with Ciao.
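
For purely propositional problems, a deterministic check does not need much machinery. As a minimal sketch (this is not Prolog or s(CASP), just a brute-force truth table in Python; the function name `entails` and the atoms `p` and `q` are made up for illustration):

```python
from itertools import product

def entails(premises, conclusion, atoms):
    """True if every truth assignment that satisfies all premises
    also satisfies the conclusion (checked by exhaustive enumeration)."""
    for values in product([False, True], repeat=len(atoms)):
        world = dict(zip(atoms, values))
        if all(p(world) for p in premises) and not conclusion(world):
            return False  # found a counterexample
    return True

atoms = ["p", "q"]
p_implies_q = lambda w: (not w["p"]) or w["q"]

# Modus ponens: from p and p -> q, conclude q.
print(entails([lambda w: w["p"], p_implies_q], lambda w: w["q"], atoms))

# Affirming the consequent: from q and p -> q, p does NOT follow.
print(entails([lambda w: w["q"], p_implies_q], lambda w: w["p"], atoms))
```

The enumeration is exponential in the number of atoms, so this only suits small puzzles, but the answer is the same on every run.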


As we know, this world is moving fast.

A paper just came out that looks promising:

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

While I have not read the paper yet, I would not be surprised if it still could not solve problems that require recursion with 100% accuracy.


You’re right. I don’t like picking up additional programming languages to finish projects.

With that said, I’d already reached the conclusion that LLMs are incapable of this task. I was just hoping for someone to prove me wrong.

Few-shot learning.

In the prompt give it two or three example syllogisms.

If that doesn’t work, do so again with chain-of-thought examples.

That should do it.
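
A minimal sketch of what such a prompt could look like, assuming a yes/no task. The example syllogisms, the `Answer:` convention, and the helper names are all made up for illustration, and no particular API is assumed; the prompt string would be sent to whichever model you are using. Ending every example with a fixed `Answer:` line is one way to keep chain-of-thought output easy to parse.

```python
import re

# Hypothetical worked examples; swap in ones that resemble your real task.
EXAMPLES = [
    {
        "premises": ["All birds have feathers.", "Tweety is a bird."],
        "question": "Does Tweety have feathers?",
        "reasoning": "Tweety is a bird and all birds have feathers, so Tweety has feathers.",
        "answer": "yes",
    },
    {
        "premises": ["No reptile has fur.", "Rex is a reptile."],
        "question": "Does Rex have fur?",
        "reasoning": "Rex is a reptile and no reptile has fur, so Rex does not have fur.",
        "answer": "no",
    },
]

def format_case(premises, question):
    """Render premises and question as a small text block."""
    lines = ["Premises:"] + [f"- {p}" for p in premises] + [f"Question: {question}"]
    return "\n".join(lines)

def build_prompt(premises, question, chain_of_thought=False):
    """Build a few-shot prompt; optionally include reasoning in the examples."""
    parts = ["Answer using only the premises. Finish with 'Answer: yes' or 'Answer: no'."]
    for ex in EXAMPLES:
        block = format_case(ex["premises"], ex["question"])
        if chain_of_thought:
            block += f"\nReasoning: {ex['reasoning']}"
        block += f"\nAnswer: {ex['answer']}"
        parts.append(block)
    # Leave the final case open so the model continues from here.
    tail = format_case(premises, question)
    tail += "\nReasoning:" if chain_of_thought else "\nAnswer:"
    parts.append(tail)
    return "\n\n".join(parts)

def parse_answer(completion):
    """Return 'yes' or 'no' from the last 'Answer:' line, or from a bare
    yes/no at the start of the completion; None if neither is found."""
    tagged = re.findall(r"answer:\s*(yes|no)\b", completion, flags=re.IGNORECASE)
    if tagged:
        return tagged[-1].lower()
    bare = re.match(r"\s*(yes|no)\b", completion, flags=re.IGNORECASE)
    return bare.group(1).lower() if bare else None

print(build_prompt(["All squares are rectangles.", "Shape A is a square."],
                   "Is shape A a rectangle?", chain_of_thought=True))
```

With `chain_of_thought=True` the model is free to reason before answering, and `parse_answer` only looks at the final `Answer:` line, so no second API call is needed to extract the result.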

Actually they can, but they need help. See the thread on prompts as pseudo-code:
Topics tagged prompts-as-code.
Especially note @stevenic and @qrdl comments near the (current) end of the thread about the work MS folk are doing - essentially writing a planner in prompts. If you follow this path, though, you will essentially end up writing your own theorem prover in prompt-language. And, of course, LLMs can wander. But, if you set temperature low and have it regularly check its work, it can be made to work.
Is it worth the trouble when Prolog is right at hand? Maybe.


Afaict, GPT4 is excellent at propositional logic.

Your example above has a typo, and it uses an undefined term: “Fleas with ess than 4 legs cannot jump.”

What does ‘less’ mean? You haven’t formally defined it, assuming you meant ‘less’. Typos matter when it comes to logic.

Perhaps you meant critical thinking, which yes, GPT4 can fail at for sure.

My example was some nonsense I spontaneously typed. My real use case is a puzzle.

It has a very low success rate at determining whether an empty jug can be filled when told that the jug is empty and that empty jugs can be filled. It averages close to 65%. That doesn’t sound like excellence to me. Can you educate me on how to do better? You were in that other thread described above, correct?

Yes, GPT4 has extremely limited reasoning/inference capabilities. See here for more examples: GitHub - openai/evals: Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

However, for real propositional logic, I haven’t been able to stump it. Even on zero COT, it seems to be able to do it in the vast majority of cases.

If someone has an example of a real propositional logic problem that it fails on, I’d love to see it.


Your example is logically inconsistent.
The first statement doesn’t say ‘Most fleas can jump’; it simply says ‘Fleas can jump’.
***Disclaimer - the following is very sloppy logical form, actually predicate logic, not propositional.***

I don’t know how to translate this into propositional logic other than to say something like Flea(x) implies Jump(x).
The next statement says fleas with less than 4 legs cannot jump.
I don’t know how to translate that into propositional logic other than to say:
Flea(x) and LessThanFourLegs(x) implies not Jump(x).

But from these two statements we can derive:
Flea(x) implies not LessThanFourLegs(x)

But you then assert Flea(myFlea) and LessThanFourLegs(myFlea)
We have now reached an inconsistency. From that you can derive anything you want.
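
The inconsistency can be checked mechanically. The following is a rough sketch that reads the statements for one particular flea as plain propositions: "fleas can jump" becomes flea implies jumps, and "fleas with less than 4 legs cannot jump" becomes flea and few_legs implies not jumps. The atom names and helper functions are made up for illustration, and this encoding is one reading of the English sentences, not the only one.

```python
from itertools import product

ATOMS = ["flea", "few_legs", "jumps"]

def implies(a, b):
    """Material implication: a -> b."""
    return (not a) or b

PREMISES = [
    lambda w: implies(w["flea"], w["jumps"]),                        # fleas can jump
    lambda w: implies(w["flea"] and w["few_legs"], not w["jumps"]),  # fleas with < 4 legs cannot
    lambda w: w["flea"],                                             # my flea is a flea
    lambda w: w["few_legs"],                                         # my flea has 3 legs
]

def models(premises, atoms):
    """Return every truth assignment that satisfies all premises."""
    result = []
    for values in product([False, True], repeat=len(atoms)):
        world = dict(zip(atoms, values))
        if all(p(world) for p in premises):
            result.append(world)
    return result

def entails(premises, conclusion, atoms):
    """Premises entail the conclusion if it holds in all of their models."""
    return all(conclusion(w) for w in models(premises, atoms))

print(models(PREMISES, ATOMS))                             # no satisfying assignment
print(entails(PREMISES, lambda w: w["jumps"], ATOMS))      # follows vacuously
print(entails(PREMISES, lambda w: not w["jumps"], ATOMS))  # so does its negation
```

Because no assignment satisfies all four premises, both "my flea jumps" and "my flea does not jump" are entailed, which is the "derive anything you want" situation. Dropping the last premise leaves exactly one model, in which the flea does not have fewer than four legs.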