I’ve been doing experiments with simple propositional logic and getting very mixed results, whether in chat or completion mode. Has anyone figured out a way to get consistent performance on this type of task?

By propositional logic I mean:
‘Fleas can jump 50 inches.
Fleas with less than 4 legs cannot jump.
My flea has 3 legs.
Therefore,’

And up to 50% of the time it will get that wrong. Not literally using that example but something similar, and often with fewer premises. Rerunning the same prompt with the same temperature gives a 30% variation.

The only thing I haven’t done is allow it to talk through the steps, since that makes parsing the output more difficult (or costs an extra API call).
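On the parsing concern, one possible approach is to let the model reason freely but ask it to finish with a fixed marker line, then pull out only that line. The sketch below is in Python and assumes a hypothetical prompt convention where the completion ends with `ANSWER: yes` or `ANSWER: no`; the marker, the function name and the sample text are all made up for illustration.

```python
import re
from typing import Optional

# Hypothetical convention: the prompt tells the model to reason step by step
# and then end with a line such as "ANSWER: yes" or "ANSWER: no".
ANSWER_RE = re.compile(r"^\s*ANSWER:\s*(yes|no)\b", re.IGNORECASE | re.MULTILINE)

def extract_answer(completion: str) -> Optional[str]:
    """Return 'yes' or 'no' from the last ANSWER line, or None if there is none."""
    matches = ANSWER_RE.findall(completion)
    return matches[-1].lower() if matches else None

# Made-up completion text, standing in for a real model response.
sample = (
    "The rule says fleas with less than 4 legs cannot jump.\n"
    "My flea has 3 legs, so the rule applies.\n"
    "ANSWER: no\n"
)
print(extract_answer(sample))
```

Returning `None` when the marker is missing lets the caller decide whether to retry or discard that run, without a second API call just to parse the answer.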

Do not use an LLM for such evaluations; as you note, it is not 100% accurate. If you need 100% accurate results, then use a logic programming language like Prolog, or s(CASP) with Ciao.

Actually they can, but they need help. See the thread on prompts as pseudo-code (topics tagged prompts-as-code).
Especially note @stevenic and @qrdl comments near the (current) end of the thread about the work MS folk are doing - essentially writing a planner in prompts. If you follow this path, though, you will essentially end up writing your own theorem prover in prompt-language. And, of course, LLMs can wander. But, if you set temperature low and have it regularly check its work, it can be made to work.
Is it worth the trouble when Prolog is right at hand? Maybe.

My example was some nonsense I spontaneously typed. My real use case is a puzzle.

It has a very low success rate at determining whether an empty jug can be filled when told that the jug is empty and that empty jugs can be filled. It averages close to 65%. That doesn’t sound like excellence to me. Can you educate me on how to do better? You were in that other thread described above, correct?
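For comparison, this is roughly what the deterministic route suggested earlier looks like for the jug case. It is only a minimal forward-chaining sketch in Python, not Prolog, and the predicate names (`empty`, `can_be_filled`) and the rule format are assumptions made up for this illustration.

```python
# Facts are (predicate, subject) pairs; the names are illustrative only.
facts = {("empty", "jug")}

# Each rule reads: if premise(X) holds, conclude conclusion(X).
rules = [("empty", "can_be_filled")]

def forward_chain(facts, rules):
    """Apply single-premise rules repeatedly until nothing new is derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            for pred, subject in list(derived):
                new_fact = (conclusion, subject)
                if pred == premise and new_fact not in derived:
                    derived.add(new_fact)
                    changed = True
    return derived

closure = forward_chain(facts, rules)
print(("can_be_filled", "jug") in closure)  # True
```

The same input always gives the same result, which is the property the LLM runs are missing. The hard part then becomes translating the puzzle text into facts and rules.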

Your example is logically inconsistent.
The first statement doesn’t say ‘Most’ fleas can jump; it simply says ‘Fleas can jump.’
***Disclaimer: the following is very sloppy logical form, and is actually predicate logic, not propositional.***

I don’t know how to translate this into propositional logic other than to say something like: Flea(x) implies Jump(x).
The next statement says fleas with less than 4 legs cannot jump.
I don’t know how to translate that into propositional logic other than to say:
(Flea(x) and LessThanFourLegs(x)) implies not Jump(x).

But from these two statements we can derive:
Flea(x) implies not LessThanFourLegs(x)

But you then assert Flea(myFlea) and LessThanFourLegs(myFlea)
We have now reached an inconsistency. From that you can derive anything you want.
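
The inconsistency can be checked mechanically. Here is a rough Python sketch that brute-forces every truth assignment for one flea, reading "Fleas can jump" as "every flea can jump" and "has 3 legs" as "has less than 4 legs". The variable names and that reading of the premises are my assumptions, not a standard encoding.

```python
from itertools import product

def implies(a: bool, b: bool) -> bool:
    """Material implication: a -> b."""
    return (not a) or b

def premises_hold(flea: bool, lt4: bool, jump: bool) -> bool:
    """True if all three premises hold for this assignment (my reading of them)."""
    return (
        implies(flea, jump)                   # Fleas can jump.
        and implies(flea and lt4, not jump)   # Fleas with less than 4 legs cannot jump.
        and flea and lt4                      # My flea has 3 legs.
    )

# Try all 8 combinations of (flea, lt4, jump) and keep those satisfying the premises.
models = [v for v in product([False, True], repeat=3) if premises_hold(*v)]
print(models)  # []
```

An empty list means no assignment satisfies all three premises at once, so any conclusion after "Therefore," follows vacuously.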