Feature Request: Support for Custom Graders Using Prolog or External Logic Solvers in OpenAI Evals

Hi all,

I’m working on research involving reinforcement learning for logic tasks, where I need to provide verifiable rewards using a Prolog solver (or similar logic programming tools). Currently, OpenAI Evals does not support custom graders that can call out to Prolog, or that use external libraries such as Hugging Face’s `evaluate` for logic-based reward computation.

This limitation makes it difficult to run RL experiments or evaluations where the correctness of a model’s output must be checked by a logic program or a symbolic judge. I’d like to request support for custom graders that can:

- Call out to a Prolog engine (e.g., SWI-Prolog) or other logic solvers.
- Use external Python libraries for symbolic or logic-based evaluation.
- Return verifiable, programmatic rewards based on logical correctness.

This would enable research on RL for logic and reasoning, and allow for more robust, automated evaluation pipelines.
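As a rough illustration of the kind of grader I have in mind, here is a minimal sketch that shells out to SWI-Prolog via `subprocess` and maps goal success to a binary reward. All names here (`build_swipl_cmd`, `prolog_reward`, the `kb.pl` file) are hypothetical, and it assumes `swipl` is on `PATH` and that a failed `-g` goal yields a non-zero exit status (the documented SWI-Prolog behavior, but worth verifying for your version):

```python
import subprocess

def build_swipl_cmd(program_path: str, goal: str) -> list[str]:
    # Hypothetical helper: construct a non-interactive swipl invocation.
    # -q suppresses the banner, -g runs the goal after loading the file,
    # -t halt makes swipl exit instead of dropping into the toplevel.
    return ["swipl", "-q", "-g", goal, "-t", "halt", program_path]

def prolog_reward(program_path: str, goal: str, timeout_s: float = 10.0) -> float:
    """Return 1.0 if `goal` is provable against `program_path`, else 0.0."""
    try:
        result = subprocess.run(
            build_swipl_cmd(program_path, goal),
            capture_output=True,
            timeout=timeout_s,
        )
        # SWI-Prolog exits non-zero if the -g goal fails or raises.
        return 1.0 if result.returncode == 0 else 0.0
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # swipl missing or the goal hung: treat as an ungraded failure.
        return 0.0
```

A custom grader hook could then write the model’s output into a Prolog file (or pass it as part of the goal) and call `prolog_reward` to obtain the verifiable signal. The missing piece is an official extension point in Evals where such a function can be registered.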

Is there any plan to support this, or are there recommended workarounds for integrating external logic-based graders with OpenAI Evals?

Thanks!
