HallucinationBench — detect hallucinations in RAG output, now on PyPI

Hi everyone,

Just published HallucinationBench to PyPI — a lightweight library for
detecting hallucinations in RAG pipeline output.

pip install hallucinationbench

Usage:

from hallucinationbench import score

result = score(context=docs, response=llm_output)
print(result.verdict) # PASS / WARN / FAIL
print(result.faithfulness_score) # 0.0 – 1.0
print(result.hallucinated_claims) # list of fabricated statements

It uses GPT-4o-mini as a structured judge (~$0.001 per eval).
No embeddings, no vector DB, no infrastructure.
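To make the judge setup concrete, here's roughly the shape of the call — the prompt text and helper name are illustrative, not hallucinationbench's actual internals:

```python
import json

# Illustrative judge prompt; the real prompt in hallucinationbench may differ.
# Double braces produce literal braces after .format().
JUDGE_PROMPT = (
    "You are a strict fact-checker. Compare RESPONSE against CONTEXT and "
    'return JSON of the form {{"faithfulness_score": 0.0-1.0, '
    '"hallucinated_claims": ["..."]}}.\n\n'
    "CONTEXT:\n{context}\n\nRESPONSE:\n{response}"
)

def build_judge_messages(context: str, response: str) -> list[dict]:
    """Assemble the chat messages for a single judge call."""
    return [
        {"role": "user",
         "content": JUDGE_PROMPT.format(context=context, response=response)}
    ]

# The call itself (needs an OpenAI client and API key):
# completion = client.chat.completions.create(
#     model="gpt-4o-mini",
#     temperature=0,                            # minimize sampling variance
#     response_format={"type": "json_object"},  # force syntactically valid JSON
#     messages=build_judge_messages(docs, llm_output),
# )
# data = json.loads(completion.choices[0].message.content)
```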

Two design decisions I'd love this community's feedback on:

  1. Using response_format: json_object with temperature=0 for
    deterministic structured output — any edge cases I should handle?

  2. Verdict thresholds (PASS >= 0.8, WARN >= 0.5, FAIL < 0.5) —
    do these feel right for production RAG systems?
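To make both questions concrete, here's a minimal sketch (function names are illustrative, not the library's actual internals): defensive parsing for point 1 — json_object guarantees valid JSON syntax but not the schema, so keys can be missing and numbers out of range — and the threshold mapping from point 2:

```python
import json

def parse_judge_output(raw: str) -> dict:
    """Defensively parse the judge's JSON: default missing keys,
    clamp the score to [0, 1], coerce claims to a list."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"faithfulness_score": 0.0, "hallucinated_claims": [],
                "parse_error": True}
    try:
        score = min(1.0, max(0.0, float(data.get("faithfulness_score", 0.0))))
    except (TypeError, ValueError):
        score = 0.0
    claims = data.get("hallucinated_claims", [])
    if not isinstance(claims, list):
        claims = [str(claims)]
    return {"faithfulness_score": score, "hallucinated_claims": claims,
            "parse_error": False}

def verdict(score: float) -> str:
    """Map a faithfulness score to the thresholds above:
    PASS >= 0.8, WARN >= 0.5, FAIL < 0.5."""
    if score >= 0.8:
        return "PASS"
    if score >= 0.5:
        return "WARN"
    return "FAIL"
```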

PyPI: https://pypi.org/project/hallucinationbench/
GitHub: https://github.com/bdeva1975/hallucinationbench

Feedback and PRs welcome!