Hi everyone,
Just published HallucinationBench to PyPI — a lightweight library for
detecting hallucinations in RAG pipeline output.
```
pip install hallucinationbench
```
Usage:

```python
from hallucinationbench import score

result = score(context=docs, response=llm_output)
print(result.verdict)             # PASS / WARN / FAIL
print(result.faithfulness_score)  # 0.0 – 1.0
print(result.hallucinated_claims) # list of fabricated statements
```
It uses GPT-4o-mini as a structured judge (~$0.001 per eval).
No embeddings, no vector DB, no infrastructure.
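For anyone curious what the structured-judge pattern involves, here is a minimal, hypothetical sketch of validating a judge model's JSON reply. The field names (`faithfulness_score`, `hallucinated_claims`) mirror the result object above, but this is an illustration of the pattern, not the library's actual internals:

```python
import json

# Hypothetical sketch: validate the JSON a judge model returns when asked
# for structured output. Even with response_format=json_object and
# temperature=0, defensive parsing is cheap insurance.
def parse_judge_json(raw: str) -> dict:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        raise ValueError(f"Judge returned non-JSON output: {raw[:80]!r}")

    score = data.get("faithfulness_score")
    if not isinstance(score, (int, float)) or not 0.0 <= score <= 1.0:
        raise ValueError(f"faithfulness_score out of range: {score!r}")

    claims = data.get("hallucinated_claims", [])
    if not isinstance(claims, list):
        raise ValueError("hallucinated_claims must be a list")

    return {"faithfulness_score": float(score), "hallucinated_claims": claims}

ok = parse_judge_json('{"faithfulness_score": 0.9, "hallucinated_claims": []}')
print(ok["faithfulness_score"])  # 0.9
```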
Two design decisions I would love feedback on from this community:

- Using `response_format: json_object` with `temperature=0` for deterministic structured output. Any edge cases I should handle?
- Verdict thresholds (PASS >= 0.8, WARN >= 0.5, FAIL < 0.5). Do these feel right for production RAG systems?
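On the second question, the thresholds are simple enough to state as code. A sketch using the numbers above (`verdict` is a hypothetical helper name, not necessarily the library's), mainly to make the boundary behavior explicit:

```python
def verdict(faithfulness_score: float) -> str:
    # Thresholds from the post: PASS >= 0.8, WARN >= 0.5, FAIL < 0.5.
    # Note both cutoffs are inclusive, so 0.8 and 0.5 land in the
    # more lenient bucket.
    if faithfulness_score >= 0.8:
        return "PASS"
    if faithfulness_score >= 0.5:
        return "WARN"
    return "FAIL"

print(verdict(0.83))  # PASS
print(verdict(0.5))   # WARN
```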
PyPI: https://pypi.org/project/hallucinationbench/
GitHub: https://github.com/bdeva1975/hallucinationbench
Feedback and PRs welcome!