Hi everyone,
Just published HallucinationBench to PyPI — a lightweight library for
detecting hallucinations in RAG pipeline output.
```
pip install hallucinationbench
```
Usage:

```python
from hallucinationbench import score

result = score(context=docs, response=llm_output)
print(result.verdict)             # PASS / WARN / FAIL
print(result.faithfulness_score)  # 0.0 – 1.0
print(result.hallucinated_claims) # list of fabricated statements
```
It uses GPT-4o-mini as a structured judge (~$0.001 per eval).
No embeddings, no vector DB, no infrastructure.
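For anyone curious what the structured-judge pattern involves, here is a minimal, hypothetical sketch of validating a judge model's JSON reply. The field names (`faithfulness_score`, `hallucinated_claims`) mirror the result object above, but this is an illustration of the pattern, not the library's actual internals:

```python
import json

# Hypothetical sketch: validate the JSON a judge model returns when asked
# for structured output. Even with response_format=json_object and
# temperature=0, defensive parsing is cheap insurance.
def parse_judge_json(raw: str) -> dict:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        raise ValueError(f"Judge returned non-JSON output: {raw[:80]!r}")

    score = data.get("faithfulness_score")
    if not isinstance(score, (int, float)) or not 0.0 <= score <= 1.0:
        raise ValueError(f"faithfulness_score out of range: {score!r}")

    claims = data.get("hallucinated_claims", [])
    if not isinstance(claims, list):
        raise ValueError("hallucinated_claims must be a list")

    return {"faithfulness_score": float(score), "hallucinated_claims": claims}

ok = parse_judge_json('{"faithfulness_score": 0.9, "hallucinated_claims": []}')
print(ok["faithfulness_score"])  # 0.9
```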
Two design decisions I would love feedback on from this community:

- Using `response_format: json_object` with `temperature=0` for deterministic structured output. Any edge cases I should handle?
- Verdict thresholds (PASS >= 0.8, WARN >= 0.5, FAIL < 0.5). Do these feel right for production RAG systems?
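On the second question, the thresholds are simple enough to state as code. A sketch using the numbers above (`verdict` is a hypothetical helper name, not necessarily the library's), mainly to make the boundary behavior explicit:

```python
def verdict(faithfulness_score: float) -> str:
    # Thresholds from the post: PASS >= 0.8, WARN >= 0.5, FAIL < 0.5.
    # Note both cutoffs are inclusive, so 0.8 and 0.5 land in the
    # more lenient bucket.
    if faithfulness_score >= 0.8:
        return "PASS"
    if faithfulness_score >= 0.5:
        return "WARN"
    return "FAIL"

print(verdict(0.83))  # PASS
print(verdict(0.5))   # WARN
```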
PyPI: https://pypi.org/project/hallucinationbench/
GitHub: https://github.com/bdeva1975/hallucinationbench
Feedback and PRs welcome!