Below is a short description of my vision for solving the problem of hallucinations. The proposed tool will allow an LLM to form a module of value judgments and introduce discussion into its reasoning. As a result, substituting desired answers with contextually neutral “delusions” or “confabulations” will become difficult.
Problems:
First, there are the so-called “hallucinations” (delusions, deep dreaming, etc., hereinafter referred to as DD), which result in the generation of false, incorrect, or inadequate content.
Second, there is the issue of weak texts used for training language models. In addition, the data available for large-scale sampling is becoming exhausted, leading to the accumulation of repetition and redundancy. As a result, after penalization, accuracy drops and semantic quality diminishes.
Evaluating logic and factual accuracy based on “internet queries” is also problematic, as AGI cannot independently “be logical and accurate.”
New possibilities offered by RAG, integrated LLM judges, the incorporation of so-called “reasoning” at the answer evaluation and correction stage, and their analogues have so far proven effective only at the level of test versions such as GPT-o1.
Proposed Solution:
We propose a fundamentally different approach. AGI should be supplemented with an additional module that will create systems of hierarchical assessments similar to those found in humans.
However, it is important to understand that this “similarity” is purely formal and even conditional: the multi-layered configurations of social-hierarchical evaluations and the behavioral patterns associated with them, which the human mind constructs during ontogeny, are not only excessively difficult to reproduce but also pointless and unproductive. Without the human skill of instantaneously transforming any contradiction into an act of behavior (through reflection in the sensory and emotional sphere), the neural network will receive not a way to solve complex problems but an unnecessary burden that draws additional computational resources.
1.0.
First of all, several options should be developed for building the base of the AI’s evaluation model, allowing it to form its own system of preferences and hierarchies. This base is built from texts that the AI must analyze and evaluate. As an example, we propose three effective grid/scale options that can be used to form a specialized analysis module that adjusts all existing indexes.
1.0.1. Scale A: Impersonality vs. Individualization.
This scale evaluates the quantitative presence of lexical and grammatical indicators of authorization and non-authorization in the text. In Russian, for example, these can be counted via reflexive forms, participial constructions, and third-person usage (where the third person acts as the agent). If such elements occur more often than average, the text is labeled as more impersonal. Naturally, individual sections within the text will also exhibit higher or lower levels of impersonality/individualization.
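The following is a minimal sketch of how a Scale A score could be computed, assuming crude suffix heuristics for Russian reflexive and participial forms; the marker patterns and threshold logic are illustrative, and a real system would use a morphological tagger (e.g. pymorphy2) rather than regular expressions.

```python
import re

# Crude, illustrative markers of impersonality in Russian:
# reflexive verb endings and common participial stems.
REFLEXIVE = re.compile(r"\b\w+(?:ся|сь)\b")
PARTICIPIAL = re.compile(r"\b\w+(?:вш|ющ|ащ|ящ|нн)\w*\b")

def impersonality_score(text: str) -> float:
    """Share of tokens carrying impersonality markers (higher = more impersonal)."""
    tokens = re.findall(r"\w+", text)
    if not tokens:
        return 0.0
    hits = len(REFLEXIVE.findall(text)) + len(PARTICIPIAL.findall(text))
    return hits / len(tokens)

def label_scale_a(score: float, corpus_mean: float) -> str:
    """Label a text relative to the corpus average, as described above."""
    return "impersonal" if score > corpus_mean else "individualized"
```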
1.0.2. Scale B: Coherence vs. Fragmentation (and/or Compilation).
Here, a minimum text volume is required: calculating the coherence of a single sentence is much harder than that of a paragraph unless an authorial style standard has already been established. The evaluation method is relatively simple: the distribution of different types of consonants and vowels in the text is counted, after which a scheme of hierarchical transformations and mappings of the “anchor” points of that distribution is built. This scheme serves as a reference matrix algorithm for the phonetic level of text explication (embodiment). The more sections of the text whose vowel and consonant distribution can be described by this reference algorithm, the higher the degree of coherence and integrity of the text.
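The reference matrix algorithm itself (the hierarchical transformations of “anchor” distribution points) is not specified above, so the sketch below works under a strong simplification: the reference is reduced to a target vowel/consonant profile, and coherence is taken as the share of paragraphs that stay close to it. All names and tolerances are placeholders.

```python
import numpy as np

VOWELS = set("аеёиоуыэюяaeiouy")

def vc_profile(segment: str) -> np.ndarray:
    """Return [vowel share, consonant share] over the letters of one segment."""
    letters = [c for c in segment.lower() if c.isalpha()]
    if not letters:
        return np.zeros(2)
    vowels = sum(c in VOWELS for c in letters)
    return np.array([vowels / len(letters), 1 - vowels / len(letters)])

def coherence_score(paragraphs: list[str], reference: np.ndarray, tol: float = 0.05) -> float:
    """Share of paragraphs whose profile deviates from the reference by at most `tol`."""
    profiles = [vc_profile(p) for p in paragraphs if p.strip()]
    if not profiles:
        return 0.0
    hits = sum(float(np.abs(p - reference).max()) <= tol for p in profiles)
    return hits / len(profiles)
```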
1.0.3. Scale C: Antagonism vs. Consolidation of Semantic and Formal Levels of Text Explication.
This scale evaluates the correspondence between the “anchor” sections of the distribution of the formal elements of the text (phonetics, syntax/punctuation, prosody, rhythm, etc.) and the focus/centers of the semantic explication of the text. Simply put, if the semantic center of the text coincides with an “anchor” section (extremum) in the distribution of formal elements, the levels of explication are consolidated. Otherwise, they are antagonistic.
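How exactly the semantic centers and the formal-level extrema are obtained is left open above; the sketch below therefore assumes both are already available per paragraph (a formal signal, e.g. derived from the Scale B profile, and a list of paragraph indices for the semantic centers) and only checks whether they coincide.

```python
import numpy as np

def consolidation_score(formal_signal: list[float],
                        semantic_centers: list[int],
                        window: int = 1) -> float:
    """Share of semantic centers landing within `window` paragraphs of a
    local extremum of the formal signal; 1.0 = fully consolidated levels."""
    sig = np.asarray(formal_signal, dtype=float)
    # Local extrema: points where the slope changes sign (local max or min).
    extrema = [i for i in range(1, len(sig) - 1)
               if (sig[i] - sig[i - 1]) * (sig[i + 1] - sig[i]) < 0]
    if not semantic_centers or not extrema:
        return 0.0
    hits = sum(any(abs(c - e) <= window for e in extrema) for c in semantic_centers)
    return hits / len(semantic_centers)
```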
1.1.
Now that we have defined the three scales for additional evaluation/indexing of the text, we establish the norm: texts with maximum values on the scales of authorization, coherence, and consolidation (scales A, B, and C, respectively) will be marked by our neural network as good. Conversely, texts with high values on the scales of impersonality, fragmentation, and antagonism will be marked as bad. Thus, we create the basis for forming an evaluation judgment module that will be integrated into the neural network.
1.2.
The next step is to divide the available texts into three groups: good, bad, and neutral (provisionally labeled G, F, and E).
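As an illustration of the split, assuming the three scale scores are already normalized to [0, 1] with 1 corresponding to individualization, coherence, and consolidation respectively; the thresholds are placeholders, not values taken from the text above.

```python
def classify_text(a: float, b: float, c: float,
                  hi: float = 0.7, lo: float = 0.3) -> str:
    """Assign a text to group G (good), F (bad) or E (neutral)."""
    if min(a, b, c) >= hi:
        return "G"   # individualized, coherent, consolidated
    if max(a, b, c) <= lo:
        return "F"   # impersonal, fragmented, antagonistic
    return "E"       # everything in between
```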
1.3.
Next, we select the texts from group G (good) and determine the most typical benchmark patterns/algorithms for the distribution of indicators across scales A, B, and C. For further analysis, no more than the top two deciles of the most common patterns/algorithms are needed (see the sketch below). Using these algorithms, we transform the texts in groups F and E (by analogy with sequencing, subsequent alignment, and CRISPR editing): first we structurally reorganize them to match the benchmark G-distribution of the text’s formal elements, then we reorganize and edit the semantics so that they remain as interpretable as possible.
This step, of course, requires manual fine-tuning with the help of an operator.
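One way to read “the top two deciles of the most common patterns” is as the 20% most frequent pattern types in group G; the sketch below assumes each text has already been reduced to a hashable pattern signature by an upstream step that is not defined here.

```python
from collections import Counter

def top_two_deciles(patterns: list[str]) -> list[str]:
    """Return the 20% most frequent pattern types (at least one)."""
    counts = Counter(patterns).most_common()
    cutoff = max(1, len(counts) // 5)   # top two deciles = top 20%
    return [pattern for pattern, _ in counts[:cutoff]]
```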
We follow the same process with the texts from group F (first we identify the benchmark distribution patterns, and then use them to transform the texts from groups G and E). Naturally, all possible reverse transformation trajectories are also checked.
As a result, we should obtain a three-dimensional matrix with axes: G-F, resistant/non-resistant to G-influence, resistant/non-resistant to F-influence, and reversible/irreversible. This provides the next level in the development of the evaluation judgment module within the neural network.
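The resulting matrix can be stored as a set of records, one per transformed text; reading the listed axes as boolean flags, with the G-F axis as the source group label, is an assumption on my part rather than something fixed above.

```python
from dataclasses import dataclass

@dataclass
class TransformationRecord:
    source_group: str      # "G", "F" or "E" (the G-F axis)
    resistant_to_g: bool   # resistant / non-resistant to G-influence
    resistant_to_f: bool   # resistant / non-resistant to F-influence
    reversible: bool       # reversible / irreversible transformation
```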
1.4.
At this level, it will become possible to distinguish active/conflict evaluation judgments from passive/defensive ones. This will allow for the targeted training and preparation of fundamentally aggressive/conflict-oriented LLMs. Their self-evaluation models of hierarchies and worldviews could be used as filters and markers for exceeding the acceptable DD level.
Thus, it is possible to bring DD into a regular/normative form and achieve predictability of the onset of this “thinking” phase. In the case of AI, this means making DD occurrences conscious and controllable.
Additionally, the tools required for indexing texts according to scale B criteria also allow for emotional-psychological analysis, identifying false/anomalous statements within a given text.