Medical summary hallucination study - interesting read

I thought this was well worth the read for everyone using LLMs for summarization. I like the way they worked on defining the types of hallucinations. I did think their prompt could be much better, and I’m curious how much influence the hard 500-word limit had.

https://openreview.net/pdf?id=6eMIzKFOpJ

They also seem to want to prove that ‘out of the box prompts’ are not ideal: ‘We conjecture that LLMs have potential to accelerate hallucination detection due to their ability to perform complicated reasoning and identify inconsistencies. However, the current results indicate that domain-specific training or better prompting are necessary to improve the precision and recall of LLM-based hallucination detection systems.’

3 Likes

Well, I thought that was common sense :laughing:

They might also suffer from poor model choice in general. They might be choosing small models which are very persuasive, but not very accurate (e.g. 4o), which is evidenced by what they define as generalization hallucinations. (“spec => gen”)

I also can’t seem to find which Llama 3 version they’re talking about.

3 Likes

Thanks for sharing!

Having gone over the article, I feel that the paper actually shortchanges itself in the following sense.

They have identified 5 types of hallucinations. So they could have used a multi-agent model to clearly separate the detection mechanism into five separate categories and then collapsed the reasoning of those five agents into one.

Additionally, they could have given the initial summarizer the feedback from the five separate agents as in-context guidance, so it could redo the summarization based on that specific and actionable feedback.
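A minimal sketch of that idea, assuming one detector call per hallucination category. Everything here is hypothetical: the `Finding` type, the `detect` callable (which would wrap whatever model call you use), and the category labels. Only “spec => gen” is named in this thread; the other four are placeholders for the paper’s remaining categories.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Placeholder labels: only "spec => gen" is named in this thread.
CATEGORIES = ["spec => gen", "category_2", "category_3", "category_4", "category_5"]


@dataclass
class Finding:
    category: str  # which detector raised it
    span: str      # the summary text that was flagged
    reason: str    # the detector's explanation


# A detector takes (category, source_text, summary) and returns findings.
# In practice this would wrap a model call with a category-specific prompt.
Detector = Callable[[str, str, str], List[Finding]]


def run_detectors(source: str, summary: str, detect: Detector,
                  categories: List[str] = CATEGORIES) -> List[Finding]:
    """Run one narrowly scoped detector per category and pool the results."""
    findings: List[Finding] = []
    for category in categories:
        findings.extend(detect(category, source, summary))
    return findings


def merge_findings(findings: List[Finding]) -> Dict[str, List[Finding]]:
    """Collapse the per-category results into one view, grouped by flagged span."""
    merged: Dict[str, List[Finding]] = {}
    for finding in findings:
        merged.setdefault(finding.span, []).append(finding)
    return merged


def build_revision_feedback(merged: Dict[str, List[Finding]]) -> str:
    """Turn merged findings into a bullet list to hand back to the summarizer."""
    lines = []
    for span, items in merged.items():
        cats = ", ".join(sorted({f.category for f in items}))
        reasons = "; ".join(f.reason for f in items)
        lines.append(f'- "{span}" [{cats}]: {reasons}')
    return "\n".join(lines)
```

The string returned by `build_revision_feedback` is what would go back into the summarizer’s context for the second pass.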

1 Like

I agree - and they do mention a few of those thoughts at the end, I believe. Personally, I thought the hallucination classification was the most interesting part of the whole piece.

For complex tasks, you need better prompting. But it’s not so much about detecting hallucinations as about setting up the instance to avoid them in the first place. You can still do detection after the fact, and it’s good to do so, but everything is about good prompting. I don’t know about them saying out of the box prompting is bad though. Then again, they didn’t exactly use out of the box prompting.

One of the key ways to overcome hallucinations is hierarchical structures. If you present what you want as a hierarchy, the system is able to gain more insight on what you want and deliver more precisely. They just used a simple prompt.

If I were to accomplish this task, I would load up a medical AI persona with a hierarchical skill chain, which would give it more succinct instructions on how to carry out the task. Then the prompt instructions after the persona was loaded would indicate not only what I wanted for an output (e.g. a summary of patient history), but also give an example of what I was looking for or provide an outline format. An outline would be another type of hierarchical structure.
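As a rough illustration of the outline idea, here is a sketch that renders a nested structure into an indented outline and places it after a persona and task description. The function names, the persona text, and the section names are all made up for illustration; they are not from the paper.

```python
from typing import Dict, List, Union

# An outline node maps a heading to either nested headings or a list of leaf items.
Outline = Dict[str, Union["Outline", List[str]]]


def render_outline(node: Outline, depth: int = 0) -> List[str]:
    """Render a nested outline as indented bullet lines, two spaces per level."""
    lines: List[str] = []
    for title, children in node.items():
        lines.append("  " * depth + f"- {title}")
        if isinstance(children, dict):
            lines.extend(render_outline(children, depth + 1))
        else:
            for item in children:
                lines.append("  " * (depth + 1) + f"- {item}")
    return lines


def build_prompt(persona: str, task: str, outline: Outline) -> str:
    """Combine persona, task, and the rendered outline into one prompt string."""
    parts = [persona.strip(), "", f"Task: {task.strip()}", "", "Follow this outline:"]
    parts.extend(render_outline(outline))
    return "\n".join(parts)


# Hypothetical example values.
example_outline: Outline = {
    "Patient history": ["Diagnoses with dates", "Procedures with dates"],
    "Current status": {"Medications": ["Name and dose"], "Open questions": ["Only if stated in the record"]},
}
prompt = build_prompt(
    "You are a careful clinical summarizer. Use only facts stated in the record.",
    "Summarize the patient history.",
    example_outline,
)
```

The point of the nesting is that each leaf tells the model exactly what belongs under which heading, rather than leaving it to infer the structure from one sentence.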

This study seems to follow the Principle of Least Effort: put in the least amount of effort and expect maximum output. It doesn’t work that way.

In fairness, they may have been trying to reproduce hallucination effects, but I would have liked to see them explore the hallucination results in more detail, understanding them rather than just saying, “It does this.” It seems more about identifying that something happens and less about understanding it. And for most of us, they aren’t identifying anything we don’t already know.

Correct me if I missed this, but they also seem to have tested the records just once for each model. When I test a new prompt, a new skill chain, or a new persona, I test it thoroughly rather than running it once and drawing a conclusion. They also didn’t try different prompts to see if they would get the same results. They had a sample size of 50 that they tested on two different systems, and then had experts look it over to see if it was correct. They didn’t do enough. I don’t think this study should be used to draw any concrete conclusion about LLMs; rather, it should prompt other people to test this at a larger scale to see if the results hold. It’s like they were trying to get it to fail so they could say, “Hey, LLM is unreliable, don’t use it!!!”
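To put a number on why 50 samples leaves a lot of room, here is a small sketch of a Wilson score interval for a proportion, such as the share of summaries with at least one hallucination. The counts in the example are invented for illustration, not taken from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Invented counts: the same observed rate at two sample sizes.
small_low, small_high = wilson_interval(25, 50)
large_low, large_high = wilson_interval(250, 500)
```

With 25 of 50 the interval runs from roughly 0.37 to 0.63, so the same observed rate is compatible with quite different true rates. Repeating each record several times and testing more records both shrink that range.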

I agree! This is really a simple ‘out of the box model’ and ‘let’s do this prompt’ setup, with the focus much more on being able to detect and count mistakes (which is obviously very important/helpful for testing the efficiency) - I know there are so many companies working on doing this (better) - they will be able to use this report as ‘proof’ that you need their ‘custom’ model / prompt because the out of the box model is not good enough :slight_smile:

The more I keep exploring this, the more problems I find. Their methodology is so bonkers. If this is the level of accepted research studies, then I need to start publishing papers.

First, I see nothing in here about any sort of blind study. Did the clinicians know which model they were grading?

Secondly, after ChatGPT and the other model wrote the summaries, they used another AI tool to find hallucinations. The thing about AI is that if you ask it whether there is a problem, it will often find one, even if there is none: a false positive. A blind study should have been done with clinicians looking at the results before the hallucination detection ran, to see whether there actually were hallucinations. That would test the reliability of the second system they used. It seems the clinicians only identified the type of hallucination, not whether it was a valid hallucination. Given that most hallucinations were labeled “generalized”, it makes me wonder.
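If clinicians labeled the summaries independently first, the detector’s reliability could be scored against those labels. A minimal sketch, assuming each flagged item has some identifier (for example a sentence index) and that the clinician labels are treated as ground truth; the function name and the IDs are hypothetical.

```python
from typing import Set, Tuple


def precision_recall(detector_flags: Set[int], clinician_flags: Set[int]) -> Tuple[float, float]:
    """Score detector flags against clinician labels treated as ground truth.

    Precision: share of detector flags the clinicians agree with.
    Recall: share of clinician-identified hallucinations the detector caught.
    """
    true_pos = len(detector_flags & clinician_flags)
    false_pos = len(detector_flags - clinician_flags)   # flagged, but not real
    false_neg = len(clinician_flags - detector_flags)   # real, but missed
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


# Invented example: sentence IDs flagged by each side.
precision, recall = precision_recall({1, 2, 3, 4}, {2, 3, 5})
```

Low precision here would be exactly the “finds a problem because you asked for one” failure described above.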

I agree with you @jlvanhulst about detecting and counting mistakes as a methodology for future studies, creating metrics. But they made so many mistakes in how they conducted this that I find the study invalid.

And yes, you are correct. This will be cited by other studies to support the conclusion that you should use their product because the “out of the box model” (which now I understand what you meant by that, apologies) is inadequate. Which is probably why the study was so poorly done to begin with.

The company is called Mendel.ai, I think, from looking up the authors. And most of them are research interns. So yes, it is easy to find fault, but I did not share it for that purpose :slight_smile:

2 Likes

Under the authors up top it does say:

“1University of Massachusetts Amherst, 2Mendel AI, *Equal contributions.”

I apologize if I took the conversation in a different direction than you intended. But I take this as inspiration: if they can produce a bad study, then I too can produce a bad study.

For your initial question, I agree they could have made the prompt better, and limiting the output size can help, but I would limit the number of paragraphs, not the number of words. Generally, limiting output prevents hallucinations, but the AI doesn’t count words, so trying to do it by word count could cause problems.
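Whichever limit goes in the prompt, it is cheap to check the output afterwards instead of trusting the model to count. A small sketch; the function names and the message format are my own, and “paragraph” here just means a block separated by a blank line.

```python
import re
from typing import List, Optional


def count_words(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def count_paragraphs(text: str) -> int:
    """Count non-empty blocks separated by one or more blank lines."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return sum(1 for block in blocks if block.strip())


def check_limits(text: str, max_words: Optional[int] = None,
                 max_paragraphs: Optional[int] = None) -> List[str]:
    """Return a list of limit violations; an empty list means the text fits."""
    problems: List[str] = []
    words = count_words(text)
    paragraphs = count_paragraphs(text)
    if max_words is not None and words > max_words:
        problems.append(f"{words} words exceeds limit of {max_words}")
    if max_paragraphs is not None and paragraphs > max_paragraphs:
        problems.append(f"{paragraphs} paragraphs exceeds limit of {max_paragraphs}")
    return problems
```

A check like this also makes it possible to see how often a hard word limit is actually respected across runs.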

2 Likes

Academia is in shambles, has been for a while :laughing:

1 Like

I think of the quote from Futurama when Fry strives to be a college dropout again and Leela says, "Everyone knows 20th century colleges were basically expensive daycare centers."

1 Like

I think my gripe might be more with ‘inspired to produce a bad study’ - I do agree there are tons of things that you can call ‘wrong’. I was inspired by the qualification method they used AND I saw tons of things I would have done differently in the prompt.
But it inspired me to do more work on observability internally - and I shared it with that purpose. These are (mostly) grad students, let’s keep that in mind. When was the last time that any of us did an experiment in a controlled way and then published a paper on it, ripe for the type of review we are giving it?
Let’s try to inspire each other to do research, discover, share and improve!

To @Diet’s point about academia - each study is the basis for doing another one.

2 Likes

“I think my gripe might be more with ‘inspired to produce a bad study’ - I do agree there are tons of things that you can call ‘wrong’. I was inspired by the qualification method they used AND I saw tons of things I would have done differently in the prompt.”

I was just making a joke. I wouldn’t be as critical as I am if I allowed myself such sloppiness in undertaking that venture.

But the point you made is that these are grad students. I have a bachelor’s in computers and no formal training in how to conduct studies, but I’ve done a lot of self-learning and read a lot of studies, so I know what is needed for a good one and many of the shortcuts used for confirmation bias. And if I know that, then I would hope grad students would know that.

I agree with your sentiment, “Let’s try to inspire each other to do research, discover, share and improve!”

2 Likes