Newbie Question: Verifying Hallucinations and Key Details in AI Summaries

Hi everyone!

I’m currently working on a task involving a long meeting transcript (about 10,000 tokens) and an AI-generated summary based on it. My goals are:

  1. To check if the summary contains any hallucinated information that wasn’t present in the original transcript.

  2. To ensure the summary includes the key details I care about, such as important points and specific items.

I’m not a programming expert, and I mainly use OpenAI’s Assistant API for this kind of work.

I’d love to get your advice on:

  • How do I design effective instructions for a GPT Assistant to perform these checks?

  • Any tools, methods, or workflows you’d recommend to accurately compare the transcript with the AI summary, identify hallucinations, and verify the inclusion of key details?

Thank you so much for your suggestions and guidance! Looking forward to your insights.

2 Likes

Hello Police, I want to report a newbie working on prohibited and potentially dangerous stuff! Make sure to put him into a learning center for at least 20 years, or make him sign a paper promising never to touch this kind of application development again!

This is serious! Do not do that!

Keep your fingers away from medication dosage recommendations based on GPT models.

Are you nuts?

Please read this: https://openai.com/policies/usage-policies/

No, seriously… The models constantly change. You can't even build an application that produces a reliably safe cooking recipe with o1 - it might figure out that adding chlorine is dangerous and add a warning, or refuse to answer at all -
but when it is supposed to think deeper - e.g. that the chicken in this example might not be usable after being stored at room temperature - warnings may show up in a full-text answer but not when you ask for a specific output format.

And medication? Medical drug interactions? You will have to get a lot of information about the patient before you can do that reliably - probably a lot more than fits into the model's context…

Even if your intention is somewhat different - let's say you are going to make summaries for medical doctors… they will rely on this information. Really, this is not a use case for GPT models - yet.

Maybe you want to look into specialized models that are trained on such data and get deep into machine learning and… nah, forget about it… this seriously needs 20+ years of experience as a developer, vast experience in AI development, and being a medical doctor yourself also helps.

Hi @jason123! Super question!

There are several approaches that can be attempted:

  • If you already have some of this key data recorded somewhere in a metadata/structured format (e.g. name, treatment, medication dosage), then you can use it to “ground” the LLM summary (see here for more on grounding)
  • Implement a separate procedure for extracting certain pre-defined information (names, treatment types, etc.). You can either use an existing LLM with a prompt to do this, or fine-tune your own entity extraction model. Then use the extracted data in conjunction with the summary.
  • Perform multiple summaries (e.g. three summaries) of the same report, then use an LLM as a “judge” of the three summaries (a rough sketch of the judging idea follows this list)
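
To make the “judge” idea concrete, here is a minimal Python sketch, adapted to judge each summary claim directly against the source transcript (which is the closest fit to the hallucination check you describe). It assumes the openai Python package (v1+) and an OPENAI_API_KEY in the environment; the model name, the prompt wording, and the naive sentence split are placeholders you would tune for your own data.

```python
# Claim-level "judge": check each summary sentence against the source text.
# Assumes the openai Python package (v1+); model name and prompt are examples.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are verifying a summary against its source transcript.

Transcript:
{transcript}

Claim taken from the summary:
{claim}

Answer with exactly one word: SUPPORTED if the claim is stated in the
transcript, NOT_SUPPORTED if it is not."""

def check_claims(transcript: str, summary: str, model: str = "gpt-4o"):
    """Return (claim, verdict) pairs, one per summary sentence."""
    results = []
    claims = [s.strip() for s in summary.split(".") if s.strip()]  # naive split
    for claim in claims:
        resp = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[{
                "role": "user",
                "content": JUDGE_PROMPT.format(transcript=transcript, claim=claim),
            }],
        )
        results.append((claim, resp.choices[0].message.content.strip()))
    return results
```

Anything that comes back NOT_SUPPORTED is a candidate hallucination to review by hand.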

Some more tips when implementing this:

  • Get the LLM to also emit confidence levels (e.g. low/medium/high) on summaries or on specific extractions (such as dosages); see the sketch after this list
  • For anything very specific and critical, such as dosages or the number of applications per day, it is probably best to make those values clearly visible to the user and ask for a human review of them anyway.
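
As a rough illustration of those last two tips, here is a sketch that asks for low/medium/high confidence on a few extracted fields and routes anything not marked high to a human reviewer. It again assumes the openai package (v1+); JSON mode, the field names, and the model name are assumptions to adapt to your own reports.

```python
# Extraction with confidence levels; anything not high-confidence is flagged.
# Assumes the openai Python package (v1+) and a model that supports JSON mode.
import json
from openai import OpenAI

client = OpenAI()

EXTRACT_PROMPT = """From the report below, extract the fields
"patient_condition", "hospitalization" and "diagnosis".
For each field return an object with "value" and "confidence"
("low", "medium" or "high"). Respond with a single JSON object.

Report:
{report}"""

def extract_with_confidence(report: str, model: str = "gpt-4o"):
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        response_format={"type": "json_object"},  # JSON mode
        messages=[{"role": "user", "content": EXTRACT_PROMPT.format(report=report)}],
    )
    data = json.loads(resp.choices[0].message.content)
    # Anything not marked high-confidence is routed to a human reviewer.
    needs_review = {k: v for k, v in data.items()
                    if not isinstance(v, dict) or v.get("confidence") != "high"}
    return data, needs_review
```

The needs_review part is what you would surface prominently in the UI for those critical values.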
2 Likes

No! Did you read the article? It is itself full of warnings that semantic search might not be the best solution - it even says you should consider using keyword search, which takes us back 10+ years…
And yes, you can do that.

You can stack tons of evaluation methods on top - but come on, be realistic. This is absolute pro-level work, and even the pros won't manage it with GPT models! The models may be helpful at most - but the core of such a system must be something completely different.

Fine-tune your own model? On what? Thousands of medical documents? Why not use PubMed or SNOMED while we are at it? And chunking that is not a trivial task either - let alone the time it takes to do it.

Yeah, let three interns create a summary and use another intern to check the result. That sounds like a strategy you can use for finding a link to an FAQ article or creating an SEO-optimized article - but medication dosage extraction? What if it “hallucinates” a hundred-fold dose because the document has a strange structure?

page 1

...
100 mg
daily dose  

page 2
5000 mg 
daily dose  
50 mg


@jason123 mentioned they tried the Assistant API. I did as well… it has at most a 50% success rate at extracting data from the documents I tried…

I am all about “let’s learn”!

But I am also saying that this stuff (medical information systems based on GPT models) will, or at least should, never go into production.

Even when you add an agentic network with tons of specialized models on top, add another layer on top of the intern that checks the intern, use specialized models like BioGPT (which was trained on PubMed), and add keyword search…

There has been research on that topic: when medical personnel are given access to a tool that gives them information, they will rely on it.

1 Like

That is the way!

@jason123, please promise to implement that.

A prompt like

“Hey model there is a dosage recommendation for drug XY - it says N pills per day - is that correct?”

will not help if the model doesn't know the drug.

And the only correct answer would be “it depends on who is going to take it and whether the patient has any problems with the ingredients”.


When you find a way to extract it and then run the mechanism, let's say, 100 times in parallel and only accept a result when all runs agree, you can push a 5% failure rate towards 0.05% - but first you have to reach that goal, which would be fantastic, and even then there is a chance that someone dies from it…
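
For reference, this is roughly what that “run it many times and only trust unanimous agreement” idea looks like in code. extract_fn stands in for whatever extraction routine is actually used, and the run count is kept small here - both are assumptions, and none of this removes the underlying reliability problem.

```python
# Run the same extraction several times and only accept a unanimous result.
# extract_fn is a placeholder for whatever extractor is actually used.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def extract_by_consensus(report: str, extract_fn, runs: int = 5):
    """Run extract_fn several times; return the result only if every run agrees."""
    with ThreadPoolExecutor(max_workers=runs) as pool:
        outputs = list(pool.map(lambda _: extract_fn(report), range(runs)))
    counts = Counter(str(o) for o in outputs)
    _, top_freq = counts.most_common(1)[0]
    if top_freq == runs:
        return outputs[0]   # unanimous: accept automatically
    return None             # any disagreement: escalate to a human reviewer
```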

1 Like

Thank you both for your suggestions. I revised my question later and didn't realize that the example I gave would be taken so seriously.

Basically, I just want to work with secondary data in the form of long-text case reports, along with summaries of them that others generated with AI. I was wondering how to check whether an AI-generated summary includes hallucinated information that is not present in the original long text.

Additionally, I’d like to know whether the summary contains the key elements I’m interested in (for example, whether it accurately presents the patient’s condition, hospitalization, and basic diagnostic details mentioned in the long-text report).

2 Likes