I have a document (~30 pages) which contains a table of questions and answers stretching over four pages.
The content is extracted from a PDF. Not perfect, but quite good. Tables are well-formatted etc.
I want the LLM (4o or 4o-mini) to extract the Q&As which satisfy certain criteria. However, I simply can’t get it to return all the Q&As which satisfy the criteria. It usually returns the first few and last few.
When I ask it to return all (without the criteria), it returns the correct list — i.e. it knows that all the Q&As are part of the same section.
The criterion is simple (answer == ‘Yes’).
Has anyone experienced something like this? If so, did you find a way to solve it?
(PS: I’d prefer the LLM to do the work rather than, e.g., a SQL query.)
This is not an uncommon problem for documents of this size.
One way to overcome this is to chunk the text, feed the model the chunks individually with the request to extract the Q&As that meet your criteria, and then concatenate the results.
It costs a little extra because you have to re-send the prompt with every chunk, but if the prompt is not too large these extra costs should be marginal.
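For illustration, here is a minimal sketch of that chunk-and-concatenate loop. The chunking helper, prompt wording, and variable names are placeholders, not a prescribed implementation:

import openai

client = openai.OpenAI()

# Illustrative only: split the extracted document into slightly overlapping
# chunks (here by paragraph count) so each call sees a manageable slice.
def chunk_text(text: str, paras_per_chunk: int = 20, overlap: int = 2) -> list[str]:
    paras = text.split("\n\n")
    step = paras_per_chunk - overlap
    return ["\n\n".join(paras[i : i + paras_per_chunk]) for i in range(0, len(paras), step)]

def extract_yes_qas(chunk: str) -> str:
    r = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Extract every Q&A pair whose answer is 'Yes'. Return one pair per line; return nothing if there are none."},
            {"role": "user", "content": chunk},
        ],
    )
    return r.choices[0].message.content or ""

doc_text = "..."  # your extracted PDF text goes here
results = [extract_yes_qas(c) for c in chunk_text(doc_text)]
all_yes_qas = "\n".join(r for r in results if r.strip())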
@jr.2509 's answer is on point. Another thing you can try is to structure it into two phases: (1) question extraction, (2) answer extraction.
So first collect all the questions in one call, and then in one or more calls extract the answers for those questions. You might be able to get away with just two calls, or alternatively you can “batch” the questions (e.g. take 3-4 at once).
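A rough sketch of that two-phase flow (the prompts and batch size here are just placeholders, adjust to your document):

import openai

client = openai.OpenAI()
doc_text = "..."  # the Q&A section extracted from the PDF

# Phase 1: one call that only lists the questions.
q_resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "List every question in this Q&A table, one per line. Do not include the answers."},
        {"role": "user", "content": doc_text},
    ],
)
questions = [q for q in q_resp.choices[0].message.content.splitlines() if q.strip()]

# Phase 2: extract answers in small batches (e.g. 3-4 questions per call).
answers = []
for i in range(0, len(questions), 4):
    batch = questions[i : i + 4]
    a_resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "For each question below, return 'question -> answer' exactly as answered in the document."},
            {"role": "user", "content": doc_text + "\n\nQuestions:\n" + "\n".join(batch)},
        ],
    )
    answers.append(a_resp.choices[0].message.content)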
Agents are, in the most naive sense, just specialized prompts, so the same rules apply there - some kind of chunking or phased approach. E.g. one agent acts as a “question extractor” based on the criteria, and sends the questions to another agent that acts as the “answer extractor”.
Whether you use agents or just standard ChatCompletion calls, one rule of thumb I recommend is to stick to a single “instance” per call. What I mean is: if you are, say, extracting information about cities, you can extract multiple attributes (location, size, topology, etc.), as long as you are dealing with a specific “instance” (in this case, a specific city). As soon as you start extracting attributes for multiple instances, things go south, and the problem scales with the amount of data the LLM has to parse. And this maps to your question/answer problem as well.
It might be possible to overcome this with some clever prompts (maybe others have tips), but this is one rule that hasn’t let me down so far, including with other LLMs and even open-weight ones.
I’m not sure I understand your point regarding sticking to a single instance per call.
What would it look like to apply this rule to OP’s problem? In other words, what is the “instance” in this case, and what would it look like to stick to a single instance per call?
Would it be making repeated LLM calls, specifically asking for one question-answer pair per prompt, and then combining the results?
It’s well known that LLMs need to see themselves think. Based on your description, it seems like there’s some implied logic that the LLM must perform that is precluded from the output. For example, if you tell the model to think about something and only output the results – that “thinking” doesn’t really happen since the model didn’t have the opportunity to output it. This can result in hallucinations and inaccuracies, such as what you are experiencing. Since the LLM is capable of extracting all the QAs from the text, the next step is to use the LLM to annotate the questions with the answer so you can filter them out programmatically. Here’s a quick and dirty…
import pydantic
import typing

import openai

client = openai.OpenAI()

src_qa_data = """\
Q1: Is the sky blue?
A1: Yes.
Q2: Can humans breathe underwater without assistance?
A2: No.
Q3: Do penguins live in the Arctic?
A3: No.
Q4: Is water made up of hydrogen and oxygen?
A4: Yes.
Q5: Can plants produce their own food?
A5: Yes.
"""


class QAItem(pydantic.BaseModel):
    question: str
    answer: typing.Literal["yes", "no", "other"]


class QAExtractor(pydantic.BaseModel):
    """Use this tool to extract ALL of the QA items from the source text."""

    questions_answers: list[QAItem] = pydantic.Field(
        description="ALWAYS Gather ALL QA items from the source text."
    )
    is_all_qa_included: bool = pydantic.Field(
        description="Have all QA items been included in the `questions_answers` from the source text?"
    )


# Structured-output call: the model annotates every QA pair with a normalized answer.
r = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    response_format=QAExtractor,
    messages=[
        {"role": "user", "content": src_qa_data},
    ],
)

qa_items = r.choices[0].message.parsed

# Filter programmatically instead of asking the model to filter.
if qa_items.is_all_qa_included:
    for qa in qa_items.questions_answers:
        if qa.answer == "yes":
            print(qa)

# question='Is the sky blue?' answer='yes'
# question='Is water made up of hydrogen and oxygen?' answer='yes'
# question='Can plants produce their own food?' answer='yes'
I am surprised it’s failing over such a small dataset.
Based on my own experience, I would expect you could achieve your goal with a single prompt and a single pass through the model, even if you are asking it to perform multiple tasks in that pass.
Are you providing it with examples (few-shot learning)?
You could have a dedicated agent or process find the start and end of the text. But again, from what you’ve listed above, with some effective prompting you should be able to achieve your goal.
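As a rough illustration of the few-shot idea (the example pairs and prompt wording below are made up, not tuned to your document):

import openai

client = openai.OpenAI()
doc_text = "..."  # the extracted Q&A section

# Illustrative few-shot setup: show the model one worked example of the
# filtering task before handing it the real document text.
example_input = "Q: Is the sky blue?\nA: Yes.\nQ: Do penguins live in the Arctic?\nA: No."
example_output = "Q: Is the sky blue? | A: Yes."

r = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Extract every Q&A pair whose answer is 'Yes'. Return all matches, not just some."},
        {"role": "user", "content": example_input},
        {"role": "assistant", "content": example_output},
        {"role": "user", "content": doc_text},
    ],
)
print(r.choices[0].message.content)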
I’m surprised too, to be honest. But I’ve tried so many prompt variations now that I don’t think it can be done (reliably). I’ve also tried extracting just the four pages that contain the answers.
If you’re already able to extract all the QAs (without criteria), then worst case you could work with that data instead of running multiple inferences over the full context. More than likely even that is not necessary, because it seems like you’re already getting 90% of what you need. To me, it seems like something that could be fixed with structured outputs and programmatic filtering, no? I’d be willing to bet that since you can already extract all QAs without defining a criterion, you can also extract those same QAs coupled with LLM annotations that can be used for filtering.