I have a document (~30 pages) which contains a table of questions and answers stretching over four pages.
The content is extracted from a PDF. Not perfect, but quite good. Tables are well-formatted etc.
I want the LLM (4o or 4o-mini) to extract the Q&As which satisfy certain criteria. However, I simply can’t get it to return all the Q&As which satisfy the criteria. It usually returns the first few and last few.
When I ask it to return all (without the criteria), it returns the correct list — i.e. it knows that all the Q&As are part of the same section.
The criterion is simple (answer == ‘Yes’).
Has anyone experienced something like this? If so, did you find a way to solve it?
(PS: I’d prefer the LLM to do the work rather than, e.g., a SQL query.)
This is not an uncommon problem for documents of this size.
One way to overcome this is to chunk the text, feed the model the chunks individually with the request to extract the Q&As that meet your criteria, and then concatenate the results.
It costs a little extra because you have to re-send the prompt with every chunk, but if the prompt is not too large these extra costs should be marginal.
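For illustration, here is a minimal sketch of that chunk-and-concatenate loop. The chunking helper, prompt wording, and variable names are placeholders, not a prescribed implementation:

import openai

client = openai.OpenAI()

# Illustrative only: split the extracted document into slightly overlapping
# chunks (here by paragraph count) so each call sees a manageable slice.
def chunk_text(text: str, paras_per_chunk: int = 20, overlap: int = 2) -> list[str]:
    paras = text.split("\n\n")
    step = paras_per_chunk - overlap
    return ["\n\n".join(paras[i : i + paras_per_chunk]) for i in range(0, len(paras), step)]

def extract_yes_qas(chunk: str) -> str:
    r = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Extract every Q&A pair whose answer is 'Yes'. Return one pair per line; return nothing if there are none."},
            {"role": "user", "content": chunk},
        ],
    )
    return r.choices[0].message.content or ""

doc_text = "..."  # your extracted PDF text goes here
results = [extract_yes_qas(c) for c in chunk_text(doc_text)]
all_yes_qas = "\n".join(r for r in results if r.strip())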
@jr.2509 's answer is on point. Another thing you can try is to structure it into two phases: (1) question extraction, (2) answer extraction.
So first collect all the questions in one call, and then in one or more calls extract the answers for those questions. You might be able to get away with just two calls, or alternatively you can “batch” the questions (e.g. take 3-4 at once).
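A rough sketch of that two-phase flow (the prompts and batch size here are just placeholders, adjust to your document):

import openai

client = openai.OpenAI()
doc_text = "..."  # the Q&A section extracted from the PDF

# Phase 1: one call that only lists the questions.
q_resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "List every question in this Q&A table, one per line. Do not include the answers."},
        {"role": "user", "content": doc_text},
    ],
)
questions = [q for q in q_resp.choices[0].message.content.splitlines() if q.strip()]

# Phase 2: extract answers in small batches (e.g. 3-4 questions per call).
answers = []
for i in range(0, len(questions), 4):
    batch = questions[i : i + 4]
    a_resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "For each question below, return 'question -> answer' exactly as answered in the document."},
            {"role": "user", "content": doc_text + "\n\nQuestions:\n" + "\n".join(batch)},
        ],
    )
    answers.append(a_resp.choices[0].message.content)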
Agents are, in the most naive sense, just specialized prompts, so the same rules apply there - some kind of chunking or phased approach. E.g. one agent acts as a “question extractor” based on the criteria, and sends the questions to another agent that acts as the “answer extractor”.
Whether you use agents or just standard ChatCompletion calls, one rule of thumb I recommend is to stick to a single “instance” per call. What I mean is: if you are, say, extracting information about cities, you can extract multiple attributes (location, size, topology, etc.), as long as you are dealing with a specific “instance” (in this case, a specific city). As soon as you start extracting attributes for multiple instances, things go south, and the problem scales with the amount of data the LLM has to parse. And this maps to your question/answer problem as well.
It might be possible to overcome this with some clever prompts (maybe others have tips), but this is one rule that hasn’t let me down so far, including with other LLMs and even open-weight ones.
I’m not sure I understand your point regarding sticking to a single instance per call.
What would it look like to apply this rule to OP’s problem? In other words, what is the “instance” in this case, and what would it look like to stick to a single instance per call?
Would it be making repeated LLM calls, specifically asking for one question-answer pair per prompt, and then combining the results?
It’s well known that LLMs need to see themselves think. Based on your description, it seems like there’s some implied logic that the LLM must perform that is precluded from the output. For example, if you tell the model to think about something and only output the results – that “thinking” doesn’t really happen since the model didn’t have the opportunity to output it. This can result in hallucinations and inaccuracies, such as what you are experiencing. Since the LLM is capable of extracting all the QAs from the text, the next step is to use the LLM to annotate the questions with the answer so you can filter them out programmatically. Here’s a quick and dirty…
import pydantic
import typing

import openai

client = openai.OpenAI()

src_qa_data = """\
Q1: Is the sky blue?
A1: Yes.
Q2: Can humans breathe underwater without assistance?
A2: No.
Q3: Do penguins live in the Arctic?
A3: No.
Q4: Is water made up of hydrogen and oxygen?
A4: Yes.
Q5: Can plants produce their own food?
A5: Yes.
"""


class QAItem(pydantic.BaseModel):
    question: str
    answer: typing.Literal["yes", "no", "other"]


class QAExtractor(pydantic.BaseModel):
    """Use this tool to extract ALL of the QA items from the source text."""

    questions_answers: list[QAItem] = pydantic.Field(
        description="ALWAYS Gather ALL QA items from the source text."
    )
    is_all_qa_included: bool = pydantic.Field(
        description="Have all QA items been included in the `questions_answers` from the source text?"
    )


# Structured-output call: the model annotates every QA pair with a normalized answer.
r = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    response_format=QAExtractor,
    messages=[
        {"role": "user", "content": src_qa_data},
    ],
)

qa_items = r.choices[0].message.parsed

# Filter programmatically instead of asking the model to filter.
if qa_items.is_all_qa_included:
    for qa in qa_items.questions_answers:
        if qa.answer == "yes":
            print(qa)

# question='Is the sky blue?' answer='yes'
# question='Is water made up of hydrogen and oxygen?' answer='yes'
# question='Can plants produce their own food?' answer='yes'
I am surprised it’s failing over such a small dataset.
Based on my own experience, I would expect you could achieve your goal with a single prompt and a single pass through the model, even if you are asking it to perform multiple tasks in that pass.
Are you providing it with examples (few-shot learning)?
You could have a dedicated agent or process find the start and end of the text. But again, from what you’ve listed above, with some effective prompting you should be able to achieve your goal.
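As a rough illustration of the few-shot idea (the example pairs and prompt wording below are made up, not tuned to your document):

import openai

client = openai.OpenAI()
doc_text = "..."  # the extracted Q&A section

# Illustrative few-shot setup: show the model one worked example of the
# filtering task before handing it the real document text.
example_input = "Q: Is the sky blue?\nA: Yes.\nQ: Do penguins live in the Arctic?\nA: No."
example_output = "Q: Is the sky blue? | A: Yes."

r = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Extract every Q&A pair whose answer is 'Yes'. Return all matches, not just some."},
        {"role": "user", "content": example_input},
        {"role": "assistant", "content": example_output},
        {"role": "user", "content": doc_text},
    ],
)
print(r.choices[0].message.content)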
I’m surprised too, to be honest. But I’ve tried so many prompt variations now that I don’t think it can be done (reliably). I’ve also tried extracting just the four pages that contain the answers.
If you’re already able to extract all the QAs (without criteria), then worst case you could work with that data instead of running multiple inferences over the full context. More than likely even that is not necessary, because it seems like you’re already getting 90% of what you need. To me, it seems like something that could be fixed with structured outputs and programmatic filtering, no? I’d be willing to bet that since you can already extract all QAs without defining a criterion, you can also extract those same QAs coupled with LLM annotations that can be used for filtering.