I am using GPT-4 Turbo to extract quotations from a large document. My goal is to find all quotations that match criteria specified in my prompt. This works well for small documents, but once the document grows beyond 15k tokens, I find that many of the quotations I feel should have been included are missing. Roughly twice as many quotes come from the first third of the document as from the other parts, and very few come from the middle third.
Some tentative theories of possible causes:
Perhaps the LLM tends to start finding quotes from the beginning of the document and at some point feels it has output “enough” quotes. The number of output tokens is typically less than 1k, so I think it should have had plenty of room to continue. Perhaps, in training, it is unusual to see lists with more than about a dozen items, and the LLM simply stops once it has output a typical number of items. I have tried various prompts instructing it not to limit the length of its output and to keep looking until it finds all quotes matching the criteria, but this only had a small effect.
Perhaps the optimizations that enlarged the context window limit the LLM’s ability to examine a large window as thoroughly as it can examine a smaller one. Or perhaps it is optimized to answer a single question by finding a “needle in a haystack” in the large context, but doesn’t have enough “attention” to find all the needles in the haystack.
I am currently working around this by using smaller chunks, but with smaller chunks I tend to have the opposite problem: quotations are included that I would prefer to omit. Perhaps with a smaller chunk the LLM feels it should include some minimum number of items and relaxes the criteria for what it extracts. I have tried to give objective criteria for what should and should not be included, but my domain is inherently somewhat subjective.
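For reference, here is a rough sketch of the chunking workaround I’m describing. The chunk size, overlap, prompt wording, and model name are only placeholders to show the shape of the approach, not what I’ve settled on:

```python
# Sketch: split the document into overlapping token chunks, extract quotes
# per chunk, then deduplicate across chunks. All parameters are illustrative.
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4 Turbo

def chunk_tokens(text, chunk_size=4000, overlap=500):
    """Yield overlapping chunks of roughly chunk_size tokens."""
    tokens = enc.encode(text)
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        yield enc.decode(tokens[start:start + chunk_size])

def extract_quotes(document, criteria):
    quotes = []
    for chunk in chunk_tokens(document):
        resp = client.chat.completions.create(
            model="gpt-4-turbo-2024-04-09",
            temperature=0,
            messages=[
                {"role": "system",
                 "content": "Return every quotation matching the criteria, one per line."},
                {"role": "user",
                 "content": f"Criteria: {criteria}\n\nText:\n{chunk}"},
            ],
        )
        quotes.extend(line.strip()
                      for line in resp.choices[0].message.content.splitlines()
                      if line.strip())
    # Overlapping chunks can return the same quote twice; keep first occurrence.
    seen, unique = set(), []
    for q in quotes:
        if q not in seen:
            seen.add(q)
            unique.append(q)
    return unique
```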
Unfortunately, this seems to be worse with the latest GPT-4 Turbo model. I started having this problem at 15k tokens with gpt-4-1106-preview, but with gpt-4-turbo-2024-04-09 the problem starts happening at 8k tokens. I haven’t been able to find a solution that matches the quality I was achieving with the earlier model.
Does anyone have more solid theories for why this would happen or tips on how to overcome this limitation? Thanks!
This is the well-known “needle in a haystack” problem: the beginning and end of a prompt seem to be “understood” better than the middle. So breaking the document into smaller chunks is a good idea. You could also ask the LLM to generate a score from 1 to 100 for how likely each paragraph is to match your criteria, and then make the cutoff yourself. That might make it more “honest”, or at least more consistent, about how it evaluates the criteria.
I also wonder whether lowering the temperature would give better results for this use case. Not sure.
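Something along these lines, where the prompt wording, model name, cutoff, and temperature setting are only placeholders to illustrate the idea:

```python
# Sketch of the per-paragraph scoring idea: ask the model for a 1-100
# relevance score per paragraph, then apply your own cutoff afterwards.
from openai import OpenAI

client = OpenAI()

def score_paragraph(paragraph, criteria):
    resp = client.chat.completions.create(
        model="gpt-4-turbo-2024-04-09",
        temperature=0,  # lower temperature for more consistent scoring
        messages=[
            {"role": "system",
             "content": "Rate from 1 to 100 how well the paragraph matches the "
                        "criteria. Reply with the number only."},
            {"role": "user",
             "content": f"Criteria: {criteria}\n\nParagraph:\n{paragraph}"},
        ],
    )
    # A real implementation would parse this more defensively.
    return int(resp.choices[0].message.content.strip())

def select_paragraphs(paragraphs, criteria, cutoff=70):
    """Keep only paragraphs scoring at or above the cutoff you choose."""
    return [p for p in paragraphs if score_paragraph(p, criteria) >= cutoff]
```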
Our experience with the ChatGPT-4 web version is that, for long extracts, it does not analyse the content in its entirety no matter how we engineer the prompt. My feeling is that it loses interest, or takes a less precise, less analytical approach.
We’ve used ChatGPT-4 consistently for over a year for abstract analysis, reviewing academic papers, acting as a French tutor, answering questions about specific books and authors, writing web content, writing code, etc.
My perception is that it is not performing as it used to and that it creates more generalised verbiage than precise answers.
I have a lot of experience engineering prompts to fine-tune, improve, redirect, and so on.
This is just my perception; maybe I’m expecting more and more from it now that we’ve got over our initial WOW.
Yeah, it’s probably a positional bias and a bias towards returning a limited number of items (e.g., 10-15).
For example, if you ask the LLM to write bullet points summarizing a meeting transcript, it will almost always give you around 10, regardless of whether it’s a 20-minute meeting or a 2-hour meeting.
A surprising weakness I’ve seen is extracting all proper nouns. It’s pretty easy for humans, but even the best LLM seems to miss quite a few.
I agree about breaking the text down into smaller chunks. I review hundreds of reviews and other online content using GPT-4 and find it both more accurate and faster to do so.
I’m using GPT-4o to extract a list of all quotes from a long document that match certain criteria, and I’m having the same issue. No matter how many quotes should actually be extracted (sometimes over 100), it always seems to return about 20.