I am using GPT-4 Turbo to extract quotations from a large document. My goal is to find all quotations that match criteria specified in my prompt. This works well for small documents, but once the document grows beyond 15k tokens, I find that many of the quotations I feel should have been included are missing. Roughly twice as many quotes come from the first third of the document as from the other parts, and very few come from the middle third.
Some tentative theories of possible causes:
Perhaps the LLM tends to start finding quotes from the beginning of the document and at some point feels it has output “enough” quotes. The number of output tokens is typically less than 1k, so I think it should have had plenty of room to continue. Perhaps, in training, it is unusual to see lists with more than about a dozen items, and the LLM simply stops once it has output a typical number of items. I have tried various prompts instructing it not to limit the length of its output and to keep looking until it finds all quotes matching the criteria, but this only had a small effect.
Perhaps the optimizations that enlarged the context window limit the LLM’s ability to examine a large window as thoroughly as it can examine a smaller one. Or perhaps it is optimized to answer a single question by finding a “needle in a haystack” in the large context, but doesn’t have enough “attention” to find all the needles in the haystack.
I am currently working around this by using smaller chunks, but with smaller chunks I tend to have the opposite problem: quotations are included that I would prefer to omit. Perhaps with a smaller chunk the LLM feels it should include some minimum number of items and relaxes the criteria for what it extracts. I have tried to give objective criteria for what should and should not be included, but my domain is inherently somewhat subjective.
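For reference, here is a rough sketch of the chunking workaround I’m describing. The chunk size, overlap, prompt wording, and model name are only placeholders to show the shape of the approach, not what I’ve settled on:

```python
# Sketch: split the document into overlapping token chunks, extract quotes
# per chunk, then deduplicate across chunks. All parameters are illustrative.
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4 Turbo

def chunk_tokens(text, chunk_size=4000, overlap=500):
    """Yield overlapping chunks of roughly chunk_size tokens."""
    tokens = enc.encode(text)
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        yield enc.decode(tokens[start:start + chunk_size])

def extract_quotes(document, criteria):
    quotes = []
    for chunk in chunk_tokens(document):
        resp = client.chat.completions.create(
            model="gpt-4-turbo-2024-04-09",
            temperature=0,
            messages=[
                {"role": "system",
                 "content": "Return every quotation matching the criteria, one per line."},
                {"role": "user",
                 "content": f"Criteria: {criteria}\n\nText:\n{chunk}"},
            ],
        )
        quotes.extend(line.strip()
                      for line in resp.choices[0].message.content.splitlines()
                      if line.strip())
    # Overlapping chunks can return the same quote twice; keep first occurrence.
    seen, unique = set(), []
    for q in quotes:
        if q not in seen:
            seen.add(q)
            unique.append(q)
    return unique
```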
Unfortunately, this seems to be worse with the latest GPT-4 Turbo model. I started having this problem at 15k tokens with gpt-4-1106-preview, but with gpt-4-turbo-2024-04-09 the problem starts happening at 8k tokens. I haven’t been able to find a solution that matches the quality I was achieving with the earlier model.
Does anyone have more solid theories for why this would happen or tips on how to overcome this limitation? Thanks!
This is the well-known “needle in a haystack” problem: the beginning and end of a prompt seem to be “understood” better than the middle. So breaking the document into smaller chunks is a good idea. You could also ask the LLM to generate a score from 1 to 100 for how likely each paragraph is to match your criteria, and then make the cutoff yourself. That might make it more “honest”, or at least more consistent, about how it evaluates the criteria.
I also wonder whether lowering the temperature would give better results for this use case. Not sure.
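Something along these lines, where the prompt wording, model name, cutoff, and temperature setting are only placeholders to illustrate the idea:

```python
# Sketch of the per-paragraph scoring idea: ask the model for a 1-100
# relevance score per paragraph, then apply your own cutoff afterwards.
from openai import OpenAI

client = OpenAI()

def score_paragraph(paragraph, criteria):
    resp = client.chat.completions.create(
        model="gpt-4-turbo-2024-04-09",
        temperature=0,  # lower temperature for more consistent scoring
        messages=[
            {"role": "system",
             "content": "Rate from 1 to 100 how well the paragraph matches the "
                        "criteria. Reply with the number only."},
            {"role": "user",
             "content": f"Criteria: {criteria}\n\nParagraph:\n{paragraph}"},
        ],
    )
    # A real implementation would parse this more defensively.
    return int(resp.choices[0].message.content.strip())

def select_paragraphs(paragraphs, criteria, cutoff=70):
    """Keep only paragraphs scoring at or above the cutoff you choose."""
    return [p for p in paragraphs if score_paragraph(p, criteria) >= cutoff]
```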
Our experience with the ChatGPT-4 web version is that, for long extracts, it does not analyse the content in its entirety no matter how we engineer the prompt. My feeling is that it loses interest, or takes a less precise, less analytical approach.
We’ve used ChatGPT-4 consistently for over a year for abstract analysis, reviewing academic papers, acting as a French tutor, answering questions about specific books and authors, writing web content, writing code, etc.
My perception is that it is not performing as it used to and that it creates more generalised verbiage than precise answers.
I have a lot of experience engineering prompts to fine-tune, improve, redirect, and so on.
This is just my perception; maybe I’m expecting more and more from it now that we’ve got over our initial WOW.
Yeah, it’s probably a positional bias and a bias towards returning a limited number of items (e.g., 10-15).
For example, if you ask the LLM to write bullet points summarizing a meeting transcript, it will almost always give you around 10, regardless of whether it’s a 20-minute meeting or a 2-hour meeting.
A surprising weakness I’ve seen is extracting all proper nouns. It’s pretty easy for humans, but even the best LLM seems to miss quite a few.
I agree about breaking the text down into smaller chunks. I review hundreds of reviews and other online content using GPT-4 and find it both more accurate and faster to do so.
I’m using GPT-4o to extract a list of all quotes from a long document that match certain criteria, and I’m having the same issue. No matter how many quotes should actually be extracted (sometimes over 100), it always seems to return about 20.