I’m trying to summarise large volumes of input text using completions to pick out key facts common to my input data.
I have PDF RFPs being sent to me in a variety of formats, and I want to pick out budgets, scope and key dates (submission deadline, project length, project completion date).
I’m parsing the PDFs and then summarising the text a paragraph at a time; however, this approach isn’t optimal since not all facts appear in all paragraphs.
Is there a preferred method, using chunking or something similar, to achieve what I want?
Edit - it’s similar to the approach I’ve been taking. I think the main difference here is that when summarising a novel you can sacrifice detail. When analysing certain documents, however - let’s say an RFP - you don’t want to lose certain details such as ‘Budget’ and ‘Deadlines’. I’m sure GPT-3 can handle this given the right approach. For any input shorter than the token limit, one shot is enough.
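For the one-shot case, this is roughly what I mean - a minimal sketch assuming the legacy `openai.Completion.create` endpoint that was current at the time; the engine name, prompt wording and field list are my own assumptions, not a tested recipe:

```python
import openai  # legacy Completions API (openai-python < 1.0)

PROMPT = """Extract the following fields from the RFP text below.
If a field is not mentioned, write "not stated".

Fields: budget, scope, submission deadline, project length, project completion date.

RFP text:
{text}

Extracted fields:"""

def extract_fields(rfp_text: str) -> str:
    # Works in a single call as long as prompt + text fit within the token limit.
    response = openai.Completion.create(
        engine="text-davinci-002",  # assumption: any instruct-style completions engine
        prompt=PROMPT.format(text=rfp_text),
        max_tokens=300,
        temperature=0,  # keep factual extraction deterministic
    )
    return response["choices"][0]["text"].strip()
```

Anything longer than the token limit would need to be chunked first and the per-chunk answers merged afterwards.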
@daveshapautomator have you experimented with more structured extraction of features?
Edit 2 - the textwrap lib seems like it can help with chunking and whitespace processing.
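A rough sketch of what I have in mind (textwrap is in the standard library, so no extra dependency; the chunk size is an arbitrary assumption on my part):

```python
import textwrap

def chunk_text(text: str, max_chars: int = 6000) -> list[str]:
    # textwrap.wrap replaces tabs/newlines with spaces and breaks on word
    # boundaries, so each chunk stays at or under max_chars without
    # splitting words in half.
    return textwrap.wrap(text, max_chars)

# Usage: each chunk of the parsed PDF text is then summarised/extracted separately.
chunks = chunk_text(open("parsed_rfp.txt").read())
```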
For anybody interested in this topic, I’ve also found a Python library (not reliant on OpenAI) which may provide additional insights.
I will review and summarise findings on this thread.
Edit: It seems that this library uses a different approach that didn’t fit the use case described above, although it has led me to the NLTK library, which may help with chunking. I will test this approach and report back.
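In case it saves anyone else the experiment, this is the kind of sentence-aware chunking I intend to test with NLTK (the character budget is an assumption; the `punkt` download is a one-off):

```python
import nltk
nltk.download("punkt")  # one-off download of the sentence tokenizer model

from nltk.tokenize import sent_tokenize

def chunk_by_sentence(text: str, max_chars: int = 6000) -> list[str]:
    """Group whole sentences into chunks so a budget figure or deadline
    is never cut in half at a chunk boundary."""
    chunks, current = [], ""
    for sentence in sent_tokenize(text):
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current.strip())
            current = ""
        current += sentence + " "
    if current.strip():
        chunks.append(current.strip())
    return chunks
```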
I was recently using the recursive summarizer for a legal doc. The person asked me to preserve certain data points, like stock price and amount, which were getting lost in the summarization. All I did was change the prompt to say, “preserve…”, which worked great. I wonder if your problem might have a similar solution?
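Something along these lines - the wording below is only illustrative, not the exact prompt I used:

```python
# Plain summarisation prompt (illustrative):
#   "Write a concise summary of the following passage:\n\n{passage}"
#
# Modified prompt that keeps the data points intact:
SUMMARY_PROMPT = (
    "Write a concise summary of the following passage. "
    "Preserve all figures exactly as stated, in particular stock prices, "
    "monetary amounts, and dates.\n\n{passage}\n\nSummary:"
)
```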
Yes, that’s a good shout. I have seen wildly different results depending on the prompt. I’ll keep trying with this. I’m also looking at NLTK for better string splitting.
But are you sure that chunking it up doesn’t take away from the “essence” of the full story? That’s what I’m concerned about in using the chunking method.
It would be nice if OpenAI released an example script showing how to replicate what they achieved summarizing the books from Project Gutenberg, as described in their article Summarizing Books with Human Feedback.