Splitting / chunking large input text for summarisation (greater than 4096 tokens)

Hi,

I’m trying to summarise large amounts of input text, well beyond the token limit, using the completions endpoint to pick out key facts common to my input data.

I have RFPs sent to me as PDFs in a variety of formats, and I want to pick out budgets, scope and key dates (submission deadline, project length, project completion date).

I’m parsing the PDFs and then summarising the text a paragraph at a time; however, this approach isn’t optimal, since not all of the facts appear in every paragraph.
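
Roughly what I’m doing at the moment (a simplified sketch, not my exact notebook code; it assumes the pre-1.0 openai Python client with an API key already configured, and the engine name and prompt wording are just illustrative):

```python
# Simplified sketch of the current paragraph-at-a-time approach.
# Assumes `pdf_text` already holds the text extracted from the PDF.
import openai

def summarise_paragraph(paragraph):
    prompt = (
        "Summarise the following RFP text, keeping any budgets, scope "
        "statements and key dates (submission deadline, project length, "
        "completion date):\n\n" + paragraph + "\n\nSummary:"
    )
    response = openai.Completion.create(
        engine="text-davinci-002",  # illustrative engine name
        prompt=prompt,
        max_tokens=200,
        temperature=0.2,
    )
    return response["choices"][0]["text"].strip()

# Treat blank lines as paragraph boundaries and summarise each one separately.
paragraphs = [p.strip() for p in pdf_text.split("\n\n") if p.strip()]
summaries = [summarise_paragraph(p) for p in paragraphs]
```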

Is there a preferred method, using chunking or something similar, to achieve what I want?

Example Google Colab notebook here: Google Colab

Any advice on approach or code examples much appreciated.

The main reference I managed to find on this was here: Summarizing Books with Human Feedback

Regards,
Matthew

Check out @daveshapautomator’s GitHub. He has a great recursive summarizer that uses a Python module called textwrap for chunking documents.
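
The chunking part is essentially just this (a minimal sketch, not his exact code; the 4000-character width and the file name are arbitrary examples):

```python
# Minimal sketch of chunking with textwrap.
# textwrap.wrap splits the text on whitespace and rejoins it, so it also
# normalises whitespace while cutting the text into pieces of at most
# `width` characters.
import textwrap

def chunk_text(text, width=4000):
    return textwrap.wrap(text, width=width)

chunks = chunk_text(open("document.txt").read())  # "document.txt" is a placeholder
```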

Make sure to watch the accompanying video or it might not make much sense.

Amazing - checking it out now! Thank you.

Edit - it’s similar to the approach I’ve been taking. I think the main difference is that when summarising a novel you can sacrifice detail. When analysing certain documents, however - let’s say an RFP - you don’t want to lose specific details such as ‘Budget’ and ‘Deadlines’. I’m sure GPT-3 can handle this given the right approach. For any input shorter than the token limit, one shot is enough.

@daveshapautomator have you experimented with more structured extraction of features?
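
Something like this is what I mean by structured extraction (a sketch; the field names, engine name and prompt wording are just illustrative, and in practice the JSON parse would need error handling):

```python
# Sketch: ask for specific fields as JSON instead of a free-form summary.
import json
import openai

EXTRACTION_PROMPT = """Extract the following fields from the RFP text below.
Return valid JSON with the keys: budget, scope, submission_deadline,
project_length, project_completion_date. Use null for anything not stated.

Text:
{chunk}

JSON:"""

def extract_fields(chunk):
    response = openai.Completion.create(
        engine="text-davinci-002",  # illustrative engine name
        prompt=EXTRACTION_PROMPT.format(chunk=chunk),
        max_tokens=300,
        temperature=0.0,
    )
    # A real implementation would wrap this in try/except for malformed JSON.
    return json.loads(response["choices"][0]["text"])
```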

Edit 2 - the textwrap library looks like it can help with both chunking and whitespace processing.

Edit 3 - @daveshapautomator I think some of what you talk about in your example where you extract medical information and a prognosis will also help: How to prevent Open AI from making up an answer

For anybody interested in this topic, I’ve also found a Python library (not reliant on OpenAI) that may provide additional insights.

I will review it and summarise my findings in this thread.

Edit: it turns out this library uses a different approach that doesn’t fit the use case described above, although it has led me to the NLTK library, which may help with chunking. I will test this approach and report back.
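
Something along these lines (a rough sketch; measuring chunk size in characters is just for illustration):

```python
# Rough sketch: sentence-aware chunking with NLTK so a chunk never cuts a
# sentence in half. Chunk size is measured in characters for simplicity.
import nltk

nltk.download("punkt")  # one-off download of the sentence tokenizer data

def chunk_by_sentence(text, max_chars=4000):
    chunks, current = [], ""
    for sentence in nltk.sent_tokenize(text):
        # Start a new chunk once adding this sentence would exceed the limit.
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current.strip())
            current = ""
        current += sentence + " "
    if current.strip():
        chunks.append(current.strip())
    return chunks
```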

I was recently using the recursive summarizer for a legal doc. The person asked me to preserve certain data points, like stock price and amount, which were getting lost in the summarization. All I did was change the prompt to say, “preserve…”, which worked great. I wonder if your problem might have a similar solution?
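
Roughly like this (the wording below is an illustration, not the exact prompt I used):

```python
# Illustrative prompt tweak: tell the model explicitly what to preserve.
PROMPT = """Write a concise summary of the following text.
Preserve all figures, amounts, dates and deadlines exactly as they appear.

{chunk}

Summary:"""
```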

Yes, that’s a good shout. I have seen wildly different results depending on the prompt. I’ll keep trying with this. I’m also looking at NLTK for better string splitting.

But are you sure that chunking it up doesn’t take away from the ‘essence’ of the full story? That’s what I’m concerned about with the chunking method.

Long text in and of itself doesn’t have a way of encoding earlier information in a sort of ‘long-term memory’.

Nouns tend to be anchors to that end.

A single GPT-3 call will only summarise the text it is given (e.g. the current paragraph or chunk being input).

If you want a dense summary (a single text) of a whole novel, then an alternative approach is needed.

One approach might be to summarise (encode) in batches and then summarise the batches.
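
In rough Python terms, the idea looks something like this (a sketch; `chunk_text` and `summarise` stand in for whatever chunking and completion calls you already use):

```python
# Sketch of "summarise the batches": summarise each chunk, then summarise the
# concatenation of those summaries, recursing while the result is still too long.
def recursive_summary(text, max_chars=4000):
    if len(text) <= max_chars:
        return summarise(text)                     # assumed single-call summariser
    chunks = chunk_text(text, max_chars)           # assumed chunking helper
    summaries = [summarise(c) for c in chunks]
    combined = "\n".join(summaries)
    if len(combined) > max_chars:
        return recursive_summary(combined, max_chars)
    return summarise(combined)
```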

Follow this link for an overview of the topic -
https://venturebeat.com/2021/09/23/openai-unveils-model-that-can-summarize-books-of-any-length/amp/

Good luck!

It would be nice if OpenAI released an example script showing how to replicate what they achieved summarizing the books from Project Gutenberg in this article: Summarizing Books with Human Feedback

Look for daveshap’s recursive summarizer. It works great and produces very similar results to OpenAI’s experiment.
