Splitting / Chunking Large Input Text for Summarisation (greater than 4096 tokens)

Hi,

I’m trying to summarise large inputs of text (more than 4096 tokens) using completions, to pick out key facts common to my input data.

I have PDF RFPs being sent to me in a variety of formats, and I want to pick out budgets, scope, and key dates (submission deadline, project length, project completion date).

I’m parsing the PDFs and then summarising the text a paragraph at a time; however, this approach isn’t optimal, since not all facts appear in every paragraph.

Is there a preferred method, using chunking or something similar, to achieve what I want?

Example Google Colab notebook here: Google Colab

Any advice on approach or code examples much appreciated.

The main reference I managed to find on this was here: Summarizing Books with Human Feedback

Regards,
Matthew

10 Likes

Check out @daveshapautomator’s GitHub. He has a great recursive summarizer that uses a Python module called textwrap for chunking documents.
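
For reference, here’s a minimal sketch of that style of chunking, assuming only the standard library (the chunk size and filename are placeholders):

```python
import textwrap

def chunk_text(text, chunk_chars=4000):
    # textwrap.wrap collapses runs of whitespace and breaks on whitespace,
    # so each chunk is at most chunk_chars characters and reads cleanly.
    return textwrap.wrap(text, chunk_chars)

chunks = chunk_text(open("document.txt").read())
for i, chunk in enumerate(chunks):
    print(f"chunk {i}: {len(chunk)} chars")
```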

7 Likes

Make sure to watch the accompanying video or it might not make much sense.

6 Likes

Amazing - checking it out now! Thank you.

Edit - it’s similar to the approach I’ve been taking. I think the main difference here is that when summarising a novel you can sacrifice detail. When analysing certain documents, however (let’s say an RFP), you don’t want to lose certain details such as ‘Budget’ and ‘Deadlines’. I’m sure GPT-3 can handle this given the right approach. For any input shorter than the token limit, a single pass is enough.

@daveshapautomator have you experimented with more structured extraction of features?

Edit 2 - the textwrap library seems like it can help with chunking and whitespace processing.

Edit 3 - @daveshapautomator I think some of what you talk about here will also help, in your example where you extract medical information and a prognosis: How to prevent Open AI from making up an answer

4 Likes

For anybody interested in this topic, I’ve also found a Python library (not reliant on OpenAI) which may provide additional insights.

I will review and summarise findings on this thread.

Edit: It seems that this library uses a different approach that didn’t fit the use case described above, although it has led me to the NLTK library, which may help with chunking. I will test this approach and report back.
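
For the curious, here’s roughly what I plan to test; a minimal sketch of sentence-level chunking with NLTK (the max_chars budget is an arbitrary placeholder):

```python
import nltk
nltk.download("punkt")  # one-off download of the sentence tokenizer data
from nltk.tokenize import sent_tokenize

def chunk_by_sentence(text, max_chars=4000):
    # Pack whole sentences into chunks that stay under max_chars,
    # so no sentence is ever split across a chunk boundary.
    chunks, current = [], ""
    for sentence in sent_tokenize(text):
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current.strip())
            current = ""
        current += " " + sentence
    if current.strip():
        chunks.append(current.strip())
    return chunks
```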

3 Likes

I was recently using the recursive summarizer for a legal doc. The person asked me to preserve certain data points, like stock price and amount, which were getting lost in the summarization. All I did was change the prompt to say, “preserve…”, which worked great. I wonder if your problem might have a similar solution?
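
As an illustration only (the model name and prompt wording are placeholders, using the openai Python package of the time), the change amounts to something like:

```python
import openai  # assumes OPENAI_API_KEY is set in the environment

def summarise(chunk):
    prompt = (
        "Summarise the following text. Preserve all stock prices, amounts, "
        "budgets, deadlines and other specific figures exactly as written:\n\n"
        f"{chunk}\n\nSummary:"
    )
    response = openai.Completion.create(
        model="text-davinci-003",  # placeholder; any completions model
        prompt=prompt,
        max_tokens=256,
        temperature=0,
    )
    return response.choices[0].text.strip()
```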

4 Likes

Yes, that’s a good shout. I have seen wildly different results depending on the prompt, so I’ll keep trying with this. I’m also looking at NLTK for better string splitting.

1 Like

But are you sure that chunking it up doesn’t take away from the “essence” of the full story? That’s what I’m concerned about in using the chunking method.

Long text in and of itself doesn’t tend to have a way of encoding past information in a sort of ‘long term memory’.

Nouns tend to be anchors to that end.

A single session of GPT-3 will only summarise the text it is given (e.g. the current paragraph or chunk being input).

If you wanted to provide a dense summary (single text) of a whole novel then an alternative approach is needed.

One approach might be to summarise (encode) in batches and then summarise the batches.
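
A minimal sketch of that batch-then-summarise idea, where summarise() is any wrapper around the completions endpoint (e.g. the one sketched earlier in this thread; the batch size is arbitrary):

```python
def summarise_in_batches(chunks, summarise, batch_size=8):
    # First pass: summarise each batch of chunks independently.
    batch_summaries = []
    for i in range(0, len(chunks), batch_size):
        batch = "\n\n".join(chunks[i : i + batch_size])
        batch_summaries.append(summarise(batch))
    # Second pass: summarise the batch summaries into one dense summary.
    # For very long inputs, repeat recursively until one summary remains.
    return summarise("\n\n".join(batch_summaries))
```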

Follow this link for an overview of the topic -
https://venturebeat.com/2021/09/23/openai-unveils-model-that-can-summarize-books-of-any-length/amp/

Good luck!

1 Like

It would be nice if OpenAI released an example script showing how to replicate what they achieved summarizing the books from Project Gutenberg in this article: Summarizing Books with Human Feedback

3 Likes

Look for daveshap’s recursive summarizer. It works great and produces results much like OpenAI’s experiment.

3 Likes

I found this video super helpful, and tried to build on it a little. I’ve been playing around with summarizing long pieces of text, and entire books. Sometimes the results are still not awesome summaries, no matter how I approach it.

Could I maybe approach this more efficiently with embeddings? If, for example, I fed an entire document into embeddings, could I turn around and ask for a summary? Maybe I’m completely misunderstanding embeddings here.

Hi @jdc2106

Sorry, no. The use case you describe above is not suitable for embeddings.

Embeddings are useful for (from the OpenAI docs):

  • Search (where results are ranked by relevance to a query string)
  • Clustering (where text strings are grouped by similarity)
  • Recommendations (where items with related text strings are recommended)
  • Anomaly detection (where outliers with little relatedness are identified)
  • Diversity measurement (where similarity distributions are analyzed)
  • Classification (where text strings are classified by their most similar label)

Hope this helps.

See also:

OpenAI Docs: Embeddings
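
To make the search use case concrete, here’s a minimal sketch with the openai Python package (the model name and toy documents are just placeholders):

```python
import numpy as np
import openai  # assumes OPENAI_API_KEY is set in the environment

def embed(texts):
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return [np.array(item["embedding"]) for item in resp["data"]]

docs = [
    "The budget for this project is $50,000.",
    "The submission deadline is 1 March.",
    "The cat sat on the mat.",
]
doc_vecs = embed(docs)
query_vec = embed(["What is the project budget?"])[0]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank documents by cosine similarity to the query.
for score, doc in sorted(((cosine(v, query_vec), d) for v, d in zip(doc_vecs, docs)), reverse=True):
    print(f"{score:.3f}  {doc}")
```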

3 Likes

I think a starting point for one approach is as follows:

Decide on a chunking strategy. For example:

Split the document into paragraphs (if those aren’t available, use sentences) – call them chunks.

Take the next N chunks such that the sum of their token counts is less than T.

T is the MAX_TOKEN_COUNT (4096, 8192 or whatever) minus the required summary output length, minus the length of the current CONTEXT_SUMMARY.

CONTEXT_SUMMARY is a buffer that you maintain separately: a summary of the context so far, so that the completions call always has access to it.

It’s not a great approach but this is where I would start.

E.g. let’s say you have a novel.

You would chunk it into paragraphs.

Take the first N paragraphs from Chapter 1. Summarise them and record the output O1.

Take the next N paragraphs, continuing Chapter 1. Summarise them and record the output O2.

At the end of Chapter 1, use O1, O2 … Oi to create a Chapter 1 summary, C1.

Then, for Chapter 2:

Take C1 plus the first M paragraphs from Chapter 2, summarise them, and record the output P1.

In this way completions has access to the summary so far.

You can even get creative with your prompt engineering to facilitate this.

“Given SUMMARY, which is a running summary of the context so far, summarise the following ”
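
As a rough sketch of the loop (token accounting omitted for brevity; summarise() is any wrapper around the completions endpoint):

```python
def summarise_with_context(chunks, summarise):
    # Walk the chunks in order, carrying a running CONTEXT_SUMMARY so each
    # completion sees a condensed version of everything that came before.
    context_summary = ""
    for chunk in chunks:
        prompt = (
            f"SUMMARY (running summary of the context so far): {context_summary}\n\n"
            "Given SUMMARY, summarise the following text, updating the "
            f"summary with any new information:\n\n{chunk}"
        )
        context_summary = summarise(prompt)
    return context_summary
```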

Good luck!

5 Likes

@matt_s This is really helpful, and it’s given me a new algo to think through. Thank you so much. I want to try an experiment to enhance the original method I started with from @daveshapautomator and mix in a bit of space for previous context. If I were to build a little on your idea:

  1. Divide up the text into proper chunks, say 0 . . . n.
  2. Have a “context” variable that you hold the previous summary in.
  3. As you summarize paragraph n+1, you introduce it as: “This is the summary so far: (summary of 0–n), and this is the next paragraph (n+1). Summarize this paragraph.”

Gotta think about this. It’s less efficient, but it might produce better results. In some ways, rather than having GPT read isolated chunks, it gets a certain amount of previous context for each chunk.

Maybe another way is to just take a human summary or description of the book, front load this, and describe where we are in the book: “This is a chunk 30% of the way through the book, which is about ‘short-human-description’; write a summary.”

Have you had success with this approach? Any more experience you care to share? I’m about to try implementing something along these lines.

No, I haven’t; my use case is a little different. There are all sorts of interesting techniques for various situations.

My suggestion would be to knock something crude together in Python or Node etc. and give it a try.

There might be a better approach with embeddings, but I’ve hardly explored embeddings at all.

What’s your ultimate goal with this?

There are a few, but they boil down to either creating an overall summary of, or an extraction of “key takeaways” from, bodies of text larger than the token limit. I’m trying to avoid important points or insights from the source text being “averaged out” via successive summarisation passes.

I’d love to get OpenAI to maintain a list of key takeaways for me, so I could give it that context and a new chunk of text and say “If a takeaway from this new chunk of text is already in the list, move it up the list; otherwise add it to the bottom”. Then, before I presented that context for the next new chunk of text, I could trim items from the bottom if I was approaching the token limit. I’m finding it difficult though, because “if/then” instructions don’t work very well at all, and I can’t get OpenAI to repeat the existing context back to me with its answers appended.
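
One way I’m considering to sidestep the unreliable if/then behaviour is to keep the list bookkeeping in ordinary code and only ask the model to extract takeaways from each new chunk; a rough sketch (the exact-match dedup is a naive stand-in, since the model rarely repeats a takeaway verbatim):

```python
def update_takeaways(takeaways, new_chunk, extract, max_items=20):
    # extract(text) should return a list of takeaway strings for one chunk;
    # the move-up / append / trim logic stays deterministic, in Python.
    for item in extract(new_chunk):
        if item in takeaways:          # naive: real matching would need fuzzy
            takeaways.remove(item)     # or embedding-based comparison
            takeaways.insert(0, item)  # already known: move it up the list
        else:
            takeaways.append(item)     # new: add it to the bottom
    return takeaways[:max_items]       # trim to stay under the token budget
```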

(As an aside, I’m a software engineer and I find all of this nondeterministic coaxing a very frustrating way of getting computers to do things.)

1 Like

Yes, the way that “programming” is shifting is definitely different to classical deterministic programming.

I’ve come across another package that may or may not be helpful for you.

https://hwchase17.github.io/langchainjs/docs/modules/agents/agent_toolkits/vectorstore

This is an example from LangChain, which is a toolkit for combining LLMs with other workflows. In this example a large document is fed in, chunked, and then vectorised for querying. You might be able to use this approach to create a global summary of the document (I don’t see why not)…

1 Like

This is also relevant… https://hwchase17.github.io/langchainjs/docs/modules/chains/summarization/
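
For those on the Python side, the equivalent looks something like this (a sketch against an early LangChain release; class names may shift between versions, and the chunk sizes and filename are placeholders):

```python
from langchain.llms import OpenAI
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document

llm = OpenAI(temperature=0)  # assumes OPENAI_API_KEY is set

splitter = CharacterTextSplitter(chunk_size=3000, chunk_overlap=200)
docs = [Document(page_content=t)
        for t in splitter.split_text(open("document.txt").read())]

# "map_reduce" summarises each chunk, then summarises the summaries.
chain = load_summarize_chain(llm, chain_type="map_reduce")
print(chain.run(docs))
```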

1 Like