For anybody interested in this topic, I’ve also found a Python library (not reliant on OpenAI) which may provide additional insights.

I will review and summarise findings on this thread.

Edit: It seems that this library uses a different approach that doesn’t fit the use case described above, although it has led me to the NLTK library, which may help with chunking. I will test this approach and report back.

I was recently using the recursive summarizer for a legal doc. The person asked me to preserve certain data points like stock price and amount, which was getting lost in the summarization. All I did was change the prompt to say, “preserve…”, which worked great. I wonder if your problem might have a similar solution?
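Roughly, the change looked like this (wording from memory, and `chunk` is just a placeholder for the current text):

```python
chunk = "..."  # the current chunk of the legal document

# The only change: an explicit "preserve" instruction in the prompt.
prompt = (
    "Summarise the following excerpt. Preserve specific data points "
    "such as stock prices, monetary amounts, dates and party names "
    "exactly as written.\n\n"
    f"{chunk}\n\nSummary:"
)
```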

Yes that’s a good shout. I have seen wildly different results depending on the prompt. I’ll keep trying with this. I’m also looking at NLTK for better string splitting.
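For example, something like this with NLTK’s sentence tokenizer (a rough sketch; the character budget is an arbitrary stand-in for a proper token count):

```python
import nltk

nltk.download("punkt")  # one-time download of the sentence tokenizer model
from nltk.tokenize import sent_tokenize

def chunk_text(text, max_chars=1500):
    """Split text into sentences, then pack sentences into chunks."""
    chunks, current = [], ""
    for sentence in sent_tokenize(text):
        if len(current) + len(sentence) > max_chars and current:
            chunks.append(current.strip())
            current = ""
        current += " " + sentence
    if current.strip():
        chunks.append(current.strip())
    return chunks
```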

But are you sure that chunking it up doesn’t take away from the “essence” of the full story? That’s what I’m concerned about with the chunking method.

Long text, in and of itself, doesn’t have a way of encoding past information in a sort of ‘long-term memory’.

Nouns tend to be anchors to that end.

A single session of GPT-3 will only summarise the text that is currently being input (e.g. the current paragraph or chunk).

If you wanted to provide a dense summary (single text) of a whole novel then an alternative approach is needed.

One approach might be to summarise (encode) in batches and then summarise the batches.
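A rough sketch of that two-level idea, where `summarise()` is a stand-in for whatever completion call you use (names and structure are mine, not a particular library’s):

```python
# "Summarise in batches, then summarise the batch summaries."
def summarise(text: str) -> str:
    raise NotImplementedError  # call your model (GPT-3 etc.) here

def summarise_document(batches: list[str]) -> str:
    batch_summaries = [summarise(batch) for batch in batches]
    combined = "\n".join(batch_summaries)
    # If the combined summaries still exceed the context window,
    # repeat the process on them before this final pass.
    return summarise(combined)
```

For a whole novel you may need more than two levels of this before the final pass fits in the context window.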

Follow this link for an overview of the topic -
https://venturebeat.com/2021/09/23/openai-unveils-model-that-can-summarize-books-of-any-length/amp/

Good luck!

It would be nice if OpenAI released an example script showing how to replicate what they achieved summarizing books from Project Gutenberg, as shown in this article: Summarizing Books with Human Feedback

Look for daveshap’s recursive summarizer. It works great and produces the same results as OpenAI’s experiment.

I found this video super helpful, and tried to build on it a little. I’ve been playing around with summarizing long pieces of text, and entire books. Sometimes the results are still not awesome summaries, no matter how I approach it.

Could I maybe approach this more efficiently with embeddings? If, for example, I fed an entire document into embeddings, could I turn around and ask for a summary? Maybe I’m completely misunderstanding embeddings here.

Hi @jdc2106

Sorry, no. The use case you describe above is not suitable for embeddings.

Embeddings are useful for (from the OpenAI docs):

  • Search (where results are ranked by relevance to a query string)
  • Clustering (where text strings are grouped by similarity)
  • Recommendations (where items with related text strings are recommended)
  • Anomaly detection (where outliers with little relatedness are identified)
  • Diversity measurement (where similarity distributions are analyzed)
  • Classification (where text strings are classified by their most similar label)

Hope this helps.

See also:

OpenAI Docs: Embeddings

I think a starting point for one approach is as follows:

Decide on a chunking strategy. For example:

Split the document into paragraphs (or sentences, if paragraphs aren’t available) – call them chunks.

Take the next N chunks such that the sum of their token counts is less than T.

T is MAX_TOKEN_COUNT (4096, 8192 or whatever), minus the required summary output length, minus the length of the current CONTEXT_SUMMARY.

CONTEXT_SUMMARY is a buffer that you maintain separately, containing a summary of the context so far, so that the GPT completion has access to it.
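A minimal sketch of the packing step, using tiktoken for the counting (the encoding name and the reserved output size below are assumptions, not fixed choices):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding choice is an assumption

MAX_TOKEN_COUNT = 4096
SUMMARY_OUTPUT_TOKENS = 500  # reserve for the completion itself

def pack_chunks(chunks: list[str], context_summary: str) -> list[str]:
    """Take the next N chunks whose combined token count stays under T."""
    budget = (MAX_TOKEN_COUNT - SUMMARY_OUTPUT_TOKENS
              - len(enc.encode(context_summary)))
    batch, used = [], 0
    for chunk in chunks:
        n = len(enc.encode(chunk))
        if used + n > budget:
            break  # this chunk starts the next batch
        batch.append(chunk)
        used += n
    return batch  # the caller loops, dropping packed chunks each pass
```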

It’s not a great approach but this is where I would start.

E.g. let’s say you have a novel.

You would chunk it into paragraphs.

Take the first N paragraphs from Chapter 1. Summarise them and record the output O1.

Take the next N paragraphs, continuing Chapter 1. Summarise them and record the output O2.

At the end of Chapter 1, use O1, O2… Oi to create a Chapter 1 summary C1.

Then, for Chapter 2:

Take C1 plus the first M paragraphs from Chapter 2, summarise them, and record the output P1.

In this way completions has access to the summary so far.

You can even get creative with your prompt engineering to facilitate this.

“Given SUMMARY, which is a running summary of the context so far, summarise the following…”
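Assembled in code, that might look like this (placeholder values are illustrative):

```python
context_summary = "..."  # the running CONTEXT_SUMMARY maintained so far
batch_text = "..."       # the next N chunks, joined together

prompt = (
    "Given SUMMARY, which is a running summary of the context so far, "
    "summarise the following.\n\n"
    f"SUMMARY:\n{context_summary}\n\n"
    f"TEXT:\n{batch_text}"
)
```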

Good luck!

@matt_s This is really helpful, and it’s given me a new algo to think through. Thank you so much. I want to try an experiment to enhance the original method I started with from @daveshapautomator and mix in a bit of space for previous context. If I were to build a little on your idea:

  1. Divide up the text into proper chunks, say 0 . . . n.
  2. Have a “context” variable that you hold the previous summary in.
  3. As you summarize paragraph n+1, you introduce it as: “This is the summary so far: (summary of 0–n), and this is the next paragraph (n+1). Summarize this paragraph.”
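Roughly, in code (a sketch only; `summarise()` stands in for the actual completion call):

```python
def summarise(prompt: str) -> str:
    raise NotImplementedError  # call your model of choice here

def rolling_summary(chunks: list[str]) -> str:
    context = ""  # step 2: running summary of everything seen so far
    for chunk in chunks:
        prompt = (
            f"This is the summary so far:\n{context}\n\n"
            f"This is the next chunk:\n{chunk}\n\n"
            "Summarize the story so far, incorporating the new chunk."
        )
        context = summarise(prompt)  # step 3: fold the chunk into the context
    return context
```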

Gotta think about this. It’s less efficient but might produce better results. In some ways, rather than having GPT read isolated chunks, it gets a certain amount of previous context for each chunk.

Maybe another way is to just take a human summary or description of the book, front-load it, and describe where we are in the book… (“This is a chunk 30% of the way through the book, which is about ‘short-human-description’; write a summary.”)

Have you had success with this approach? Any more experience you can to share? I’m about to try implementing something along these lines.

No, I haven’t; my use case is a little different. There are all sorts of interesting techniques for various situations.

My suggestion would be knock something crude together in Python or Node etc. and give it a try.

There might be a better approach with embeddings, but I’ve hardly explored embeddings at all.

What’s your ultimate goal with this?

There are a few, but they boil down to either creating an overall summary of, or an extraction of “key takeaways” from, bodies of text larger than the token limit. I’m trying to avoid important points or insights from the source text being “averaged out” via successive summarisation passes.

I’d love to get OpenAI to maintain a list of key takeaways for me, so I could give it that context plus a new chunk of text and say “If any of the takeaways from this new chunk of text are already in the list, move them up the list; otherwise add them to the bottom”. Then, before I presented that context for the next new chunk of text, I could trim items from the bottom if I was approaching the token limit. I’m finding it difficult though, because “if/then” instructions don’t work very well at all, and I can’t get OpenAI to repeat the existing context back to me with its answers appended.
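The deterministic half of that – trimming the list from the bottom before each new chunk – is easy enough to do outside the model; it’s the merge step the model keeps fumbling. A sketch of the trimming, assuming tiktoken for the counting:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def trim_takeaways(takeaways: list[str], budget_tokens: int) -> list[str]:
    """Keep items from the top (most important) until the budget is spent."""
    kept, used = [], 0
    for item in takeaways:  # ordered most- to least-important
        n = len(enc.encode(item))
        if used + n > budget_tokens:
            break  # everything below this point falls off the list
        kept.append(item)
        used += n
    return kept
```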

(As an aside, I’m a software engineer, and I find all of this nondeterministic coaxing a very frustrating way of getting computers to do things.)

Yes, the way that “programming” is shifting is definitely different to classical deterministic programming.

I’ve come across another package that may or may not be helpful for you.

https://hwchase17.github.io/langchainjs/docs/modules/agents/agent_toolkits/vectorstore

This is an example from LangChain, which is a toolkit for combining LLMs with other workflows. In this example a large document is fed in, chunked, and then vectorised for querying. You might be able to use this approach to create a global summary of the document (I don’t see why not).

This is also relevant… https://hwchase17.github.io/langchainjs/docs/modules/chains/summarization/
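The linked docs are for the JS package; in LangChain’s Python package the equivalent looks roughly like this (API names match the releases current at the time of writing and may have moved since, so treat this as a sketch and check the docs):

```python
from langchain.llms import OpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document
from langchain.chains.summarize import load_summarize_chain

full_text = "..."  # the large document to summarise

# Split into overlapping chunks and wrap each as a Document.
splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
docs = [Document(page_content=t) for t in splitter.split_text(full_text)]

# "map_reduce" summarises each chunk, then summarises the summaries –
# essentially the batching approach discussed earlier in this thread.
chain = load_summarize_chain(OpenAI(temperature=0), chain_type="map_reduce")
summary = chain.run(docs)
```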

Thank you so much for providing these LangChain links! Exactly what I needed.
I tried to explain a little, in layman’s terms, how embeddings work and how they can be used.
I think summarizing everything up front, before “needing it”, might be expensive overkill, as summarization is significantly more expensive than embeddings.

I am thinking about creating “rolling” embeddings with a 2k-long overlap, so whenever I detect a “long but interesting” document part I can process just that part iteratively. I will test the approach in the next few days.
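Roughly what I have in mind (a sketch using the older `openai.Embedding` endpoint; the windows here are character-based for simplicity, though the 2k figure above is really tokens):

```python
import openai

def rolling_embeddings(text: str, window: int = 4000, overlap: int = 2000):
    """Embed overlapping windows so no document part is seen without context."""
    step = window - overlap
    starts = range(0, max(len(text) - overlap, 1), step)
    windows = [text[i:i + window] for i in starts]
    resp = openai.Embedding.create(
        model="text-embedding-ada-002", input=windows
    )
    return list(zip(windows, (d["embedding"] for d in resp["data"])))
```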

My friend is developing a tool dedicated to this task that works in both server-side and client-side JS.

Any feedback appreciated 🙂

Update: moved here Embedbase Documentation

Thank you so much for providing a solution to this problem, but I want to pass in a list of reviews – say, 10,000 hotel reviews – and generate a summary of the whole list. How can I split the list of reviews?

Hello. I’m working on a solution that combines summarisation and extraction. Basically, I need to make sure that every important piece of information from the call is recorded in the database.

Most calls are under the limit; however, some of them are over 8k tokens.

I’m wondering how small the chunks should be for good summarisation. I expect that the smaller the chunks, the more information is extracted; however, there is also a higher chance of “hallucinations”.

What’s your opinion on that? Which size is optimal when retrieving data from the dialogue is important?