How to summarize a 2000-page PDF?

I want to create a solution that summarizes PDFs.

I understand that the OpenAI embedding model generates vectors for document chunks, which enables similarity searches so we can retrieve the most relevant chunks for responses.

However, sending all of those chunks to the Assistants API is impractical with a 2000-page PDF.

How can I summarize the entire PDF without omitting any important content or losing any topic?

2 Likes

Processing the document directly is not possible once it is larger than 128,000 tokens (the model’s context window).

You could try splitting the document with a semantic chunker. The one I make use of is from LlamaIndex and can chunk documents by semantic meaning.

Usually, these chunks are then vectorised and stored in a vector database for semantic similarity searching. However, you could take those chunks and pass them to an OpenAI GPT model for summarisation, then combine all of the summaries into one document, which is hopefully small enough to process in a single prompt.
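
If it helps, here is a minimal sketch of that flow, assuming the llama-index (plus its OpenAI embedding/reader extras) and openai Python packages. Module paths vary between llama-index versions, and the file name and model name are placeholders:

```python
# A rough sketch of the chunk-then-summarise flow described above.
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Load the PDF and split it into semantically coherent chunks.
docs = SimpleDirectoryReader(input_files=["big_document.pdf"]).load_data()
splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=OpenAIEmbedding(),
)
nodes = splitter.get_nodes_from_documents(docs)

# 2. Summarise each chunk independently.
summaries = []
for node in nodes:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{
            "role": "user",
            "content": "Summarise this excerpt, keeping every distinct topic:\n\n"
                       + node.get_content(),
        }],
    )
    summaries.append(resp.choices[0].message.content)

# 3. Combine the chunk summaries into one much shorter document.
combined = "\n\n".join(summaries)
```

Each chunk is summarised independently, so this scales to very long documents; if the combined file is still too large, you can run the same pass over it again.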

5 Likes

I understand.

Below is what I am doing right now.
I create chunks of 300 tokens from a PDF, generate vectors, and store them in a vector database. When I have a question, I generate a vector for it and perform a cosine similarity search. However, due to the token limit, I’m unable to process the entire PDF. What is the best way to summarize the whole PDF instead of working within the token limit?
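
For reference, here is a sketch of the retrieval step you describe, assuming the openai package plus numpy; the chunk texts, question, embedding model, and top-k value are all placeholders:

```python
# A sketch of the embed-and-retrieve step described above.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

chunks = ["...chunk 1...", "...chunk 2..."]        # your ~300-token chunks
chunk_vecs = embed(chunks)                         # normally stored in a vector DB

question = "What does the warranty section cover?" # placeholder question
q_vec = embed([question])[0]

# Cosine similarity between the question and every chunk, highest first.
scores = chunk_vecs @ q_vec / (
    np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q_vec)
)
top_chunks = [chunks[i] for i in np.argsort(scores)[::-1][:5]]
```

Retrieval like this works well for targeted questions, but it only ever sees the retrieved chunks; summarising the whole document needs the hierarchical approaches discussed in the other replies.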

1 Like

@Foxalabs explained this pretty well already. For a good overview check out LlamaIndex.

Summarise any context that is too long.

Good luck! :hugs:

2 Likes

Obviously, your goal is impossible on its face: “Summarize the Lord of the Rings trilogy, and do not lose any details.”

2000 one page summaries
200 summaries of ten summaries
20 summaries of ten 10-page summaries…
…
…
Result: “Hobbit journey”
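
For what it’s worth, that ladder is just a recursive reduce. A rough sketch, assuming the openai package; the model name, prompt, and group size are illustrative:

```python
# Recursive "summary of summaries": collapse groups of summaries until
# everything fits in one prompt, losing detail at every level.
from openai import OpenAI

client = OpenAI()

def summarize(text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user",
                   "content": "Summarize the following, keeping every topic:\n\n" + text}],
    )
    return resp.choices[0].message.content

def reduce_summaries(pieces: list[str], group_size: int = 10) -> str:
    # e.g. 2000 page summaries -> 200 -> 20 -> 2 -> 1
    while len(pieces) > 1:
        pieces = [
            summarize("\n\n".join(pieces[i:i + group_size]))
            for i in range(0, len(pieces), group_size)
        ]
    return pieces[0]
```

Every pass above the first discards detail, which is exactly the point: by the last level you are down to “Hobbit journey.”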

4 Likes

Great example!

Think of summarising like compressing an image with JPEG: the more you summarise (compress), the more detail gets lost.
In a JPEG, you begin seeing artifacts; the same happens with the data for the AI.

So - summarising can be great when you can’t have an unlimited context window, but too much can make the data unusable. You have to figure out a compromise for yourself. :hugs:

2 Likes

Most of the summarizer tools out there don’t provide full summaries; I have seen people complaining about missing key topics, which is what I was trying to solve. :thought_balloon:

1 Like

It is a technical limitation.
Check this out for more information.
Currently it is impossible to solve your problem as the AI won’t have a sufficient context length.

2 Likes

It is also possible to instill the documents themselves as knowledge through pretraining or fine-tuning, but even then, summary inference is a hard completion task to train for, well beyond producing text from, or in the style of, the writing.

Give o1-preview such a task, just for amusement…

Summary: “Shakespeare’s works reveal the timeless complexities of human nature—love, power, betrayal, fate—illuminating profound truths about the human condition.”

Prompt

You are an Oxford professor with tenure in literature, with a dedication to and your own year-long graduate-level course in the combined works of Shakespeare, with award-winning publications on the subject and are renowned in your original analyses of the Bard’s lifetime works. You are a featured speaker on Shakespeare besides your admiration among colleagues, and yearly, you dedicate yourself to a complete re-reading of the entire works in their original Early Modern English. You have just performed this re-reading from start to finish, comedies to tragedies, also with performances staged at The Globe fresh in mind (by actors also with life-long dedication to portrayals of the works) in preparation for this final simple question answered remarkably based on the concrete evidence in stories, plots, characters, subtext, a statement which will serve as an insightful legacy of your entire lifetime and career of enthusiastic pursuit of understanding of the works of William Shakespeare.

Question: “Summarize the complete works of William Shakespeare, 20 words or less.”

3 Likes

Start by breaking it into chapters, and then possibly smaller sections if needed, and summarize those instead. Recurse if needed to get everything into one 4000-token doc, but you can’t expect zero loss of subjects if you’re aiming this small. There are a lot of modules out there which can help with this, not that it’s super easy. My big challenge has been that PDFs are often full of images which are needed to understand the document, and I haven’t figured out a great way to incorporate that context yet. You can chat with the o1 model about the problem; it has good advice, but it’s not a problem that can be solved right now with a one-size-fits-all answer.

Another approach might be to have it write an outline for pages 1-10, 6-15, and 11-20. Then give it the three outlines and have it give you back separate outlines for pages 1-5, 6-10, 11-15 and 16-20. You’re probably going to throw away the one for pages 16-20, but you should ask for it so that the rest of the output is consistent. Continue this process for the entire doc (16-25, 21-30, 26-35, etc.) and glue all of the 5-page outputs together. At each step, you’ll want to give it enough of the previously-generated outline to figure out what letters and depth level it should be at.
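
Here is a simplified sketch of that sliding-window idea; it skips the re-split into 5-page outlines, and the model name, prompt, and context budget are illustrative. The windows follow the 10-page / 5-page-stride pattern above:

```python
# A simplified take on the overlapping-window outline idea.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def outline_document(pages: list[str], window: int = 10, stride: int = 5) -> str:
    outline = ""
    # Overlapping windows: pages 1-10, 6-15, 11-20, ...
    for start in range(0, len(pages), stride):
        chunk = "\n\n".join(pages[start:start + window])
        outline += ask(
            "Outline so far (keep numbering and depth consistent):\n"
            + outline[-2000:]                     # enough prior outline for context
            + f"\n\nContinue the outline for pages {start + 1}-{start + window}:\n\n"
            + chunk
        ) + "\n"
    return outline
```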

This process should give you a pretty complete hierarchical list of all subjects. I’m sure there will be flaws in the structure it outputs, but it’s a starting point for this as a research project.

Want to try https://www.simantiks.com? That would be a nice challenge for me.

If not him, I’m interested in giving you a shot. I’ve got lots of large, challenging PDFs (mostly technical manuals) I’d love to outsource converting into bite-sized KB articles (or eventually even structured data) if the price and quality are right.

1 Like

The token costs might be quite high to process 2k pages in one file, but I’m willing to try on a smaller number of pages for free, say the first 100 pages (I’ll cut them out of the PDF). Then, if the result and price are OK for you (we’ll know after the first 100), we can have a deal.

The KB can be easily built from the JSON the solution gives back. See the examples on the website.

Smaller files of 1-75 pages go through with no issues. I run legal analysis on a very similar engine with 20-100-page contracts (especially in the franchising domain, where it is usually closer to 100-200 pages than to 20).

Feed it to Perplexity.

Looking to summarize a 2000-page PDF? Why not map key topics to emojis and create an emoji masterpiece! :art::books: (Just kidding… unless you’re into that kind of thing! :joy:)

Instead, leverage Hierarchical Summarization with Dynamic Chunking and a Chain of Density approach. Break the document into manageable sections, summarize each with precision, then apply Chain of Density to condense further using OpenAI’s API. This minimizes the risk of losing crucial info and keeps everything neat.
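
A rough sketch of the Chain of Density step applied to one section, assuming the openai package; the prompt wording is a paraphrase of the idea rather than the original Chain of Density prompt, and the model name is a placeholder:

```python
# A Chain-of-Density-style densification pass over one section summary.
from openai import OpenAI

client = OpenAI()

def densify(section_text: str, rounds: int = 3) -> str:
    summary = ""
    for _ in range(rounds):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{
                "role": "user",
                "content": (
                    "Source text:\n" + section_text
                    + "\n\nPrevious summary:\n" + (summary or "(none yet)")
                    + "\n\nRewrite the summary at roughly the same length, "
                      "adding 2-3 specific entities or topics from the source "
                      "that are missing, without dropping anything already there."
                ),
            }],
        )
        summary = resp.choices[0].message.content
    return summary
```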

1 Like

Local application: LangChain + ChromaDB + OpenAI API
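
A minimal sketch of that stack, assuming the langchain-community, langchain-openai, langchain-text-splitters, chromadb, and pypdf packages; imports move around between LangChain versions, and the file, model, and question are placeholders:

```python
# A minimal local pipeline for this stack.
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load and split the PDF locally.
pages = PyPDFLoader("big_document.pdf").load()          # placeholder file name
splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)
chunks = splitter.split_documents(pages)

# Store chunk embeddings in a local ChromaDB collection.
store = Chroma.from_documents(chunks, OpenAIEmbeddings(), persist_directory="./chroma_db")

# Retrieve relevant chunks for a question and answer over them.
llm = ChatOpenAI(model="gpt-4o-mini")                   # placeholder model name
question = "Give me an overview of chapter 3."          # placeholder question
docs = store.similarity_search(question, k=5)
context = "\n\n".join(d.page_content for d in docs)
answer = llm.invoke("Answer using only this context:\n\n" + context
                    + "\n\nQuestion: " + question)
print(answer.content)
```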